The standard GAM assumes additive effects: $f(\mathbf{x}) = \alpha + \sum_j f_j(x_j)$. This is powerful, but it's also limiting. What if the effect of temperature on plant growth depends on rainfall? What if we have repeated measurements on patients? What if we need to model spatial data?
The GAM framework is remarkably extensible. The same penalized regression machinery that fits additive models can be adapted to handle interactions, random effects, varying coefficients, spatial fields, and more. These extensions preserve interpretability while dramatically expanding what we can model.
By the end of this page, you will understand tensor product smooths for modeling interactions, generalized additive mixed models (GAMMs) for hierarchical data, varying coefficient models, spatial and temporal GAMs, and practical guidance on when to use which extension.
The additive model by definition excludes interactions: the effect of $x_1$ is the same regardless of $x_2$. But many real phenomena involve interactions: for example, the effect of temperature on plant growth may depend on rainfall.
Mathematical formulation of interaction:
An interaction means the partial derivative of the response with respect to $x_1$ changes as $x_2$ varies:
$$\frac{\partial^2 f}{\partial x_1 \partial x_2} \neq 0$$
Additive models have $\frac{\partial^2 f}{\partial x_1 \partial x_2} = 0$ by construction—no interaction by assumption.
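This condition is easy to check numerically. The following sketch (names and test functions are illustrative, not from the source) estimates the mixed partial derivative by central finite differences and contrasts an additive function with an interacting one:

```python
import numpy as np

def mixed_partial(f, x, z, h=1e-4):
    """Central finite-difference estimate of d^2 f / (dx dz)."""
    return (f(x + h, z + h) - f(x + h, z - h)
            - f(x - h, z + h) + f(x - h, z - h)) / (4 * h * h)

additive = lambda x, z: np.sin(x) + np.cos(z)      # purely additive
interacting = lambda x, z: np.sin(x) * np.cos(z)   # genuine interaction

print(abs(mixed_partial(additive, 0.5, 0.5)) < 1e-6)     # mixed partial vanishes
print(abs(mixed_partial(interacting, 0.5, 0.5)) > 0.1)   # mixed partial is nonzero
```

For the additive function the estimate is zero up to floating-point noise; for the product it matches the analytic value $-\cos(0.5)\sin(0.5)$.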
Detecting interactions:
Before adding interaction terms, test whether they're needed:
- Compare models with and without the interaction (AIC, REML score, or a significance test)
- Plot residuals from the additive fit against joint values of the candidate variables
Adding unnecessary interactions wastes degrees of freedom and complicates interpretation. Add them only when justified.
Interaction terms sacrifice the clean 'one plot per feature' interpretation of pure additive models. Before adding interactions, ensure the gain in fit justifies the loss in simplicity. Sometimes a well-fitting additive model is more useful than a slightly better-fitting model with interactions.
Tensor product smooths are the standard way to model smooth interactions in GAMs. They generalize the additive structure to allow 2D (or higher) surfaces.
Construction:
Given univariate basis functions $\{\phi_k(x)\}_{k=1}^{K_x}$ for $x$ and $\{\psi_l(z)\}_{l=1}^{K_z}$ for $z$, the tensor product basis is:
$$\{\phi_k(x) \cdot \psi_l(z)\}_{k=1,\, l=1}^{K_x,\, K_z}$$
This spans functions of two variables:
$$f(x, z) = \sum_k \sum_l \beta_{kl} \phi_k(x) \psi_l(z)$$
The dimension is $K_x \cdot K_z$—the product of marginal dimensions.
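The construction is just a row-wise Kronecker product of the two marginal basis matrices. A minimal numpy sketch (the function name is ours, and random matrices stand in for actual spline basis evaluations):

```python
import numpy as np

def tensor_product_basis(Bx, Bz):
    """Row-wise Kronecker product: each row pairs every phi_k(x_i)
    with every psi_l(z_i), giving K_x * K_z columns."""
    n = Bx.shape[0]
    return np.einsum('ik,il->ikl', Bx, Bz).reshape(n, -1)

rng = np.random.default_rng(0)
n, Kx, Kz = 100, 5, 4
Bx = rng.normal(size=(n, Kx))   # stand-in for marginal basis evaluations in x
Bz = rng.normal(size=(n, Kz))   # stand-in for marginal basis evaluations in z
B = tensor_product_basis(Bx, Bz)
print(B.shape)  # (100, 20): K_x * K_z columns
```

Column $kK_z + l$ holds the product $\phi_k(x_i)\psi_l(z_i)$, so the fitted surface is an ordinary linear model in this expanded basis.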
Tensor product penalty:
The penalty combines marginal penalties to control smoothness in each direction:
$$\lambda_x \int \left( \frac{\partial^2 f}{\partial x^2} \right)^2 dx\,dz + \lambda_z \int \left( \frac{\partial^2 f}{\partial z^2} \right)^2 dx\,dz$$
Two smoothing parameters ($\lambda_x$, $\lambda_z$) allow differential smoothness in each direction—useful when one variable is expected to have a smoother effect than the other.
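In coefficient space these two penalties combine via Kronecker products of the marginal penalty matrices with identity matrices. A sketch of that construction (function name is ours; second-difference penalties stand in for the integrated squared second-derivative penalties):

```python
import numpy as np

def te_penalty(Sx, Sz, lam_x, lam_z):
    """Tensor product penalty: lam_x * (Sx kron I) + lam_z * (I kron Sz),
    for coefficients ordered with the x-index varying slowest."""
    Kx, Kz = Sx.shape[0], Sz.shape[0]
    return (lam_x * np.kron(Sx, np.eye(Kz))
            + lam_z * np.kron(np.eye(Kx), Sz))

# Marginal second-difference penalties as stand-ins
Kx, Kz = 5, 4
Dx = np.diff(np.eye(Kx), n=2, axis=0)
Dz = np.diff(np.eye(Kz), n=2, axis=0)
S = te_penalty(Dx.T @ Dx, Dz.T @ Dz, lam_x=1.0, lam_z=10.0)
print(S.shape)  # (20, 20), matching the K_x * K_z coefficients
```

With `lam_z` ten times `lam_x`, the fit is forced to be much smoother in the $z$ direction than in $x$.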
Notation in model specification:
| Notation | Meaning |
|---|---|
| s(x, z) | Isotropic smooth (same scale in both directions, e.g., thin plate) |
| te(x, z) | Tensor product smooth (different scales, separate smoothness penalties) |
| ti(x, z) | Tensor product interaction (excludes main effects) |
| t2(x, z) | Alternative tensor product parameterization |
Use te(x, z) when x and z are on different scales or differ in expected smoothness. Use s(x, z) for isotropic smooths when both variables are comparable (e.g., latitude and longitude). The te() function is more flexible and generally preferred for interactions.
A full tensor product $f(x, z)$ contains main effects and interactions confounded together. For interpretation, we often want to separate:
$$f(x, z) = f_x(x) + f_z(z) + f_{xz}(x, z)$$
where $f_x$ and $f_z$ are main effects and $f_{xz}$ is the pure interaction.
The ti() function:
The ti() function (tensor interaction) constructs a smooth that is orthogonal to the main effects—it captures only the interaction pattern. Model specification:
y ~ s(x) + s(z) + ti(x, z)
This gives:
- s(x): Marginal effect of $x$, averaged over $z$
- s(z): Marginal effect of $z$, averaged over $x$
- ti(x, z): How the effect of $x$ changes with $z$ (and vice versa)

| Model | Structure | Interpretation |
|---|---|---|
| te(x, z) | $f(x, z)$ (single surface) | Combined effect, hard to decompose |
| s(x) + s(z) | $f_x(x) + f_z(z)$ | Additive main effects, no interaction |
| s(x) + s(z) + ti(x, z) | $f_x(x) + f_z(z) + f_{xz}(x, z)$ | Separated main effects and interaction |
| s(x, by=group) | Different $f_x(x)$ per group | Group-varying effects (see later) |
With the s(x) + s(z) + ti(x, z) specification, the p-value for ti(x, z) directly tests the null hypothesis of no interaction. This allows testing: 'Given main effects, is there significant interaction?'
Generalized Additive Mixed Models combine GAMs with random effects from mixed models. They handle hierarchical/clustered data where observations within groups are correlated.
Motivation:
Clustered data is everywhere: repeated measurements on patients, students nested in schools, sites measured over time. Without accounting for clustering, standard errors are too small and p-values are misleading.
GAMM formulation:
$$y_{ij} = \alpha + \sum_k f_k(x_{ijk}) + b_i + \epsilon_{ij}$$
where:
- $y_{ij}$ is observation $j$ in group $i$
- $b_i \sim N(0, \sigma_b^2)$ is the random effect for group $i$
- $\epsilon_{ij} \sim N(0, \sigma^2)$ is the residual error
Types of random effects in GAMMs:
| Type | Notation | Interpretation |
|---|---|---|
| Random intercept | s(group, bs='re') | Each group has its own baseline |
| Random slope | s(x, group, bs='fs') | Each group has its own smooth of x |
| Random smooth | s(x, by=group, bs='fs') | Each group has its own function |
| Spatial random field | s(lat, lon, bs='gp') | Correlated random effects in space |
Implementation notes:
- bs='re' creates a random effect (Gaussian prior on coefficients)
- bs='fs' is a 'factor smooth' for group-varying curves

In the penalized likelihood framework, smooth terms and random effects are mathematically equivalent—both are Gaussian penalties on coefficients. This unification means GAMMs can be fit with the same machinery as GAMs. The smoothing parameter for a random effect is inversely proportional to that effect's variance ($\lambda = \sigma^2/\sigma_b^2$).
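The equivalence can be seen directly: a random intercept fit is just ridge regression on group indicator columns, with penalty $\lambda = \sigma^2/\sigma_b^2$. A minimal numpy sketch on simulated data (group counts and variances are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n_groups, per_group = 6, 20
sigma_b, sigma = 2.0, 1.0
g = np.repeat(np.arange(n_groups), per_group)
b_true = rng.normal(scale=sigma_b, size=n_groups)
y = 1.0 + b_true[g] + rng.normal(scale=sigma, size=g.size)

# Random intercept as a Gaussian penalty: ridge on group dummies
X = np.column_stack([np.ones(g.size), np.eye(n_groups)[g]])
lam = sigma**2 / sigma_b**2            # smoothing parameter = error var / effect var
P = np.diag([0.0] + [lam] * n_groups)  # intercept left unpenalized
beta = np.linalg.solve(X.T @ X + P, X.T @ y)
# beta[1:] are the estimated group effects, shrunk toward zero
```

The penalty both regularizes the otherwise rank-deficient dummy design and shrinks group effects toward the overall mean, exactly as a mixed model does.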
Varying coefficient models allow the effect of one variable to change smoothly as a function of another. This is a specific type of interaction with an interpretable structure.
Formulation:
$$y = \alpha(z) + \beta(z) \cdot x + \epsilon$$
Both the intercept $\alpha$ and the slope $\beta$ can vary smoothly with $z$. This is equivalent to:
$$y = f_0(z) + f_1(z) \cdot x$$
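In basis form, the varying slope term is built by multiplying each basis column in $z$ by $x$. A minimal sketch (the function name is ours, and a random matrix stands in for an actual spline basis):

```python
import numpy as np

def varying_coef_columns(Bz, x):
    """Design columns representing f1(z) * x: the z-basis evaluated
    at z_i, with each row scaled by x_i."""
    return Bz * x[:, None]

rng = np.random.default_rng(2)
n, K = 50, 6
Bz = rng.normal(size=(n, K))   # stand-in for a spline basis in z
x = rng.normal(size=n)
Bvc = varying_coef_columns(Bz, x)
# Regressing y on [Bz, Bvc] fits f0(z) + f1(z) * x
```

This is why the model stays a penalized linear model: the nonlinearity lives entirely in the basis, and $f_1(z)$ inherits its smoothness penalty from the $z$-basis.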
Use cases:
- Treatment effects that change with age or time
- Regression coefficients that drift over space
- Dose-response relationships that depend on a moderating variable
Implementation:
Using the by= argument:
y ~ s(z) + s(z, by=x)
Here:
- s(z) is the varying intercept $f_0(z)$
- s(z, by=x) is the varying slope $f_1(z) \cdot x$

Interpretation:
The varying coefficient plot is directly interpretable as 'how does the marginal effect of $x$ change with $z$?'
Varying coefficient: assumes the interaction has the form f(z)·x (effect of x scales smoothly with z). Tensor product: fully flexible 2D surface with no structural assumption. Varying coefficients are more interpretable when the structure is appropriate; tensor products are more flexible when it's not.
Spatial data requires special treatment: observations close in space tend to be correlated. GAMs handle spatial structure through smooth functions of coordinates.
Basic spatial smooth:
$$y_i = \alpha + f(\text{lat}_i, \text{lon}_i) + \boldsymbol{x}_i^\top \boldsymbol{\beta} + \epsilon_i$$
The 2D smooth $f(\text{lat}, \text{lon})$ captures spatial variation not explained by covariates.
Implementation options:
| Basis | Code | Properties |
|---|---|---|
| Thin plate | s(lat, lon, bs='tp') | Isotropic smooth, good default |
| Tensor product | te(lat, lon) | Different smoothness by direction |
| Gaussian process | s(lat, lon, bs='gp') | Explicit spatial correlation |
| Soap film | s(lat, lon, bs='so') | Respects region boundaries |
Distinguishing spatial effects:
Spatial variation can arise from:
- unmeasured covariates that vary smoothly in space
- genuine spatial processes (diffusion, dispersal, contagion)
- spatially correlated errors
The spatial smooth captures variation but doesn't distinguish causes. Interpret cautiously.
Computational considerations:
- 2D smooths require many more basis functions than 1D smooths
- Moderate basis sizes (e.g., k=50) keep computation tractable

GAM spatial smooths are convenient but don't replace proper geostatistical methods. They don't estimate variograms, don't handle anisotropy well, and don't give proper kriging predictions. For serious spatial inference, consider dedicated spatial methods.
Time-related effects are natural candidates for smooth modeling: seasonal patterns, long-term trends, and cyclic behavior are often nonlinear but smooth.
Common temporal components:
| Component | Model Term | Interpretation |
|---|---|---|
| Trend | s(time) | Long-term change |
| Seasonality | s(month, bs='cc') | Cyclic annual pattern |
| Day-of-week | s(dow, bs='cc') | Weekly pattern |
| Time-of-day | s(hour, bs='cc') | Daily pattern |
| Trend × Season | te(time, month) | Changing seasonal pattern |
Cyclic smooths:
The bs='cc' option creates a cyclic cubic spline that wraps around: the value and derivatives at the start equal those at the end. Essential for:
- month-of-year and day-of-year effects
- hour-of-day effects
- angular or directional covariates
Autocorrelation:
Time series often have correlated errors: $\epsilon_t$ correlated with $\epsilon_{t-1}$. Options:
- Include lagged responses: s(lag(y)) models autoregression
- Fit with gamm() using an AR(1) correlation structure

Ignoring autocorrelation doesn't bias point estimates, but it understates uncertainty: standard errors are too small and intervals too narrow.
GAMs can perform time series decomposition: y ~ s(time) + s(month, bs='cc') + s(dow, bs='cc') + ... The terms separate trend, seasonal, and other components. Unlike classical decomposition, GAMs handle irregular data and include covariates.
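As a toy illustration of the decomposition idea, here is a least-squares sketch on simulated monthly data. A Fourier pair stands in for the cyclic spline basis (bs='cc') and a linear term stands in for s(time); all names and numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.arange(36.0)                    # three years of monthly data
month = t % 12
y = 0.1 * t + 2.0 * np.sin(2 * np.pi * month / 12) \
    + rng.normal(scale=0.3, size=t.size)

X = np.column_stack([
    np.ones_like(t),                   # intercept
    t,                                 # trend (stand-in for s(time))
    np.sin(2 * np.pi * month / 12),    # seasonal (stand-in for bs='cc')
    np.cos(2 * np.pi * month / 12),
])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
trend = coef[0] + coef[1] * t
seasonal = X[:, 2:] @ coef[2:]         # recovered seasonal component
```

The fitted coefficients recover the simulated trend slope (0.1 per month) and seasonal amplitude (2), and the same design extends naturally with covariate columns.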
Standard GAMs model the conditional mean $E[Y | \mathbf{X}]$. But the mean isn't always the interesting quantity. Quantile GAMs model conditional quantiles, revealing how the entire distribution changes with features.
Why quantiles?
- The spread of the response may change with features (heteroscedasticity)
- The tails (e.g., the 90th percentile) may matter more than the average
- Quantiles are robust to outliers and skewed distributions
Formulation:
For quantile $\tau \in (0, 1)$:
$$Q_\tau(Y | \mathbf{X}) = \alpha_\tau + \sum_j f_{j\tau}(x_j)$$
Each quantile has its own intercept and smooth functions.
Fitting quantile GAMs:
Use the pinball loss (check function) instead of squared error:
$$\rho_\tau(u) = u(\tau - \mathbf{1}_{u < 0})$$
Minimize: $$\sum_i \rho_\tau(y_i - \hat{f}(\mathbf{x}_i)) + \sum_j \lambda_j \int (f_j'')^2 dx$$
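The check function is a one-liner, and its defining property is that the constant minimizing mean pinball loss is the $\tau$-quantile. A numpy sketch (the grid-search demo is ours, for illustration only):

```python
import numpy as np

def pinball(u, tau):
    """Check function rho_tau(u) = u * (tau - 1[u < 0])."""
    u = np.asarray(u, dtype=float)
    return u * (tau - (u < 0))

# Asymmetric weighting: under-predictions cost tau, over-predictions 1 - tau
print(pinball(1.0, 0.9))    # 0.9
print(pinball(-1.0, 0.9))   # 0.1 (up to floating point)

# The constant minimizing mean pinball loss is the empirical tau-quantile
y = np.random.default_rng(4).normal(size=1000)
grid = np.linspace(-3, 3, 601)
best = grid[np.argmin([pinball(y - c, 0.9).mean() for c in grid])]
# best lies within a grid step of np.quantile(y, 0.9)
```

Replacing squared error with this loss in the penalized objective is what turns a mean GAM into a quantile GAM.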
Visualization:
Plot multiple quantile curves (e.g., 10th, 50th, 90th percentiles) on the same axes. The spread between curves shows how variability changes with features.
Packages:
- R: qgam (quantile GAMs), gamlss (distributional regression)
- Python: quantile_forest, custom implementations

GAMLSS (GAMs for Location, Scale, and Shape) extends GAMs to model all parameters of a distribution: not just mean, but also variance, skewness, and kurtosis as smooth functions of features. This is the ultimate extension for understanding how the entire conditional distribution changes.
When features are functions rather than scalars (e.g., trajectories, spectra, curves), we need functional data analysis methods. GAMs extend naturally to this setting.
Scalar-on-function regression:
$$y = \alpha + \int \beta(t) X(t) \, dt + \epsilon$$
Here:
- $X(t)$ is a functional predictor observed over $t$ (e.g., a curve or spectrum)
- $\beta(t)$ is a smooth coefficient function to be estimated
The integral sums the effect of the functional predictor, weighted by $\beta(t)$.
Implementation via GAMs:
Represent $\beta(t)$ using a smooth basis: $$\beta(t) = \sum_k \gamma_k \phi_k(t)$$
The integral becomes a linear combination: $$\int \beta(t) X(t) \, dt \approx \sum_k \gamma_k \int \phi_k(t) X(t) \, dt = \boldsymbol{\gamma}^\top \mathbf{z}$$
where $z_k = \int \phi_k(t) X(t) \, dt$ are pre-computed.
This reduces functional regression to standard penalized regression!
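A numpy sketch of the reduction, using a polynomial basis as a stand-in for splines and trapezoid weights for the integral (all names are illustrative):

```python
import numpy as np

t = np.linspace(0, 1, 101)
K = 5
Phi = np.vander(t, K, increasing=True)   # stand-in basis: 1, t, ..., t^4

# Trapezoid quadrature weights on the uniform grid
w = np.full(t.size, t[1] - t[0])
w[0] *= 0.5
w[-1] *= 0.5

rng = np.random.default_rng(5)
X_curves = rng.normal(size=(20, t.size))  # 20 sampled predictor curves

# z_k = integral of phi_k(t) * X(t) dt for every curve, all at once
Z = (X_curves * w) @ Phi
print(Z.shape)  # (20, 5): an ordinary design matrix
```

Once `Z` is computed, fitting $\boldsymbol{\gamma}$ with a smoothness penalty on $\beta(t)$ is just another penalized regression.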
Applications:
The R package 'refund' (Regression with Functional Data) integrates seamlessly with mgcv. It provides specialized terms like pfr() for scalar-on-function regression. The mathematical machinery is all penalized GAMs.
With so many extensions available, choosing the right one requires matching the extension to the data structure and scientific question.
| Data Characteristic | Recommended Extension | Key Functions |
|---|---|---|
| Suspected interactions | Tensor products or varying coefficients | te(), ti(), s(..., by=...) |
| Clustered/hierarchical data | GAMM with random effects | bs='re', gamm() |
| Longitudinal data | GAMM with subject random effects | s(time) + s(id, bs='re') |
| Spatial coordinates | 2D spatial smooth | s(lat, lon), te(lat, lon) |
| Time series | Cyclic smooths, AR errors | bs='cc', gamm(..., cor=) |
| Non-Gaussian response | GAM with appropriate family | family=binomial/poisson/... |
| Interest in quantiles | Quantile GAM | qgam::qgam() |
| Functional predictors | Scalar-on-function | refund::pfr() |
Decision framework:
- Start simple: fit the additive model y ~ s(x1) + s(x2) + ... first
- Add extensions only when the data structure or diagnostics demand them

The parsimony principle:
More complex models are harder to interpret, slower to fit, and more prone to overfitting. Add extensions only when clearly justified by data and theory.
Every extension adds parameters and computation. A model with 5 tensor products and random effects may be statistically valid but practically opaque. Balance statistical fit against interpretability and stakeholder needs.
Practical GAM modeling involves many small decisions. Here are battle-tested recommendations:
- Basis choice: the default bs='tp' is reasonable. Switch to specific bases only when justified.

mgcv handles all these extensions within a unified framework. Learn mgcv deeply: it's the most comprehensive GAM implementation available, actively maintained, and used by researchers worldwide. The ?mgcv documentation is extensive and authoritative.
We have explored the rich landscape of GAM extensions, each addressing specific data structures while preserving the interpretable nature of additive modeling.
Module complete:
You have now completed the comprehensive exploration of Generalized Additive Models. From the foundational additive structure through component functions, fitting algorithms, interpretation, and extensions, you possess the knowledge to apply GAMs to real-world problems with confidence and sophistication.
GAMs occupy a unique position in the modeling landscape: more flexible than linear models, more interpretable than black-box methods, and computationally tractable for large datasets. They are an essential tool in the modern data scientist's arsenal.
Congratulations! You have mastered Generalized Additive Models, from their mathematical foundations to practical implementation. You can now model complex nonlinear relationships while maintaining interpretability, handle hierarchical data, incorporate interactions, and make well-calibrated predictions with uncertainty quantification.