The standard GAM assumes additive effects: $f(\mathbf{x}) = \alpha + \sum_j f_j(x_j)$. This is powerful, but it's also limiting. What if the effect of temperature on plant growth depends on rainfall? What if we have repeated measurements on patients? What if we need to model spatial data?
The GAM framework is remarkably extensible. The same penalized regression machinery that fits additive models can be adapted to handle interactions, random effects, varying coefficients, spatial fields, and more. These extensions preserve interpretability while dramatically expanding what we can model.
By the end of this page, you will understand tensor product smooths for modeling interactions, generalized additive mixed models (GAMMs) for hierarchical data, varying coefficient models, spatial and temporal GAMs, and practical guidance on when to use which extension.
The additive model by definition excludes interactions: the effect of $x_1$ is the same regardless of $x_2$. But many real phenomena involve interactions: for example, the effect of temperature on plant growth may depend on rainfall.
Mathematical formulation of interaction:
An interaction means the partial derivative of the response with respect to $x_1$ changes as $x_2$ varies:
$$\frac{\partial^2 f}{\partial x_1 \partial x_2} \neq 0$$
Additive models have $\frac{\partial^2 f}{\partial x_1 \partial x_2} = 0$ by construction—no interaction by assumption.
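This condition is easy to check numerically. The following sketch (names and test functions are illustrative, not from the source) estimates the mixed partial derivative by central finite differences and contrasts an additive function with an interacting one:

```python
import numpy as np

def mixed_partial(f, x, z, h=1e-4):
    """Central finite-difference estimate of d^2 f / (dx dz)."""
    return (f(x + h, z + h) - f(x + h, z - h)
            - f(x - h, z + h) + f(x - h, z - h)) / (4 * h * h)

additive = lambda x, z: np.sin(x) + np.cos(z)      # purely additive
interacting = lambda x, z: np.sin(x) * np.cos(z)   # genuine interaction

print(abs(mixed_partial(additive, 0.5, 0.5)) < 1e-6)     # mixed partial vanishes
print(abs(mixed_partial(interacting, 0.5, 0.5)) > 0.1)   # mixed partial is nonzero
```

For the additive function the estimate is zero up to floating-point noise; for the product it matches the analytic value $-\cos(0.5)\sin(0.5)$.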
Detecting interactions:
Before adding interaction terms, test whether they're needed:
- Compare models with and without the interaction (AIC, REML score, or a significance test)
- Plot residuals from the additive fit against joint values of the candidate variables
Adding unnecessary interactions wastes degrees of freedom and complicates interpretation. Add them only when justified.
Interaction terms sacrifice the clean 'one plot per feature' interpretation of pure additive models. Before adding interactions, ensure the gain in fit justifies the loss in simplicity. Sometimes a well-fitting additive model is more useful than a slightly better-fitting model with interactions.
Tensor product smooths are the standard way to model smooth interactions in GAMs. They generalize the additive structure to allow 2D (or higher) surfaces.
Construction:
Given univariate basis functions $\{\phi_k(x)\}_{k=1}^{K_x}$ for $x$ and $\{\psi_l(z)\}_{l=1}^{K_z}$ for $z$, the tensor product basis is:
$$\{\phi_k(x) \cdot \psi_l(z)\}_{k=1,\, l=1}^{K_x,\, K_z}$$
This spans functions of two variables:
$$f(x, z) = \sum_k \sum_l \beta_{kl} \phi_k(x) \psi_l(z)$$
The dimension is $K_x \cdot K_z$—the product of marginal dimensions.
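The construction is just a row-wise Kronecker product of the two marginal basis matrices. A minimal numpy sketch (the function name is ours, and random matrices stand in for actual spline basis evaluations):

```python
import numpy as np

def tensor_product_basis(Bx, Bz):
    """Row-wise Kronecker product: each row pairs every phi_k(x_i)
    with every psi_l(z_i), giving K_x * K_z columns."""
    n = Bx.shape[0]
    return np.einsum('ik,il->ikl', Bx, Bz).reshape(n, -1)

rng = np.random.default_rng(0)
n, Kx, Kz = 100, 5, 4
Bx = rng.normal(size=(n, Kx))   # stand-in for marginal basis evaluations in x
Bz = rng.normal(size=(n, Kz))   # stand-in for marginal basis evaluations in z
B = tensor_product_basis(Bx, Bz)
print(B.shape)  # (100, 20): K_x * K_z columns
```

Column $kK_z + l$ holds the product $\phi_k(x_i)\psi_l(z_i)$, so the fitted surface is an ordinary linear model in this expanded basis.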
Tensor product penalty:
The penalty combines marginal penalties to control smoothness in each direction:
$$\lambda_x \int \left( \frac{\partial^2 f}{\partial x^2} \right)^2 dx\,dz + \lambda_z \int \left( \frac{\partial^2 f}{\partial z^2} \right)^2 dx\,dz$$
Two smoothing parameters ($\lambda_x$, $\lambda_z$) allow differential smoothness in each direction—useful when one variable is expected to have a smoother effect than the other.
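In coefficient space these two penalties combine via Kronecker products of the marginal penalty matrices with identity matrices. A sketch of that construction (function name is ours; second-difference penalties stand in for the integrated squared second-derivative penalties):

```python
import numpy as np

def te_penalty(Sx, Sz, lam_x, lam_z):
    """Tensor product penalty: lam_x * (Sx kron I) + lam_z * (I kron Sz),
    for coefficients ordered with the x-index varying slowest."""
    Kx, Kz = Sx.shape[0], Sz.shape[0]
    return (lam_x * np.kron(Sx, np.eye(Kz))
            + lam_z * np.kron(np.eye(Kx), Sz))

# Marginal second-difference penalties as stand-ins
Kx, Kz = 5, 4
Dx = np.diff(np.eye(Kx), n=2, axis=0)
Dz = np.diff(np.eye(Kz), n=2, axis=0)
S = te_penalty(Dx.T @ Dx, Dz.T @ Dz, lam_x=1.0, lam_z=10.0)
print(S.shape)  # (20, 20), matching the K_x * K_z coefficients
```

With `lam_z` ten times `lam_x`, the fit is forced to be much smoother in the $z$ direction than in $x$.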
Notation in model specification:
| Notation | Meaning |
|---|---|
| s(x, z) | Isotropic smooth (same scale in both directions, e.g., thin plate) |
| te(x, z) | Tensor product smooth (different scales, separate smoothness penalties) |
| ti(x, z) | Tensor product interaction (excludes main effects) |
| t2(x, z) | Alternative tensor product parameterization |
Use te(x, z) when x and z are on different scales or differ in expected smoothness. Use s(x, z) for isotropic smooths when both variables are comparable (e.g., latitude and longitude). The te() function is more flexible and generally preferred for interactions.
A full tensor product $f(x, z)$ contains main effects and interactions confounded together. For interpretation, we often want to separate:
$$f(x, z) = f_x(x) + f_z(z) + f_{xz}(x, z)$$
where $f_x$ and $f_z$ are main effects and $f_{xz}$ is the pure interaction.
The ti() function:
The ti() function (tensor interaction) constructs a smooth that is orthogonal to the main effects—it captures only the interaction pattern. Model specification:
y ~ s(x) + s(z) + ti(x, z)
This gives:
- s(x): Marginal effect of $x$, averaged over $z$
- s(z): Marginal effect of $z$, averaged over $x$
- ti(x, z): How the effect of $x$ changes with $z$ (and vice versa)

| Model | Structure | Interpretation |
|---|---|---|
| te(x, z) | $f(x, z)$ (single surface) | Combined effect, hard to decompose |
| s(x) + s(z) | $f_x(x) + f_z(z)$ | Additive main effects, no interaction |
| s(x) + s(z) + ti(x, z) | $f_x(x) + f_z(z) + f_{xz}(x, z)$ | Separated main effects and interaction |
| s(x, by=group) | Different $f_x(x)$ per group | Group-varying effects (see later) |
With the s(x) + s(z) + ti(x, z) specification, the p-value for ti(x, z) directly tests the null hypothesis of no interaction. This allows testing: 'Given main effects, is there significant interaction?'
Generalized Additive Mixed Models combine GAMs with random effects from mixed models. They handle hierarchical/clustered data where observations within groups are correlated.
Motivation:
Clustered data is everywhere: repeated measurements on patients, students nested in schools, sites measured over time. Without accounting for clustering, standard errors are too small and p-values are misleading.
GAMM formulation:
$$y_{ij} = \alpha + \sum_k f_k(x_{ijk}) + b_i + \epsilon_{ij}$$
where:
- $y_{ij}$ is observation $j$ in group $i$
- $b_i \sim N(0, \sigma_b^2)$ is the random effect for group $i$
- $\epsilon_{ij} \sim N(0, \sigma^2)$ is the residual error
Types of random effects in GAMMs:
| Type | Notation | Interpretation |
|---|---|---|
| Random intercept | s(group, bs='re') | Each group has its own baseline |
| Random slope | s(x, group, bs='fs') | Each group has its own smooth of x |
| Random smooth | s(x, by=group, bs='fs') | Each group has its own function |
| Spatial random field | s(lat, lon, bs='gp') | Correlated random effects in space |
Implementation notes:
- bs='re' creates a random effect (Gaussian prior on coefficients)
- bs='fs' is a 'factor smooth' for group-varying curves

In the penalized likelihood framework, smooth terms and random effects are mathematically equivalent—both are Gaussian penalties on coefficients. This unification means GAMMs can be fit with the same machinery as GAMs. The smoothing parameter for a random effect is inversely proportional to that effect's variance ($\lambda = \sigma^2/\sigma_b^2$).
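The equivalence can be seen directly: a random intercept fit is just ridge regression on group indicator columns, with penalty $\lambda = \sigma^2/\sigma_b^2$. A minimal numpy sketch on simulated data (group counts and variances are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n_groups, per_group = 6, 20
sigma_b, sigma = 2.0, 1.0
g = np.repeat(np.arange(n_groups), per_group)
b_true = rng.normal(scale=sigma_b, size=n_groups)
y = 1.0 + b_true[g] + rng.normal(scale=sigma, size=g.size)

# Random intercept as a Gaussian penalty: ridge on group dummies
X = np.column_stack([np.ones(g.size), np.eye(n_groups)[g]])
lam = sigma**2 / sigma_b**2            # smoothing parameter = error var / effect var
P = np.diag([0.0] + [lam] * n_groups)  # intercept left unpenalized
beta = np.linalg.solve(X.T @ X + P, X.T @ y)
# beta[1:] are the estimated group effects, shrunk toward zero
```

The penalty both regularizes the otherwise rank-deficient dummy design and shrinks group effects toward the overall mean, exactly as a mixed model does.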
Varying coefficient models allow the effect of one variable to change smoothly as a function of another. This is a specific type of interaction with an interpretable structure.
Formulation:
$$y = \alpha(z) + \beta(z) \cdot x + \epsilon$$
Both the intercept $\alpha$ and the slope $\beta$ can vary smoothly with $z$. This is equivalent to:
$$y = f_0(z) + f_1(z) \cdot x$$
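In basis form, the varying slope term is built by multiplying each basis column in $z$ by $x$. A minimal sketch (the function name is ours, and a random matrix stands in for an actual spline basis):

```python
import numpy as np

def varying_coef_columns(Bz, x):
    """Design columns representing f1(z) * x: the z-basis evaluated
    at z_i, with each row scaled by x_i."""
    return Bz * x[:, None]

rng = np.random.default_rng(2)
n, K = 50, 6
Bz = rng.normal(size=(n, K))   # stand-in for a spline basis in z
x = rng.normal(size=n)
Bvc = varying_coef_columns(Bz, x)
# Regressing y on [Bz, Bvc] fits f0(z) + f1(z) * x
```

This is why the model stays a penalized linear model: the nonlinearity lives entirely in the basis, and $f_1(z)$ inherits its smoothness penalty from the $z$-basis.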
Use cases:
- Treatment effects that change with age or time
- Regression coefficients that drift over space
- Dose-response relationships that depend on a moderating variable
Implementation:
Using the by= argument:
y ~ s(z) + s(z, by=x)
Here:
- s(z) is the varying intercept $f_0(z)$
- s(z, by=x) is the varying slope $f_1(z) \cdot x$

Interpretation:
The varying coefficient plot is directly interpretable as 'how does the marginal effect of $x$ change with $z$?'
Varying coefficient: assumes the interaction has the form f(z)·x (effect of x scales smoothly with z). Tensor product: fully flexible 2D surface with no structural assumption. Varying coefficients are more interpretable when the structure is appropriate; tensor products are more flexible when it's not.
Spatial data requires special treatment: observations close in space tend to be correlated. GAMs handle spatial structure through smooth functions of coordinates.
Basic spatial smooth:
$$y_i = \alpha + f(\text{lat}_i, \text{lon}_i) + \boldsymbol{x}_i^\top \boldsymbol{\beta} + \epsilon_i$$
The 2D smooth $f(\text{lat}, \text{lon})$ captures spatial variation not explained by covariates.
Implementation options:
| Basis | Code | Properties |
|---|---|---|
| Thin plate | s(lat, lon, bs='tp') | Isotropic smooth, good default |
| Tensor product | te(lat, lon) | Different smoothness by direction |
| Gaussian process | s(lat, lon, bs='gp') | Explicit spatial correlation |
| Soap film | s(lat, lon, bs='so') | Respects region boundaries |
Distinguishing spatial effects:
Spatial variation can arise from:
- unmeasured covariates that vary smoothly in space
- genuine spatial processes (diffusion, dispersal, contagion)
- spatially correlated errors
The spatial smooth captures variation but doesn't distinguish causes. Interpret cautiously.
Computational considerations:
- 2D smooths require many more basis functions than 1D smooths
- Moderate basis sizes (e.g., k=50) keep computation tractable

GAM spatial smooths are convenient but don't replace proper geostatistical methods. They don't estimate variograms, don't handle anisotropy well, and don't give proper kriging predictions. For serious spatial inference, consider dedicated spatial methods.
Time-related effects are natural candidates for smooth modeling: seasonal patterns, long-term trends, and cyclic behavior are often nonlinear but smooth.
Common temporal components:
| Component | Model Term | Interpretation |
|---|---|---|
| Trend | s(time) | Long-term change |
| Seasonality | s(month, bs='cc') | Cyclic annual pattern |
| Day-of-week | s(dow, bs='cc') | Weekly pattern |
| Time-of-day | s(hour, bs='cc') | Daily pattern |
| Trend × Season | te(time, month) | Changing seasonal pattern |
Cyclic smooths:
The bs='cc' option creates a cyclic cubic spline that wraps around: the value and derivatives at the start equal those at the end. Essential for:
- month-of-year and day-of-year effects
- hour-of-day effects
- angular or directional covariates
Autocorrelation:
Time series often have correlated errors: $\epsilon_t$ correlated with $\epsilon_{t-1}$. Options:
- Include lagged responses: s(lag(y)) models autoregression
- Fit with gamm() using an AR(1) correlation structure

Ignoring autocorrelation doesn't bias point estimates, but it understates uncertainty: standard errors are too small and intervals too narrow.
GAMs can perform time series decomposition: y ~ s(time) + s(month, bs='cc') + s(dow, bs='cc') + ... The terms separate trend, seasonal, and other components. Unlike classical decomposition, GAMs handle irregular data and include covariates.
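As a toy illustration of the decomposition idea, here is a least-squares sketch on simulated monthly data. A Fourier pair stands in for the cyclic spline basis (bs='cc') and a linear term stands in for s(time); all names and numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.arange(36.0)                    # three years of monthly data
month = t % 12
y = 0.1 * t + 2.0 * np.sin(2 * np.pi * month / 12) \
    + rng.normal(scale=0.3, size=t.size)

X = np.column_stack([
    np.ones_like(t),                   # intercept
    t,                                 # trend (stand-in for s(time))
    np.sin(2 * np.pi * month / 12),    # seasonal (stand-in for bs='cc')
    np.cos(2 * np.pi * month / 12),
])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
trend = coef[0] + coef[1] * t
seasonal = X[:, 2:] @ coef[2:]         # recovered seasonal component
```

The fitted coefficients recover the simulated trend slope (0.1 per month) and seasonal amplitude (2), and the same design extends naturally with covariate columns.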
Standard GAMs model the conditional mean $E[Y | \mathbf{X}]$. But the mean isn't always the interesting quantity. Quantile GAMs model conditional quantiles, revealing how the entire distribution changes with features.
Why quantiles?
- The spread of the response may change with features (heteroscedasticity)
- The tails (e.g., the 90th percentile) may matter more than the average
- Quantiles are robust to outliers and skewed distributions
Formulation:
For quantile $\tau \in (0, 1)$:
$$Q_\tau(Y | \mathbf{X}) = \alpha_\tau + \sum_j f_{j\tau}(x_j)$$
Each quantile has its own intercept and smooth functions.
Fitting quantile GAMs:
Use the pinball loss (check function) instead of squared error:
$$\rho_\tau(u) = u(\tau - \mathbf{1}_{u < 0})$$
Minimize: $$\sum_i \rho_\tau(y_i - \hat{f}(\mathbf{x}_i)) + \sum_j \lambda_j \int (f_j'')^2 dx$$
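The check function is a one-liner, and its defining property is that the constant minimizing mean pinball loss is the $\tau$-quantile. A numpy sketch (the grid-search demo is ours, for illustration only):

```python
import numpy as np

def pinball(u, tau):
    """Check function rho_tau(u) = u * (tau - 1[u < 0])."""
    u = np.asarray(u, dtype=float)
    return u * (tau - (u < 0))

# Asymmetric weighting: under-predictions cost tau, over-predictions 1 - tau
print(pinball(1.0, 0.9))    # 0.9
print(pinball(-1.0, 0.9))   # 0.1 (up to floating point)

# The constant minimizing mean pinball loss is the empirical tau-quantile
y = np.random.default_rng(4).normal(size=1000)
grid = np.linspace(-3, 3, 601)
best = grid[np.argmin([pinball(y - c, 0.9).mean() for c in grid])]
# best lies within a grid step of np.quantile(y, 0.9)
```

Replacing squared error with this loss in the penalized objective is what turns a mean GAM into a quantile GAM.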
Visualization:
Plot multiple quantile curves (e.g., 10th, 50th, 90th percentiles) on the same axes. The spread between curves shows how variability changes with features.
Packages:
- R: qgam (quantile GAMs), gamlss (distributional regression)
- Python: quantile_forest, custom implementations

GAMLSS (GAMs for Location, Scale, and Shape) extends GAMs to model all parameters of a distribution: not just mean, but also variance, skewness, and kurtosis as smooth functions of features. This is the ultimate extension for understanding how the entire conditional distribution changes.
When features are functions rather than scalars (e.g., trajectories, spectra, curves), we need functional data analysis methods. GAMs extend naturally to this setting.
Scalar-on-function regression:
$$y = \alpha + \int \beta(t) X(t) \, dt + \epsilon$$
Here:
- $X(t)$ is a functional predictor observed over $t$ (e.g., a curve or spectrum)
- $\beta(t)$ is a smooth coefficient function to be estimated
The integral sums the effect of the functional predictor, weighted by $\beta(t)$.
Implementation via GAMs:
Represent $\beta(t)$ using a smooth basis: $$\beta(t) = \sum_k \gamma_k \phi_k(t)$$
The integral becomes a linear combination: $$\int \beta(t) X(t) \, dt \approx \sum_k \gamma_k \int \phi_k(t) X(t) \, dt = \boldsymbol{\gamma}^\top \mathbf{z}$$
where $z_k = \int \phi_k(t) X(t) \, dt$ are pre-computed.
This reduces functional regression to standard penalized regression!
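A numpy sketch of the reduction, using a polynomial basis as a stand-in for splines and trapezoid weights for the integral (all names are illustrative):

```python
import numpy as np

t = np.linspace(0, 1, 101)
K = 5
Phi = np.vander(t, K, increasing=True)   # stand-in basis: 1, t, ..., t^4

# Trapezoid quadrature weights on the uniform grid
w = np.full(t.size, t[1] - t[0])
w[0] *= 0.5
w[-1] *= 0.5

rng = np.random.default_rng(5)
X_curves = rng.normal(size=(20, t.size))  # 20 sampled predictor curves

# z_k = integral of phi_k(t) * X(t) dt for every curve, all at once
Z = (X_curves * w) @ Phi
print(Z.shape)  # (20, 5): an ordinary design matrix
```

Once `Z` is computed, fitting $\boldsymbol{\gamma}$ with a smoothness penalty on $\beta(t)$ is just another penalized regression.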
Applications:
The R package 'refund' (Regression with Functional Data) integrates seamlessly with mgcv. It provides specialized terms like pfr() for scalar-on-function regression. The mathematical machinery is all penalized GAMs.
With so many extensions available, choosing the right one requires matching the extension to the data structure and scientific question.
| Data Characteristic | Recommended Extension | Key Functions |
|---|---|---|
| Suspected interactions | Tensor products or varying coefficients | te(), ti(), s(..., by=...) |
| Clustered/hierarchical data | GAMM with random effects | bs='re', gamm() |
| Longitudinal data | GAMM with subject random effects | s(time) + s(id, bs='re') |
| Spatial coordinates | 2D spatial smooth | s(lat, lon), te(lat, lon) |
| Time series | Cyclic smooths, AR errors | bs='cc', gamm(..., cor=) |
| Non-Gaussian response | GAM with appropriate family | family=binomial/poisson/... |
| Interest in quantiles | Quantile GAM | qgam::qgam() |
| Functional predictors | Scalar-on-function | refund::pfr() |
Decision framework:
- Start simple: fit the additive model y ~ s(x1) + s(x2) + ... first
- Add extensions only when the data structure or diagnostics demand them

The parsimony principle:
More complex models are harder to interpret, slower to fit, and more prone to overfitting. Add extensions only when clearly justified by data and theory.
Every extension adds parameters and computation. A model with 5 tensor products and random effects may be statistically valid but practically opaque. Balance statistical fit against interpretability and stakeholder needs.
Practical GAM modeling involves many small decisions. Here are battle-tested recommendations:
- Basis choice: the default bs='tp' is reasonable. Switch to specific bases only when justified.

mgcv handles all these extensions within a unified framework. Learn mgcv deeply: it's the most comprehensive GAM implementation available, actively maintained, and used by researchers worldwide. The ?mgcv documentation is extensive and authoritative.
We have explored the rich landscape of GAM extensions, each addressing specific data structures while preserving the interpretable nature of additive modeling.
Module complete:
You have now completed the comprehensive exploration of Generalized Additive Models. From the foundational additive structure through component functions, fitting algorithms, interpretation, and extensions, you possess the knowledge to apply GAMs to real-world problems with confidence and sophistication.
GAMs occupy a unique position in the modeling landscape: more flexible than linear models, more interpretable than black-box methods, and computationally tractable for large datasets. They are an essential tool in the modern data scientist's arsenal.
Congratulations! You have mastered Generalized Additive Models, from their mathematical foundations to practical implementation. You can now model complex nonlinear relationships while maintaining interpretability, handle hierarchical data, incorporate interactions, and make well-calibrated predictions with uncertainty quantification.