A Gaussian Process is completely specified by two functions: the mean function $m(\mathbf{x})$ and the covariance function (kernel) $k(\mathbf{x}, \mathbf{x}')$. These two functions encode all our prior beliefs about the function we're trying to learn.
The mean function tells us what function values we expect a priori. The covariance function tells us how function values at different inputs are related—how correlated they are, how quickly correlations decay with distance, what patterns of variation we expect.
The art of GP modeling is choosing $m$ and $k$ to reflect genuine prior knowledge about your problem. This page provides the vocabulary and toolkit to make those choices wisely.
By the end of this page, you will:
- understand common mean function choices and when to use them,
- master the major kernel families (RBF, Matérn, periodic, linear, and more),
- learn how to combine kernels via addition and multiplication to create sophisticated priors, and
- develop intuition for how kernel hyperparameters affect function samples.
The mean function $m: \mathcal{X} \to \mathbb{R}$ specifies the expected value of the function at every input:
$$m(\mathbf{x}) = \mathbb{E}[f(\mathbf{x})]$$
It represents the 'center' of our prior distribution in function space—the single function around which samples are distributed.
Common Mean Function Choices:
1. Zero Mean: $m(\mathbf{x}) = 0$
The most common choice. After observing data, the posterior mean becomes non-zero where data informs us. Far from data, predictions revert to zero.
2. Constant Mean: $m(\mathbf{x}) = \mu_0$
Useful when you expect the function to fluctuate around a known level. The constant $\mu_0$ can be a hyperparameter learned from data.
3. Linear Mean: $m(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b$
Captures an expected linear trend. The GP then models deviations from this trend. Common in time series with known drift.
4. Polynomial Mean: $m(\mathbf{x}) = \sum_{j=0}^{p} w_j x^j$
Generalizes the linear mean to capture polynomial trends. Be careful: a high-degree polynomial mean can overfit, leaving the GP to model only small residuals.
5. Basis Function Mean: $m(\mathbf{x}) = \sum_{j=1}^{J} \beta_j h_j(\mathbf{x})$
Arbitrary basis functions $h_j$ encode domain knowledge. Examples: Fourier bases for known periodicities, physics-based functions for scientific applications.
6. Parametric Model Mean: $m(\mathbf{x}) = g(\mathbf{x}; \boldsymbol{\theta})$
Any parametric model can serve as the mean. The GP captures what the parametric model misses—a powerful 'residual modeling' approach.
Start with zero mean. It's simple, requires no extra hyperparameters, and works well after centering data. Add a non-zero mean only when: (1) you have genuine prior knowledge of trends, (2) extrapolation behavior matters and you want specific far-field behavior, or (3) the posterior fit is poor with zero mean. The kernel typically matters more than the mean.
| Type | Formula | Hyperparameters | When to Use |
|---|---|---|---|
| Zero | $0$ | None | Default; after centering data |
| Constant | $\mu_0$ | $\mu_0$ | Known baseline level |
| Linear | $\mathbf{w}^\top \mathbf{x} + b$ | $\mathbf{w}, b$ | Known linear trend |
| Polynomial | $\sum_j w_j x^j$ | $\{w_j\}$ | Known polynomial trend |
| Physics-based | $g(\mathbf{x}; \boldsymbol{\theta})$ | Model-specific | Physical laws governing system |
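To make the role of the mean concrete, here is a minimal sketch (values for the slope and intercept are illustrative) that draws a prior sample under a zero mean and under a linear mean, reusing the same correlated noise so only the mean differs:

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0, variance=1.0):
    """Squared Exponential (RBF) kernel."""
    sqdist = np.subtract.outer(x1, x2) ** 2
    return variance * np.exp(-0.5 * sqdist / length_scale**2)

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)

# Two prior means: zero, and an illustrative linear trend m(x) = 0.5x - 1
m_zero = np.zeros_like(x)
m_linear = 0.5 * x - 1.0

# Same covariance (and hence same wiggles) for both priors
K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))
L = np.linalg.cholesky(K)
z = rng.standard_normal(len(x))

# A prior sample is mean + correlated noise; only the mean differs
sample_zero = m_zero + L @ z
sample_linear = m_linear + L @ z
print(np.allclose(sample_linear - sample_zero, m_linear))  # True
```

The mean simply shifts where samples are centered; the kernel still controls all the variation around it.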
The covariance function (kernel) $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is the most important component of a GP. It determines the smoothness of sample functions, the distance over which correlations decay, the overall variance of function values, and the structural patterns (trends, periodicity, interactions) the functions exhibit.
Formal Requirements: A valid covariance function must be symmetric, $k(\mathbf{x}, \mathbf{x}') = k(\mathbf{x}', \mathbf{x})$, and must produce positive semi-definite Gram matrices for any set of inputs:
$$\mathbf{K}_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$$
with $\mathbf{K} \succeq 0$ (all eigenvalues non-negative).
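The PSD requirement is easy to check numerically. The sketch below (the `bad` "kernel" is a deliberately invalid example) builds Gram matrices and inspects their eigenvalues:

```python
import numpy as np

def gram(kernel, xs):
    """Build the Gram matrix K_ij = k(x_i, x_j)."""
    return np.array([[kernel(a, b) for b in xs] for a in xs])

def is_psd(K, tol=1e-8):
    """All eigenvalues non-negative, up to numerical tolerance."""
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

xs = np.linspace(0, 5, 30)

rbf = lambda a, b: np.exp(-0.5 * (a - b) ** 2)  # a valid kernel
bad = lambda a, b: -((a - b) ** 2)              # NOT a valid kernel

print(is_psd(gram(rbf, xs)))  # True: RBF always yields PSD Gram matrices
print(is_psd(gram(bad, xs)))  # False: negative eigenvalues appear
```

Note the tolerance: even valid kernels can produce tiny negative eigenvalues in floating point, which is why practical GP code adds a small "jitter" to the diagonal.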
Every kernel choice is a statement about what functions you consider plausible before seeing data. Choosing an RBF kernel says: 'I believe the true function is very smooth.' Choosing Matérn-1/2 says: 'I expect rougher, less predictable behavior.' Choosing a periodic kernel says: 'I expect repeating patterns.' Make these choices consciously.
The Squared Exponential (SE) kernel, also called the Radial Basis Function (RBF) or Gaussian kernel, is the most widely used kernel in GP applications.
Definition (1D): $$k(x, x') = \sigma_f^2 \exp\left(-\frac{(x - x')^2}{2\ell^2}\right)$$
Definition (Multi-dimensional): $$k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\ell^2}\right)$$
Hyperparameters:
- $\sigma_f^2$ (signal variance): the prior variance of function values; sets the typical amplitude of samples.
- $\ell$ (length scale): the distance over which function values remain strongly correlated.
Key Properties:
- Stationary: the kernel depends only on the difference $x - x'$.
- Sample functions are infinitely differentiable (extremely smooth).
- Correlation decays rapidly beyond a few length scales, so distant points are nearly independent.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import cholesky

def rbf_kernel(x1, x2, length_scale=1.0, variance=1.0):
    """Squared Exponential (RBF) kernel"""
    sqdist = np.subtract.outer(x1, x2)**2
    return variance * np.exp(-0.5 * sqdist / length_scale**2)

# Visualize kernel and samples for different length scales
x = np.linspace(0, 10, 200)
length_scales = [0.5, 1.0, 2.0, 5.0]

fig, axes = plt.subplots(2, 4, figsize=(16, 8))

for idx, ls in enumerate(length_scales):
    # Plot kernel function (correlation vs distance)
    ax_kernel = axes[0, idx]
    distances = np.linspace(0, 5, 100)
    correlations = np.exp(-0.5 * distances**2 / ls**2)
    ax_kernel.plot(distances, correlations, 'b-', linewidth=2)
    ax_kernel.set_xlabel("Distance |x - x'|")
    ax_kernel.set_ylabel("k(x, x')")
    ax_kernel.set_title(f'ℓ = {ls}')
    ax_kernel.axhline(y=0.05, color='r', linestyle='--', alpha=0.5,
                      label='5% correlation')
    ax_kernel.set_ylim([0, 1.05])
    ax_kernel.grid(True, alpha=0.3)

    # Plot GP samples
    ax_samples = axes[1, idx]
    K = rbf_kernel(x, x, length_scale=ls) + 1e-8 * np.eye(len(x))
    L = cholesky(K, lower=True)
    np.random.seed(42)
    for i in range(3):
        sample = L @ np.random.randn(len(x))
        ax_samples.plot(x, sample, alpha=0.8)
    ax_samples.set_xlabel('x')
    ax_samples.set_ylabel('f(x)')
    ax_samples.set_title(f'Samples (ℓ = {ls})')
    ax_samples.grid(True, alpha=0.3)
    ax_samples.set_ylim([-3, 3])

plt.suptitle('RBF Kernel: Effect of Length Scale', fontsize=14)
plt.tight_layout()
plt.show()

print("Larger length scale → smoother, more slowly-varying functions")
print("Smaller length scale → more rapidly-varying functions")
```

The infinite smoothness of RBF samples is often unrealistic. Real physical phenomena (temperature, wind, stock prices) typically have finite-order derivatives. If your data suggests rougher behavior than RBF can capture, consider Matérn kernels. RBF is a good starting point, but not always the best choice.
The Matérn kernel family generalizes the RBF kernel by introducing a smoothness parameter $\nu$ that controls sample differentiability.
General Definition: $$k(r) = \sigma_f^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\, r}{\ell}\right)^{\nu} K_{\nu}\left(\frac{\sqrt{2\nu}\, r}{\ell}\right)$$
where $r = \|\mathbf{x} - \mathbf{x}'\|$, $\Gamma$ is the gamma function, and $K_{\nu}$ is the modified Bessel function of the second kind.
Smoothness Interpretation: Sample functions from Matérn-$\nu$ are $\lceil \nu \rceil - 1$ times differentiable.
Special Cases (commonly used closed forms):
Matérn 1/2 (Exponential): $$k(r) = \sigma_f^2 \exp\left(-\frac{r}{\ell}\right)$$ Samples are continuous but not differentiable (rough).
Matérn 3/2: $$k(r) = \sigma_f^2 \left(1 + \frac{\sqrt{3} r}{\ell}\right) \exp\left(-\frac{\sqrt{3} r}{\ell}\right)$$ Samples are once differentiable.
Matérn 5/2: $$k(r) = \sigma_f^2 \left(1 + \frac{\sqrt{5} r}{\ell} + \frac{5 r^2}{3 \ell^2}\right) \exp\left(-\frac{\sqrt{5} r}{\ell}\right)$$ Samples are twice differentiable.
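The three closed forms above are simple to implement directly. As a sketch, the following samples from each and compares a rough proxy for sample roughness (the mean absolute increment between grid points, an illustrative diagnostic, not a standard statistic):

```python
import numpy as np

def matern(r, length_scale=1.0, variance=1.0, nu=1.5):
    """Matérn kernel closed forms for nu in {0.5, 1.5, 2.5}."""
    s = np.abs(r) / length_scale
    if nu == 0.5:
        return variance * np.exp(-s)
    if nu == 1.5:
        return variance * (1 + np.sqrt(3) * s) * np.exp(-np.sqrt(3) * s)
    if nu == 2.5:
        return variance * (1 + np.sqrt(5) * s + 5 * s**2 / 3) * np.exp(-np.sqrt(5) * s)
    raise ValueError("closed form implemented only for nu = 0.5, 1.5, 2.5")

x = np.linspace(0, 5, 200)
rng = np.random.default_rng(0)
for nu in (0.5, 1.5, 2.5):
    K = matern(np.subtract.outer(x, x), nu=nu) + 1e-8 * np.eye(len(x))
    sample = np.linalg.cholesky(K) @ rng.standard_normal(len(x))
    # Smaller nu -> rougher samples -> larger increments between grid points
    print(f"nu={nu}: mean |Δf| = {np.mean(np.abs(np.diff(sample))):.3f}")
```

Running this, Matérn-1/2 samples change much more between neighboring points than Matérn-5/2 samples, matching the differentiability ordering in the table below.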
| Kernel | $\nu$ | Differentiability | Roughness | Typical Use |
|---|---|---|---|---|
| Matérn 1/2 | 0.5 | Continuous only | Very rough | Rough natural phenomena |
| Matérn 3/2 | 1.5 | Once differentiable | Moderate | Physical processes |
| Matérn 5/2 | 2.5 | Twice differentiable | Smooth | Engineering, ML default |
| RBF | $\infty$ | Infinitely differentiable | Very smooth | When smoothness assumed |
Matérn 5/2 is increasingly preferred over RBF in practice. It's smooth enough for most applications (twice differentiable) while being more realistic than RBF's infinite smoothness. Machine learning libraries (GPy, GPflow, BoTorch) often use Matérn 5/2 as the default. When in doubt, try Matérn 5/2 first.
Periodic kernels encode prior belief that the function repeats with a certain period. This is essential for modeling seasonal patterns, oscillations, and cyclic phenomena.
Standard Periodic Kernel: $$k(x, x') = \sigma_f^2 \exp\left(-\frac{2 \sin^2(\pi |x - x'| / p)}{\ell^2}\right)$$
Hyperparameters:
- $p$ (period): the distance after which the function repeats.
- $\ell$ (length scale): smoothness of the function within each period.
- $\sigma_f^2$ (signal variance): overall amplitude of the oscillation.
Derivation Insight: This kernel is obtained by mapping inputs to a circle of circumference $p$: $$x \mapsto [\cos(2\pi x / p), \sin(2\pi x / p)]$$ and applying an RBF kernel in this 2D embedding. Points separated by exactly period $p$ map to the same location!
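This derivation can be verified numerically. Since $\|u(x) - u(x')\|^2 = 4\sin^2(\pi(x - x')/p)$ for the circle embedding, an RBF applied to the embedded points matches the periodic kernel exactly (parameter values below are arbitrary):

```python
import numpy as np

def periodic_kernel(d, period=2.0, length_scale=0.7):
    """Standard periodic kernel as a function of the lag d = x - x'."""
    return np.exp(-2 * np.sin(np.pi * np.abs(d) / period) ** 2 / length_scale**2)

def rbf_on_circle(x1, x2, period=2.0, length_scale=0.7):
    """RBF applied after embedding inputs on a circle of circumference p."""
    u1 = np.array([np.cos(2 * np.pi * x1 / period), np.sin(2 * np.pi * x1 / period)])
    u2 = np.array([np.cos(2 * np.pi * x2 / period), np.sin(2 * np.pi * x2 / period)])
    return np.exp(-0.5 * np.sum((u1 - u2) ** 2) / length_scale**2)

# ||u(x) - u(x')||^2 = 4 sin^2(pi (x - x') / p), so the two agree exactly
x1, x2 = 0.3, 1.9
print(np.isclose(periodic_kernel(x1 - x2), rbf_on_circle(x1, x2)))  # True
```

In particular, a lag of exactly one period gives correlation 1: points one period apart land on the same spot on the circle.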
Locally Periodic Kernel (Periodic × RBF):
Pure periodicity extends forever—but real patterns often decay over time. Multiplying by an RBF creates functions that are locally periodic but globally decay:
$$k(x, x') = \sigma_f^2 \exp\left(-\frac{2 \sin^2(\pi |x - x'| / p)}{\ell_p^2}\right) \exp\left(-\frac{(x - x')^2}{2\ell_{\text{decay}}^2}\right)$$
This captures phenomena like damped oscillations or seasonal patterns that evolve over time.
Quasi-Periodic Kernel (Periodic + RBF):
Adding (not multiplying) periodic and RBF components captures periodicity with additional non-periodic variation:
$$k(x, x') = k_{\text{periodic}}(x, x') + k_{\text{RBF}}(x, x')$$
This models signals with both a periodic component and a smooth trend.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import cholesky

def periodic_kernel(x1, x2, period=1.0, length_scale=1.0, variance=1.0):
    """Standard periodic kernel"""
    diff = np.subtract.outer(x1, x2)
    return variance * np.exp(-2 * np.sin(np.pi * np.abs(diff) / period)**2
                             / length_scale**2)

def rbf_kernel(x1, x2, length_scale=1.0, variance=1.0):
    """RBF kernel for combination"""
    sqdist = np.subtract.outer(x1, x2)**2
    return variance * np.exp(-0.5 * sqdist / length_scale**2)

# Compare periodic, locally periodic, and quasi-periodic
np.random.seed(42)
x = np.linspace(0, 15, 300)
n = len(x)

fig, axes = plt.subplots(2, 3, figsize=(15, 8))

# Pure periodic
K_periodic = periodic_kernel(x, x, period=2.0, length_scale=0.5) + 1e-8*np.eye(n)
L = cholesky(K_periodic, lower=True)
for i in range(3):
    axes[0, 0].plot(x, L @ np.random.randn(n), alpha=0.8)
axes[0, 0].set_title('Pure Periodic (p=2)')
axes[0, 0].set_xlabel('x')
axes[0, 0].set_ylabel('f(x)')
axes[0, 0].grid(True, alpha=0.3)

# Locally periodic (Periodic × RBF)
K_local = periodic_kernel(x, x, period=2.0, length_scale=0.5) * \
          rbf_kernel(x, x, length_scale=5.0) + 1e-8*np.eye(n)
L = cholesky(K_local, lower=True)
np.random.seed(42)
for i in range(3):
    axes[0, 1].plot(x, L @ np.random.randn(n), alpha=0.8)
axes[0, 1].set_title('Locally Periodic (Periodic × RBF)')
axes[0, 1].set_xlabel('x')
axes[0, 1].grid(True, alpha=0.3)

# Quasi-periodic (Periodic + RBF)
K_quasi = periodic_kernel(x, x, period=2.0, length_scale=0.5, variance=0.5) + \
          rbf_kernel(x, x, length_scale=3.0, variance=0.5) + 1e-8*np.eye(n)
L = cholesky(K_quasi, lower=True)
np.random.seed(42)
for i in range(3):
    axes[0, 2].plot(x, L @ np.random.randn(n), alpha=0.8)
axes[0, 2].set_title('Quasi-Periodic (Periodic + RBF)')
axes[0, 2].set_xlabel('x')
axes[0, 2].grid(True, alpha=0.3)

# Show kernel functions (correlation vs lag)
lags = np.linspace(0, 10, 200)
zero = np.zeros(1)

for idx, (name, k_fn) in enumerate([
    ('Pure Periodic',
     lambda: periodic_kernel(lags, zero, period=2.0, length_scale=0.5).flatten()),
    ('Locally Periodic',
     lambda: (periodic_kernel(lags, zero, period=2.0, length_scale=0.5) *
              rbf_kernel(lags, zero, length_scale=5.0)).flatten()),
    ('Quasi-Periodic',
     lambda: (periodic_kernel(lags, zero, period=2.0, length_scale=0.5, variance=0.5) +
              rbf_kernel(lags, zero, length_scale=3.0, variance=0.5)).flatten())
]):
    axes[1, idx].plot(lags, k_fn(), 'b-', linewidth=2)
    axes[1, idx].set_xlabel('Lag τ')
    axes[1, idx].set_ylabel('k(τ)')
    axes[1, idx].set_title(f'{name} Kernel')
    axes[1, idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
```

Linear Kernel: $$k(\mathbf{x}, \mathbf{x}') = \sigma_b^2 + \sigma_v^2 (\mathbf{x} - \mathbf{c})^\top (\mathbf{x}' - \mathbf{c})$$
Hyperparameters:
- $\sigma_b^2$ (bias variance): uncertainty in the function value at the offset point $\mathbf{c}$.
- $\sigma_v^2$ (slope variance): uncertainty in the slope of the function.
- $\mathbf{c}$ (offset): the input location where the prior variance is smallest.
Properties:
- Non-stationary: the kernel depends on the inputs themselves, not just their difference.
- Sample functions are exactly linear (affine) functions.
- A GP with a linear kernel is equivalent to Bayesian linear regression.
Polynomial Kernel: $$k(\mathbf{x}, \mathbf{x}') = (\sigma_b^2 + \sigma_v^2 \mathbf{x}^\top \mathbf{x}')^p$$
where $p$ is the polynomial degree.
Properties:
- Sample functions are polynomials of degree $p$.
- Non-stationary, with variance that grows rapidly far from the origin.
- Useful when a global polynomial trend is genuinely expected; prone to wild extrapolation otherwise.
Linear + SE Combination:
A powerful pattern is combining linear and SE kernels: $$k(\mathbf{x}, \mathbf{x}') = k_{\text{linear}}(\mathbf{x}, \mathbf{x}') + k_{\text{SE}}(\mathbf{x}, \mathbf{x}')$$
This captures both a linear trend and smooth deviations from that trend. Common in modeling phenomena with known linear behavior plus noise.
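A quick sketch of this pattern (all parameter values are illustrative): draw a prior sample from the Linear + SE kernel and check that a least-squares line removes most of its variance, confirming the "trend plus smooth deviation" structure:

```python
import numpy as np

def linear_kernel(x1, x2, sigma_b=1.0, sigma_v=0.5, c=0.0):
    """Linear kernel: sigma_b^2 + sigma_v^2 (x - c)(x' - c)."""
    return sigma_b**2 + sigma_v**2 * np.outer(x1 - c, x2 - c)

def rbf_kernel(x1, x2, length_scale=1.0, variance=0.25):
    sqdist = np.subtract.outer(x1, x2) ** 2
    return variance * np.exp(-0.5 * sqdist / length_scale**2)

x = np.linspace(0, 10, 150)

# Sum of kernels = sum of independent GP components (trend + deviation)
K = linear_kernel(x, x) + rbf_kernel(x, x) + 1e-8 * np.eye(len(x))

rng = np.random.default_rng(1)
sample = np.linalg.cholesky(K) @ rng.standard_normal(len(x))

# A least-squares line captures the linear-kernel component
slope, intercept = np.polyfit(x, sample, 1)
residual = sample - (slope * x + intercept)
print(f"sample std: {sample.std():.2f}, "
      f"residual std after detrending: {residual.std():.2f}")
```

The detrended residual is exactly the kind of smooth, small-amplitude variation the SE component encodes.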
Linear kernels are perfect for expressing 'I believe the relationship is approximately linear.' Combined with other kernels, they capture trends: Linear + SE for linear trend + smooth deviation, Linear × Periodic for linearly changing amplitude of oscillation. The linear kernel is often underrated—simple but powerful.
The power of kernel methods lies in compositionality: simple kernels can be combined to create sophisticated priors that capture complex patterns.
Valid Operations on Kernels:
1. Addition: $k(\mathbf{x}, \mathbf{x}') = k_1(\mathbf{x}, \mathbf{x}') + k_2(\mathbf{x}, \mathbf{x}')$
Represents superposition of independent components. If $f_1 \sim \mathcal{GP}(0, k_1)$ and $f_2 \sim \mathcal{GP}(0, k_2)$ are independent, then $f_1 + f_2 \sim \mathcal{GP}(0, k_1 + k_2)$.
Use case: Trend + seasonal + noise decomposition.
2. Multiplication: $k(\mathbf{x}, \mathbf{x}') = k_1(\mathbf{x}, \mathbf{x}') \cdot k_2(\mathbf{x}, \mathbf{x}')$
Represents interaction/modulation. Functions from $k_1$ modulate the structure of $k_2$.
Use case: Periodicity that decays (Periodic × RBF), varying length scale (Linear × SE).
3. Scaling: $k(\mathbf{x}, \mathbf{x}') = c \cdot k_1(\mathbf{x}, \mathbf{x}')$ for $c > 0$
Scales the variance of the component.
4. Exponentiation: $k(\mathbf{x}, \mathbf{x}') = \exp(k_1(\mathbf{x}, \mathbf{x}'))$
Transforms kernel values; preserves PSD property.
5. Composition with Functions: $k(\mathbf{x}, \mathbf{x}') = k_1(g(\mathbf{x}), g(\mathbf{x}'))$
Applying any function $g$ to inputs before kernelization is valid. Used for input warping, feature extraction.
6. Tensor Sum (Additive Kernels): $k(\mathbf{x}, \mathbf{x}') = \sum_d k^{(d)}(x_d, x_d')$
Models additive function: $f(\mathbf{x}) = \sum_d f_d(x_d)$. Each dimension contributes independently.
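The tensor sum is a few lines of code. This sketch (length scales chosen arbitrarily) builds an additive kernel over a 2D input and confirms that the sum of valid kernels is still PSD:

```python
import numpy as np

def rbf_1d(a, b, length_scale=1.0):
    """1D RBF kernel between two coordinate vectors."""
    return np.exp(-0.5 * np.subtract.outer(a, b) ** 2 / length_scale**2)

def additive_kernel(X1, X2, length_scales):
    """Tensor-sum kernel: k(x, x') = sum_d k_d(x_d, x'_d)."""
    K = np.zeros((X1.shape[0], X2.shape[0]))
    for d, ls in enumerate(length_scales):
        K += rbf_1d(X1[:, d], X2[:, d], ls)
    return K

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(50, 2))  # 50 points in 2D

# Each dimension gets its own 1D kernel and length scale
K = additive_kernel(X, X, length_scales=[1.0, 2.0])

# A sum of valid kernels is valid: the Gram matrix stays PSD
print(np.all(np.linalg.eigvalsh(K) >= -1e-8))  # True
```

Samples from this prior are sums of a function of $x_1$ and a function of $x_2$, with no interaction terms; use multiplication instead when interactions matter.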
| Combination | Interpretation | Example Use |
|---|---|---|
| SE + SE | Two length scales (multi-scale variation) | Different short and long-range correlations |
| Linear + SE | Linear trend + smooth deviations | Regression with trend |
| Periodic + SE | Seasonality + long-term trend | Seasonal time series |
| Periodic × SE | Decaying periodicity | Damped oscillations |
| Linear × SE | Increasing/decreasing amplitude | Heteroscedastic patterns |
| Linear × Periodic | Growing/shrinking oscillation amplitude | Expanding seasonal effects |
The 'Automatic Statistician' line of work by David Duvenaud and collaborators searches over kernel combinations to discover structure in data. Addition corresponds to 'AND' (both components present), multiplication to 'INTERACTING' (one modulates another). Complex kernels become interpretable through this compositional lens.
In many applications, not all input dimensions are equally relevant. Some may be highly predictive, others nearly irrelevant. Automatic Relevance Determination (ARD) kernels learn this automatically by having a separate length scale per dimension.
ARD Squared Exponential: $$k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\left(-\frac{1}{2} \sum_{d=1}^{D} \frac{(x_d - x_d')^2}{\ell_d^2}\right)$$
Equivalently: $$k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\left(-\frac{1}{2} (\mathbf{x} - \mathbf{x}')^\top \mathbf{M} (\mathbf{x} - \mathbf{x}')\right)$$
where $\mathbf{M} = \text{diag}(\ell_1^{-2}, \ldots, \ell_D^{-2})$ is the inverse length scale matrix.
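As a sketch of the ARD formula (dimensions and length-scale values are illustrative), note that sending one length scale toward infinity makes the kernel ignore that input entirely:

```python
import numpy as np

def ard_rbf(X1, X2, length_scales, variance=1.0):
    """ARD squared exponential: one length scale per input dimension."""
    ls = np.asarray(length_scales, dtype=float)
    d = (X1[:, None, :] - X2[None, :, :]) / ls  # scaled pairwise differences
    return variance * np.exp(-0.5 * np.sum(d**2, axis=-1))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # 5 points in 3D

# Dimension 3 gets a huge length scale, i.e. it is treated as irrelevant
K_ard = ard_rbf(X, X, length_scales=[1.0, 1.0, 1e6])

# Compare against the same kernel with dimension 3 dropped outright
K_2d = ard_rbf(X[:, :2], X[:, :2], length_scales=[1.0, 1.0])

# With ell_3 -> infinity, the ARD kernel ignores the third input
print(np.allclose(K_ard, K_2d))  # True
```

This is exactly why large learned length scales can be read as "this feature does not matter".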
Interpreting Length Scales:
- Small $\ell_d$: the function varies quickly along dimension $d$, so that input is highly relevant.
- Large $\ell_d$: the function is nearly constant along dimension $d$, so that input is effectively irrelevant.
Feature Selection via ARD:
ARD provides soft feature selection. After training, examining $\{\ell_d\}$ reveals which inputs matter: dimensions with small length scales drive the predictions, while dimensions whose length scales are much larger than the input range contribute almost nothing.
This is more principled than manual feature selection—the data determines relevance.
Computational Cost: ARD adds $D$ hyperparameters (one per dimension). For high-dimensional problems, this can make optimization challenging. Regularization or informed priors on length scales help.
Initialize all length scales to the same value (e.g., median pairwise distance). Optimize with regularization toward this value to prevent overfitting. After training, length scales an order of magnitude larger than others indicate irrelevant dimensions. ARD is particularly valuable in scientific applications where identifying important variables is part of the goal.
The Spectral Mixture (SM) kernel, introduced by Wilson and Adams (2013), provides an expressive family that can approximate any stationary kernel.
Key Idea: By Bochner's theorem, any stationary kernel is the Fourier transform of a spectral density. The SM kernel models this density as a mixture of Gaussians:
$$k(\tau) = \sum_{q=1}^{Q} w_q \exp(-2\pi^2 \tau^2 v_q) \cos(2\pi \tau \mu_q)$$
Hyperparameters (per component $q$):
- $w_q$ (weight): the variance contributed by component $q$.
- $\mu_q$ (spectral mean): the central frequency of component $q$.
- $v_q$ (spectral variance): the bandwidth of component $q$; larger values make the oscillation decay faster in $\tau$.
Interpretation: each component is a damped cosine in the lag $\tau$, oscillating at frequency $\mu_q$ with amplitude $w_q$ and decaying at a rate set by $v_q$. Setting $\mu_q = 0$ recovers an RBF-like (non-periodic) component.
Why SM Kernels Are Powerful:
Universality: With enough components, SM can approximate any stationary kernel
Automatic pattern discovery: Optimizing $\mu_q$ values finds spectral peaks—i.e., periodicities in the data
Long-range extrapolation: Because periodicity is explicitly modeled, SM can extrapolate periodic patterns far beyond training data
Interpretability: The learned $\mu_q$ values reveal which frequencies are present in the data
Challenges:
- The marginal likelihood is highly multimodal in the SM hyperparameters, so optimization is sensitive to initialization.
- The number of components $Q$ must be chosen, and each component adds three hyperparameters.
- With many components, the kernel can latch onto spurious frequencies in small datasets.
SM kernels shine when: (1) your signal has multiple unknown periodicities, (2) long-range extrapolation of periodic patterns is needed, (3) you want to discover spectral structure automatically. They're particularly useful for audio, climate data, and any domain where Fourier analysis is natural. Start with Q=2-5 components and increase if underfitting.
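The SM formula is straightforward to implement. The sketch below (component values chosen for illustration) also verifies the $\mu_q = 0$ special case: a single zero-frequency component is exactly an RBF with $\ell = 1/(2\pi\sqrt{v})$:

```python
import numpy as np

def sm_kernel(tau, weights, means, variances):
    """Spectral Mixture kernel (1D): sum of damped cosines in the lag tau."""
    tau = np.asarray(tau, dtype=float)
    k = np.zeros_like(tau)
    for w, mu, v in zip(weights, means, variances):
        k += w * np.exp(-2 * np.pi**2 * tau**2 * v) * np.cos(2 * np.pi * tau * mu)
    return k

tau = np.linspace(0, 5, 200)

# One component at frequency mu = 0 reduces to an RBF with ell = 1/(2*pi*sqrt(v))
v = 0.05
ell = 1.0 / (2 * np.pi * np.sqrt(v))
k_sm = sm_kernel(tau, weights=[1.0], means=[0.0], variances=[v])
k_rbf = np.exp(-0.5 * tau**2 / ell**2)
print(np.allclose(k_sm, k_rbf))  # True

# A nonzero mu adds oscillation at that frequency (visible if plotted)
k_osc = sm_kernel(tau, weights=[1.0], means=[0.5], variances=[0.01])
```

Adding components with nonzero $\mu_q$ introduces oscillatory structure in the kernel, which is what lets SM priors extrapolate periodic patterns.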
Choosing the right kernel is the most important modeling decision in GP work. Here's a practical framework:
| Data Characteristic | Recommended Kernel(s) |
|---|---|
| Smooth, continuous, basic regression | RBF or Matérn 5/2 |
| Rough, non-differentiable | Matérn 1/2 or 3/2 |
| Strong periodicity | Periodic (+ RBF for trend) |
| Known linear relationship | Linear (+ RBF for deviation) |
| Multiple length scales | Sum of RBF with different $\ell$ |
| Damped oscillation | Periodic × RBF |
| Growing/shrinking variance | Linear × other kernels |
| High-dimensional with irrelevant features | ARD variant of any kernel |
| Unknown complex periodicity | Spectral Mixture |
You now have comprehensive knowledge of mean and covariance functions—the two pillars that completely specify a Gaussian Process. You understand common kernel families, their properties, how to combine them, and how to choose appropriately for your problem. Proceed to learn about sampling from GP priors and visualizing the functions your kernel implies.