A Gaussian Process is completely specified by two functions: the mean function $m(\mathbf{x})$ and the covariance function (kernel) $k(\mathbf{x}, \mathbf{x}')$. These two functions encode all our prior beliefs about the function we're trying to learn.
The mean function tells us what function values we expect a priori. The covariance function tells us how function values at different inputs are related—how correlated they are, how quickly correlations decay with distance, what patterns of variation we expect.
The art of GP modeling is choosing $m$ and $k$ to reflect genuine prior knowledge about your problem. This page provides the vocabulary and toolkit to make those choices wisely.
By the end of this page, you will:
- understand common mean function choices and when to use them,
- master the major kernel families (RBF, Matérn, periodic, linear, and more),
- learn how to combine kernels via addition and multiplication to create sophisticated priors, and
- develop intuition for how kernel hyperparameters affect function samples.
The mean function $m: \mathcal{X} \to \mathbb{R}$ specifies the expected value of the function at every input:
$$m(\mathbf{x}) = \mathbb{E}[f(\mathbf{x})]$$
It represents the 'center' of our prior distribution in function space—the single function around which samples are distributed.
Common Mean Function Choices:
1. Zero Mean: $m(\mathbf{x}) = 0$
The most common choice. After observing data, the posterior mean becomes non-zero where data informs us. Far from data, predictions revert to zero.
2. Constant Mean: $m(\mathbf{x}) = \mu_0$
Useful when you expect the function to fluctuate around a known level. The constant $\mu_0$ can be a hyperparameter learned from data.
3. Linear Mean: $m(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b$
Captures an expected linear trend. The GP then models deviations from this trend. Common in time series with known drift.
4. Polynomial Mean: $m(\mathbf{x}) = \sum_{j=0}^{p} w_j x^j$
Generalizes the linear mean to capture polynomial trends. Be careful: a high-degree polynomial mean can overfit, leaving the GP to model only small residuals.
5. Basis Function Mean: $m(\mathbf{x}) = \sum_{j=1}^{J} \beta_j h_j(\mathbf{x})$
Arbitrary basis functions $h_j$ encode domain knowledge. Examples: Fourier bases for known periodicities, physics-based functions for scientific applications.
6. Parametric Model Mean: $m(\mathbf{x}) = g(\mathbf{x}; \boldsymbol{\theta})$
Any parametric model can serve as the mean. The GP captures what the parametric model misses—a powerful 'residual modeling' approach.
Start with zero mean. It's simple, requires no extra hyperparameters, and works well after centering data. Add a non-zero mean only when: (1) you have genuine prior knowledge of trends, (2) extrapolation behavior matters and you want specific far-field behavior, or (3) the posterior fit is poor with zero mean. The kernel typically matters more than the mean.
| Type | Formula | Hyperparameters | When to Use |
|---|---|---|---|
| Zero | $0$ | None | Default; after centering data |
| Constant | $\mu_0$ | $\mu_0$ | Known baseline level |
| Linear | $\mathbf{w}^\top \mathbf{x} + b$ | $\mathbf{w}, b$ | Known linear trend |
| Polynomial | $\sum_j w_j x^j$ | $\{w_j\}$ | Known polynomial trend |
| Physics-based | $g(\mathbf{x}; \boldsymbol{\theta})$ | Model-specific | Physical laws governing system |
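To make the role of the mean concrete, here is a minimal sketch (values for the slope and intercept are illustrative) that draws a prior sample under a zero mean and under a linear mean, reusing the same correlated noise so only the mean differs:

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0, variance=1.0):
    """Squared Exponential (RBF) kernel."""
    sqdist = np.subtract.outer(x1, x2) ** 2
    return variance * np.exp(-0.5 * sqdist / length_scale**2)

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)

# Two prior means: zero, and an illustrative linear trend m(x) = 0.5x - 1
m_zero = np.zeros_like(x)
m_linear = 0.5 * x - 1.0

# Same covariance (and hence same wiggles) for both priors
K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))
L = np.linalg.cholesky(K)
z = rng.standard_normal(len(x))

# A prior sample is mean + correlated noise; only the mean differs
sample_zero = m_zero + L @ z
sample_linear = m_linear + L @ z
print(np.allclose(sample_linear - sample_zero, m_linear))  # True
```

The mean simply shifts where samples are centered; the kernel still controls all the variation around it.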
The covariance function (kernel) $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is the most important component of a GP. It determines the smoothness of sample functions, the distance over which correlations decay, the overall variance of function values, and the structural patterns (trends, periodicity, interactions) the functions exhibit.
Formal Requirements: A valid covariance function must be symmetric, $k(\mathbf{x}, \mathbf{x}') = k(\mathbf{x}', \mathbf{x})$, and must produce positive semi-definite Gram matrices for any set of inputs:
$$\mathbf{K}_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$$
with $\mathbf{K} \succeq 0$ (all eigenvalues non-negative).
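The PSD requirement is easy to check numerically. The sketch below (the `bad` "kernel" is a deliberately invalid example) builds Gram matrices and inspects their eigenvalues:

```python
import numpy as np

def gram(kernel, xs):
    """Build the Gram matrix K_ij = k(x_i, x_j)."""
    return np.array([[kernel(a, b) for b in xs] for a in xs])

def is_psd(K, tol=1e-8):
    """All eigenvalues non-negative, up to numerical tolerance."""
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

xs = np.linspace(0, 5, 30)

rbf = lambda a, b: np.exp(-0.5 * (a - b) ** 2)  # a valid kernel
bad = lambda a, b: -((a - b) ** 2)              # NOT a valid kernel

print(is_psd(gram(rbf, xs)))  # True: RBF always yields PSD Gram matrices
print(is_psd(gram(bad, xs)))  # False: negative eigenvalues appear
```

Note the tolerance: even valid kernels can produce tiny negative eigenvalues in floating point, which is why practical GP code adds a small "jitter" to the diagonal.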
Every kernel choice is a statement about what functions you consider plausible before seeing data. Choosing an RBF kernel says: 'I believe the true function is very smooth.' Choosing Matérn-1/2 says: 'I expect rougher, less predictable behavior.' Choosing a periodic kernel says: 'I expect repeating patterns.' Make these choices consciously.
The Squared Exponential (SE) kernel, also called the Radial Basis Function (RBF) or Gaussian kernel, is the most widely used kernel in GP applications.
Definition (1D): $$k(x, x') = \sigma_f^2 \exp\left(-\frac{(x - x')^2}{2\ell^2}\right)$$
Definition (Multi-dimensional): $$k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\ell^2}\right)$$
Hyperparameters:
- $\sigma_f^2$ (signal variance): the prior variance of function values; sets the typical amplitude of samples.
- $\ell$ (length scale): the distance over which function values remain strongly correlated.
Key Properties:
- Stationary: the kernel depends only on the difference $x - x'$.
- Sample functions are infinitely differentiable (extremely smooth).
- Correlation decays rapidly beyond a few length scales, so distant points are nearly independent.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import cholesky

def rbf_kernel(x1, x2, length_scale=1.0, variance=1.0):
    """Squared Exponential (RBF) kernel"""
    sqdist = np.subtract.outer(x1, x2)**2
    return variance * np.exp(-0.5 * sqdist / length_scale**2)

# Visualize kernel and samples for different length scales
x = np.linspace(0, 10, 200)
length_scales = [0.5, 1.0, 2.0, 5.0]

fig, axes = plt.subplots(2, 4, figsize=(16, 8))

for idx, ls in enumerate(length_scales):
    # Plot kernel function (correlation vs distance)
    ax_kernel = axes[0, idx]
    distances = np.linspace(0, 5, 100)
    correlations = np.exp(-0.5 * distances**2 / ls**2)
    ax_kernel.plot(distances, correlations, 'b-', linewidth=2)
    ax_kernel.set_xlabel("Distance |x - x'|")
    ax_kernel.set_ylabel("k(x, x')")
    ax_kernel.set_title(f'ℓ = {ls}')
    ax_kernel.axhline(y=0.05, color='r', linestyle='--', alpha=0.5,
                      label='5% correlation')
    ax_kernel.set_ylim([0, 1.05])
    ax_kernel.grid(True, alpha=0.3)

    # Plot GP samples
    ax_samples = axes[1, idx]
    K = rbf_kernel(x, x, length_scale=ls) + 1e-8 * np.eye(len(x))
    L = cholesky(K, lower=True)
    np.random.seed(42)
    for i in range(3):
        sample = L @ np.random.randn(len(x))
        ax_samples.plot(x, sample, alpha=0.8)
    ax_samples.set_xlabel('x')
    ax_samples.set_ylabel('f(x)')
    ax_samples.set_title(f'Samples (ℓ = {ls})')
    ax_samples.grid(True, alpha=0.3)
    ax_samples.set_ylim([-3, 3])

plt.suptitle('RBF Kernel: Effect of Length Scale', fontsize=14)
plt.tight_layout()
plt.show()

print("Larger length scale → smoother, more slowly-varying functions")
print("Smaller length scale → more rapidly-varying functions")
```

The infinite smoothness of RBF samples is often unrealistic. Real physical phenomena (temperature, wind, stock prices) typically have finite-order derivatives. If your data suggests rougher behavior than RBF can capture, consider Matérn kernels. RBF is a good starting point, but not always the best choice.
The Matérn kernel family generalizes the RBF kernel by introducing a smoothness parameter $\nu$ that controls sample differentiability.
General Definition: $$k(r) = \sigma_f^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\, r}{\ell}\right)^{\nu} K_{\nu}\left(\frac{\sqrt{2\nu}\, r}{\ell}\right)$$
where $r = \|\mathbf{x} - \mathbf{x}'\|$, $\Gamma$ is the gamma function, and $K_{\nu}$ is the modified Bessel function of the second kind.
Smoothness Interpretation: Sample functions from Matérn-$\nu$ are $\lceil \nu \rceil - 1$ times differentiable.
Special Cases (commonly used closed forms):
Matérn 1/2 (Exponential): $$k(r) = \sigma_f^2 \exp\left(-\frac{r}{\ell}\right)$$ Samples are continuous but not differentiable (rough).
Matérn 3/2: $$k(r) = \sigma_f^2 \left(1 + \frac{\sqrt{3} r}{\ell}\right) \exp\left(-\frac{\sqrt{3} r}{\ell}\right)$$ Samples are once differentiable.
Matérn 5/2: $$k(r) = \sigma_f^2 \left(1 + \frac{\sqrt{5} r}{\ell} + \frac{5 r^2}{3 \ell^2}\right) \exp\left(-\frac{\sqrt{5} r}{\ell}\right)$$ Samples are twice differentiable.
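The three closed forms above are simple to implement directly. As a sketch, the following samples from each and compares a rough proxy for sample roughness (the mean absolute increment between grid points, an illustrative diagnostic, not a standard statistic):

```python
import numpy as np

def matern(r, length_scale=1.0, variance=1.0, nu=1.5):
    """Matérn kernel closed forms for nu in {0.5, 1.5, 2.5}."""
    s = np.abs(r) / length_scale
    if nu == 0.5:
        return variance * np.exp(-s)
    if nu == 1.5:
        return variance * (1 + np.sqrt(3) * s) * np.exp(-np.sqrt(3) * s)
    if nu == 2.5:
        return variance * (1 + np.sqrt(5) * s + 5 * s**2 / 3) * np.exp(-np.sqrt(5) * s)
    raise ValueError("closed form implemented only for nu = 0.5, 1.5, 2.5")

x = np.linspace(0, 5, 200)
rng = np.random.default_rng(0)
for nu in (0.5, 1.5, 2.5):
    K = matern(np.subtract.outer(x, x), nu=nu) + 1e-8 * np.eye(len(x))
    sample = np.linalg.cholesky(K) @ rng.standard_normal(len(x))
    # Smaller nu -> rougher samples -> larger increments between grid points
    print(f"nu={nu}: mean |Δf| = {np.mean(np.abs(np.diff(sample))):.3f}")
```

Running this, Matérn-1/2 samples change much more between neighboring points than Matérn-5/2 samples, matching the differentiability ordering in the table below.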
| Kernel | $\nu$ | Differentiability | Roughness | Typical Use |
|---|---|---|---|---|
| Matérn 1/2 | 0.5 | Continuous only | Very rough | Rough natural phenomena |
| Matérn 3/2 | 1.5 | Once differentiable | Moderate | Physical processes |
| Matérn 5/2 | 2.5 | Twice differentiable | Smooth | Engineering, ML default |
| RBF | $\infty$ | Infinitely differentiable | Very smooth | When smoothness assumed |
Matérn 5/2 is increasingly preferred over RBF in practice. It's smooth enough for most applications (twice differentiable) while being more realistic than RBF's infinite smoothness. Machine learning libraries (GPy, GPflow, BoTorch) often use Matérn 5/2 as the default. When in doubt, try Matérn 5/2 first.
Periodic kernels encode prior belief that the function repeats with a certain period. This is essential for modeling seasonal patterns, oscillations, and cyclic phenomena.
Standard Periodic Kernel: $$k(x, x') = \sigma_f^2 \exp\left(-\frac{2 \sin^2(\pi |x - x'| / p)}{\ell^2}\right)$$
Hyperparameters:
- $p$ (period): the distance after which the function repeats.
- $\ell$ (length scale): smoothness of the function within each period.
- $\sigma_f^2$ (signal variance): overall amplitude of the oscillation.
Derivation Insight: This kernel is obtained by mapping inputs to a circle of circumference $p$: $$x \mapsto [\cos(2\pi x / p), \sin(2\pi x / p)]$$ and applying an RBF kernel in this 2D embedding. Points separated by exactly period $p$ map to the same location!
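This derivation can be verified numerically. Since $\|u(x) - u(x')\|^2 = 4\sin^2(\pi(x - x')/p)$ for the circle embedding, an RBF applied to the embedded points matches the periodic kernel exactly (parameter values below are arbitrary):

```python
import numpy as np

def periodic_kernel(d, period=2.0, length_scale=0.7):
    """Standard periodic kernel as a function of the lag d = x - x'."""
    return np.exp(-2 * np.sin(np.pi * np.abs(d) / period) ** 2 / length_scale**2)

def rbf_on_circle(x1, x2, period=2.0, length_scale=0.7):
    """RBF applied after embedding inputs on a circle of circumference p."""
    u1 = np.array([np.cos(2 * np.pi * x1 / period), np.sin(2 * np.pi * x1 / period)])
    u2 = np.array([np.cos(2 * np.pi * x2 / period), np.sin(2 * np.pi * x2 / period)])
    return np.exp(-0.5 * np.sum((u1 - u2) ** 2) / length_scale**2)

# ||u(x) - u(x')||^2 = 4 sin^2(pi (x - x') / p), so the two agree exactly
x1, x2 = 0.3, 1.9
print(np.isclose(periodic_kernel(x1 - x2), rbf_on_circle(x1, x2)))  # True
```

In particular, a lag of exactly one period gives correlation 1: points one period apart land on the same spot on the circle.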
Locally Periodic Kernel (Periodic × RBF):
Pure periodicity extends forever—but real patterns often decay over time. Multiplying by an RBF creates functions that are locally periodic but globally decay:
$$k(x, x') = \sigma_f^2 \exp\left(-\frac{2 \sin^2(\pi |x - x'| / p)}{\ell_p^2}\right) \exp\left(-\frac{(x - x')^2}{2\ell_{\text{decay}}^2}\right)$$
This captures phenomena like damped oscillations or seasonal patterns that evolve over time.
Quasi-Periodic Kernel (Periodic + RBF):
Adding (not multiplying) periodic and RBF components captures periodicity with additional non-periodic variation:
$$k(x, x') = k_{\text{periodic}}(x, x') + k_{\text{RBF}}(x, x')$$
This models signals with both a periodic component and a smooth trend.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import cholesky

def periodic_kernel(x1, x2, period=1.0, length_scale=1.0, variance=1.0):
    """Standard periodic kernel"""
    diff = np.subtract.outer(x1, x2)
    return variance * np.exp(-2 * np.sin(np.pi * np.abs(diff) / period)**2
                             / length_scale**2)

def rbf_kernel(x1, x2, length_scale=1.0, variance=1.0):
    """RBF kernel for combination"""
    sqdist = np.subtract.outer(x1, x2)**2
    return variance * np.exp(-0.5 * sqdist / length_scale**2)

# Compare periodic, locally periodic, and quasi-periodic
np.random.seed(42)
x = np.linspace(0, 15, 300)
n = len(x)

fig, axes = plt.subplots(2, 3, figsize=(15, 8))

# Pure periodic
K_periodic = periodic_kernel(x, x, period=2.0, length_scale=0.5) + 1e-8*np.eye(n)
L = cholesky(K_periodic, lower=True)
for i in range(3):
    axes[0, 0].plot(x, L @ np.random.randn(n), alpha=0.8)
axes[0, 0].set_title('Pure Periodic (p=2)')
axes[0, 0].set_xlabel('x')
axes[0, 0].set_ylabel('f(x)')
axes[0, 0].grid(True, alpha=0.3)

# Locally periodic (Periodic × RBF)
K_local = periodic_kernel(x, x, period=2.0, length_scale=0.5) * \
          rbf_kernel(x, x, length_scale=5.0) + 1e-8*np.eye(n)
L = cholesky(K_local, lower=True)
np.random.seed(42)
for i in range(3):
    axes[0, 1].plot(x, L @ np.random.randn(n), alpha=0.8)
axes[0, 1].set_title('Locally Periodic (Periodic × RBF)')
axes[0, 1].set_xlabel('x')
axes[0, 1].grid(True, alpha=0.3)

# Quasi-periodic (Periodic + RBF)
K_quasi = periodic_kernel(x, x, period=2.0, length_scale=0.5, variance=0.5) + \
          rbf_kernel(x, x, length_scale=3.0, variance=0.5) + 1e-8*np.eye(n)
L = cholesky(K_quasi, lower=True)
np.random.seed(42)
for i in range(3):
    axes[0, 2].plot(x, L @ np.random.randn(n), alpha=0.8)
axes[0, 2].set_title('Quasi-Periodic (Periodic + RBF)')
axes[0, 2].set_xlabel('x')
axes[0, 2].grid(True, alpha=0.3)

# Show kernel functions (correlation vs lag)
lags = np.linspace(0, 10, 200)
zero = np.zeros(1)

for idx, (name, k_fn) in enumerate([
    ('Pure Periodic',
     lambda: periodic_kernel(lags, zero, period=2.0, length_scale=0.5).flatten()),
    ('Locally Periodic',
     lambda: (periodic_kernel(lags, zero, period=2.0, length_scale=0.5) *
              rbf_kernel(lags, zero, length_scale=5.0)).flatten()),
    ('Quasi-Periodic',
     lambda: (periodic_kernel(lags, zero, period=2.0, length_scale=0.5, variance=0.5) +
              rbf_kernel(lags, zero, length_scale=3.0, variance=0.5)).flatten())
]):
    axes[1, idx].plot(lags, k_fn(), 'b-', linewidth=2)
    axes[1, idx].set_xlabel('Lag τ')
    axes[1, idx].set_ylabel('k(τ)')
    axes[1, idx].set_title(f'{name} Kernel')
    axes[1, idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
```

Linear Kernel: $$k(\mathbf{x}, \mathbf{x}') = \sigma_b^2 + \sigma_v^2 (\mathbf{x} - \mathbf{c})^\top (\mathbf{x}' - \mathbf{c})$$
Hyperparameters:
- $\sigma_b^2$ (bias variance): uncertainty in the function value at the offset point $\mathbf{c}$.
- $\sigma_v^2$ (slope variance): uncertainty in the slope of the function.
- $\mathbf{c}$ (offset): the input location where the prior variance is smallest.
Properties:
- Non-stationary: the kernel depends on the inputs themselves, not just their difference.
- Sample functions are exactly linear (affine) functions.
- A GP with a linear kernel is equivalent to Bayesian linear regression.
Polynomial Kernel: $$k(\mathbf{x}, \mathbf{x}') = (\sigma_b^2 + \sigma_v^2 \mathbf{x}^\top \mathbf{x}')^p$$
where $p$ is the polynomial degree.
Properties:
- Sample functions are polynomials of degree $p$.
- Non-stationary, with variance that grows rapidly far from the origin.
- Useful when a global polynomial trend is genuinely expected; prone to wild extrapolation otherwise.
Linear + SE Combination:
A powerful pattern is combining linear and SE kernels: $$k(\mathbf{x}, \mathbf{x}') = k_{\text{linear}}(\mathbf{x}, \mathbf{x}') + k_{\text{SE}}(\mathbf{x}, \mathbf{x}')$$
This captures both a linear trend and smooth deviations from that trend. Common in modeling phenomena with known linear behavior plus noise.
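A quick sketch of this pattern (all parameter values are illustrative): draw a prior sample from the Linear + SE kernel and check that a least-squares line removes most of its variance, confirming the "trend plus smooth deviation" structure:

```python
import numpy as np

def linear_kernel(x1, x2, sigma_b=1.0, sigma_v=0.5, c=0.0):
    """Linear kernel: sigma_b^2 + sigma_v^2 (x - c)(x' - c)."""
    return sigma_b**2 + sigma_v**2 * np.outer(x1 - c, x2 - c)

def rbf_kernel(x1, x2, length_scale=1.0, variance=0.25):
    sqdist = np.subtract.outer(x1, x2) ** 2
    return variance * np.exp(-0.5 * sqdist / length_scale**2)

x = np.linspace(0, 10, 150)

# Sum of kernels = sum of independent GP components (trend + deviation)
K = linear_kernel(x, x) + rbf_kernel(x, x) + 1e-8 * np.eye(len(x))

rng = np.random.default_rng(1)
sample = np.linalg.cholesky(K) @ rng.standard_normal(len(x))

# A least-squares line captures the linear-kernel component
slope, intercept = np.polyfit(x, sample, 1)
residual = sample - (slope * x + intercept)
print(f"sample std: {sample.std():.2f}, "
      f"residual std after detrending: {residual.std():.2f}")
```

The detrended residual is exactly the kind of smooth, small-amplitude variation the SE component encodes.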
Linear kernels are perfect for expressing 'I believe the relationship is approximately linear.' Combined with other kernels, they capture trends: Linear + SE for linear trend + smooth deviation, Linear × Periodic for linearly changing amplitude of oscillation. The linear kernel is often underrated—simple but powerful.
The power of kernel methods lies in compositionality: simple kernels can be combined to create sophisticated priors that capture complex patterns.
Valid Operations on Kernels:
1. Addition: $k(\mathbf{x}, \mathbf{x}') = k_1(\mathbf{x}, \mathbf{x}') + k_2(\mathbf{x}, \mathbf{x}')$
Represents superposition of independent components. If $f_1 \sim \mathcal{GP}(0, k_1)$ and $f_2 \sim \mathcal{GP}(0, k_2)$ are independent, then $f_1 + f_2 \sim \mathcal{GP}(0, k_1 + k_2)$.
Use case: Trend + seasonal + noise decomposition.
2. Multiplication: $k(\mathbf{x}, \mathbf{x}') = k_1(\mathbf{x}, \mathbf{x}') \cdot k_2(\mathbf{x}, \mathbf{x}')$
Represents interaction/modulation. Functions from $k_1$ modulate the structure of $k_2$.
Use case: Periodicity that decays (Periodic × RBF), varying length scale (Linear × SE).
3. Scaling: $k(\mathbf{x}, \mathbf{x}') = c \cdot k_1(\mathbf{x}, \mathbf{x}')$ for $c > 0$
Scales the variance of the component.
4. Exponentiation: $k(\mathbf{x}, \mathbf{x}') = \exp(k_1(\mathbf{x}, \mathbf{x}'))$
Transforms kernel values; preserves PSD property.
5. Composition with Functions: $k(\mathbf{x}, \mathbf{x}') = k_1(g(\mathbf{x}), g(\mathbf{x}'))$
Applying any function $g$ to inputs before kernelization is valid. Used for input warping, feature extraction.
6. Tensor Sum (Additive Kernels): $k(\mathbf{x}, \mathbf{x}') = \sum_d k^{(d)}(x_d, x_d')$
Models additive function: $f(\mathbf{x}) = \sum_d f_d(x_d)$. Each dimension contributes independently.
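The tensor sum is a few lines of code. This sketch (length scales chosen arbitrarily) builds an additive kernel over a 2D input and confirms that the sum of valid kernels is still PSD:

```python
import numpy as np

def rbf_1d(a, b, length_scale=1.0):
    """1D RBF kernel between two coordinate vectors."""
    return np.exp(-0.5 * np.subtract.outer(a, b) ** 2 / length_scale**2)

def additive_kernel(X1, X2, length_scales):
    """Tensor-sum kernel: k(x, x') = sum_d k_d(x_d, x'_d)."""
    K = np.zeros((X1.shape[0], X2.shape[0]))
    for d, ls in enumerate(length_scales):
        K += rbf_1d(X1[:, d], X2[:, d], ls)
    return K

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(50, 2))  # 50 points in 2D

# Each dimension gets its own 1D kernel and length scale
K = additive_kernel(X, X, length_scales=[1.0, 2.0])

# A sum of valid kernels is valid: the Gram matrix stays PSD
print(np.all(np.linalg.eigvalsh(K) >= -1e-8))  # True
```

Samples from this prior are sums of a function of $x_1$ and a function of $x_2$, with no interaction terms; use multiplication instead when interactions matter.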
| Combination | Interpretation | Example Use |
|---|---|---|
| SE + SE | Two length scales (multi-scale variation) | Different short and long-range correlations |
| Linear + SE | Linear trend + smooth deviations | Regression with trend |
| Periodic + SE | Seasonality + long-term trend | Seasonal time series |
| Periodic × SE | Decaying periodicity | Damped oscillations |
| Linear × SE | Increasing/decreasing amplitude | Heteroscedastic patterns |
| Linear × Periodic | Growing/shrinking oscillation amplitude | Expanding seasonal effects |
The 'Automatic Statistician' line of work by David Duvenaud and collaborators searches over kernel combinations to discover structure in data. Addition corresponds to 'AND' (both components present), multiplication to 'INTERACTING' (one modulates another). Complex kernels become interpretable through this compositional lens.
In many applications, not all input dimensions are equally relevant. Some may be highly predictive, others nearly irrelevant. Automatic Relevance Determination (ARD) kernels learn this automatically by having a separate length scale per dimension.
ARD Squared Exponential: $$k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\left(-\frac{1}{2} \sum_{d=1}^{D} \frac{(x_d - x_d')^2}{\ell_d^2}\right)$$
Equivalently: $$k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\left(-\frac{1}{2} (\mathbf{x} - \mathbf{x}')^\top \mathbf{M} (\mathbf{x} - \mathbf{x}')\right)$$
where $\mathbf{M} = \text{diag}(\ell_1^{-2}, \ldots, \ell_D^{-2})$ is the inverse length scale matrix.
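As a sketch of the ARD formula (dimensions and length-scale values are illustrative), note that sending one length scale toward infinity makes the kernel ignore that input entirely:

```python
import numpy as np

def ard_rbf(X1, X2, length_scales, variance=1.0):
    """ARD squared exponential: one length scale per input dimension."""
    ls = np.asarray(length_scales, dtype=float)
    d = (X1[:, None, :] - X2[None, :, :]) / ls  # scaled pairwise differences
    return variance * np.exp(-0.5 * np.sum(d**2, axis=-1))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # 5 points in 3D

# Dimension 3 gets a huge length scale, i.e. it is treated as irrelevant
K_ard = ard_rbf(X, X, length_scales=[1.0, 1.0, 1e6])

# Compare against the same kernel with dimension 3 dropped outright
K_2d = ard_rbf(X[:, :2], X[:, :2], length_scales=[1.0, 1.0])

# With ell_3 -> infinity, the ARD kernel ignores the third input
print(np.allclose(K_ard, K_2d))  # True
```

This is exactly why large learned length scales can be read as "this feature does not matter".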
Interpreting Length Scales:
- Small $\ell_d$: the function varies quickly along dimension $d$, so that input is highly relevant.
- Large $\ell_d$: the function is nearly constant along dimension $d$, so that input is effectively irrelevant.
Feature Selection via ARD:
ARD provides soft feature selection. After training, examining $\{\ell_d\}$ reveals which inputs matter: dimensions with small length scales drive the predictions, while dimensions whose length scales are much larger than the input range contribute almost nothing.
This is more principled than manual feature selection—the data determines relevance.
Computational Cost: ARD adds $D$ hyperparameters (one per dimension). For high-dimensional problems, this can make optimization challenging. Regularization or informed priors on length scales help.
Initialize all length scales to the same value (e.g., median pairwise distance). Optimize with regularization toward this value to prevent overfitting. After training, length scales an order of magnitude larger than others indicate irrelevant dimensions. ARD is particularly valuable in scientific applications where identifying important variables is part of the goal.
The Spectral Mixture (SM) kernel, introduced by Wilson and Adams (2013), provides an expressive family that can approximate any stationary kernel.
Key Idea: By Bochner's theorem, any stationary kernel is the Fourier transform of a spectral density. The SM kernel models this density as a mixture of Gaussians:
$$k(\tau) = \sum_{q=1}^{Q} w_q \exp(-2\pi^2 \tau^2 v_q) \cos(2\pi \tau \mu_q)$$
Hyperparameters (per component $q$):
- $w_q$ (weight): the variance contributed by component $q$.
- $\mu_q$ (spectral mean): the central frequency of component $q$.
- $v_q$ (spectral variance): the bandwidth of component $q$; larger values make the oscillation decay faster in $\tau$.
Interpretation: each component is a damped cosine in the lag $\tau$, oscillating at frequency $\mu_q$ with amplitude $w_q$ and decaying at a rate set by $v_q$. Setting $\mu_q = 0$ recovers an RBF-like (non-periodic) component.
Why SM Kernels Are Powerful:
Universality: With enough components, SM can approximate any stationary kernel
Automatic pattern discovery: Optimizing $\mu_q$ values finds spectral peaks—i.e., periodicities in the data
Long-range extrapolation: Because periodicity is explicitly modeled, SM can extrapolate periodic patterns far beyond training data
Interpretability: The learned $\mu_q$ values reveal which frequencies are present in the data
Challenges:
- The marginal likelihood is highly multimodal in the SM hyperparameters, so optimization is sensitive to initialization.
- The number of components $Q$ must be chosen, and each component adds three hyperparameters.
- With many components, the kernel can latch onto spurious frequencies in small datasets.
SM kernels shine when: (1) your signal has multiple unknown periodicities, (2) long-range extrapolation of periodic patterns is needed, (3) you want to discover spectral structure automatically. They're particularly useful for audio, climate data, and any domain where Fourier analysis is natural. Start with Q=2-5 components and increase if underfitting.
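The SM formula is straightforward to implement. The sketch below (component values chosen for illustration) also verifies the $\mu_q = 0$ special case: a single zero-frequency component is exactly an RBF with $\ell = 1/(2\pi\sqrt{v})$:

```python
import numpy as np

def sm_kernel(tau, weights, means, variances):
    """Spectral Mixture kernel (1D): sum of damped cosines in the lag tau."""
    tau = np.asarray(tau, dtype=float)
    k = np.zeros_like(tau)
    for w, mu, v in zip(weights, means, variances):
        k += w * np.exp(-2 * np.pi**2 * tau**2 * v) * np.cos(2 * np.pi * tau * mu)
    return k

tau = np.linspace(0, 5, 200)

# One component at frequency mu = 0 reduces to an RBF with ell = 1/(2*pi*sqrt(v))
v = 0.05
ell = 1.0 / (2 * np.pi * np.sqrt(v))
k_sm = sm_kernel(tau, weights=[1.0], means=[0.0], variances=[v])
k_rbf = np.exp(-0.5 * tau**2 / ell**2)
print(np.allclose(k_sm, k_rbf))  # True

# A nonzero mu adds oscillation at that frequency (visible if plotted)
k_osc = sm_kernel(tau, weights=[1.0], means=[0.5], variances=[0.01])
```

Adding components with nonzero $\mu_q$ introduces oscillatory structure in the kernel, which is what lets SM priors extrapolate periodic patterns.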
Choosing the right kernel is the most important modeling decision in GP work. Here's a practical framework:
| Data Characteristic | Recommended Kernel(s) |
|---|---|
| Smooth, continuous, basic regression | RBF or Matérn 5/2 |
| Rough, non-differentiable | Matérn 1/2 or 3/2 |
| Strong periodicity | Periodic (+ RBF for trend) |
| Known linear relationship | Linear (+ RBF for deviation) |
| Multiple length scales | Sum of RBF with different $\ell$ |
| Damped oscillation | Periodic × RBF |
| Growing/shrinking variance | Linear × other kernels |
| High-dimensional with irrelevant features | ARD variant of any kernel |
| Unknown complex periodicity | Spectral Mixture |
You now have comprehensive knowledge of mean and covariance functions—the two pillars that completely specify a Gaussian Process. You understand common kernel families, their properties, how to combine them, and how to choose appropriately for your problem. Proceed to learn about sampling from GP priors and visualizing the functions your kernel implies.