If Gaussian Processes are the engine of non-parametric Bayesian inference, then kernels are the soul that determines what kinds of functions we believe in. Every choice we make about the kernel directly translates into assumptions about function smoothness, periodicity, correlation length, and ultimate behavior.
This page focuses on stationary kernels—also called shift-invariant kernels—which represent the most widely-used class of covariance functions. Stationary kernels encode the idea that the statistical relationship between function values depends only on the distance between input points, not their absolute location. This property makes them extraordinarily versatile for modeling phenomena where we expect translation invariance.
By the end of this page, you will understand: (1) the formal definition and properties of stationary kernels, (2) the mathematical structure of the RBF, Matérn, and Periodic kernels, (3) how each kernel encodes different prior beliefs about function smoothness, (4) the spectral interpretation of stationarity, and (5) practical guidelines for selecting stationary kernels for your applications.
Before diving into specific kernels, we must establish precisely what stationarity means in the context of covariance functions. This concept is foundational to understanding why certain kernels are appropriate for certain problems.
Definition: Stationary Kernel
A kernel (covariance function) $k(\mathbf{x}, \mathbf{x}')$ is stationary if it can be expressed as a function of only the difference $\mathbf{x} - \mathbf{x}'$ rather than the absolute positions:
$$k(\mathbf{x}, \mathbf{x}') = k(\mathbf{x} - \mathbf{x}')$$
This means the covariance between any two function values $f(\mathbf{x})$ and $f(\mathbf{x}')$ depends only on the vector separating them, not where they are in input space. If we translate both points by the same amount, the covariance remains unchanged.
Definition: Isotropic Kernel
A stationary kernel is isotropic (or radial) if it depends only on the Euclidean distance $r = \|\mathbf{x} - \mathbf{x}'\|$:
$$k(\mathbf{x}, \mathbf{x}') = k(r) \quad \text{where } r = \|\mathbf{x} - \mathbf{x}'\|$$
Isotropic kernels are rotationally invariant—they treat all directions in input space identically. Most common stationary kernels are isotropic.
| Kernel Type | Depends On | Invariance Property | Example |
|---|---|---|---|
| Stationary & Isotropic | Distance $r = \|\mathbf{x} - \mathbf{x}'\|$ | Translation + Rotation | RBF, Matérn |
| Stationary but Anisotropic | Vector $\mathbf{x} - \mathbf{x}'$ | Translation only | ARD kernels |
| Non-stationary | Absolute positions $\mathbf{x}, \mathbf{x}'$ | Neither | Neural Network kernel, Polynomial kernel |
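The invariance properties in the table above are easy to check numerically. The sketch below is an illustration added for intuition (the kernel, points, shift, and rotation angle are arbitrary choices, not from the original text): it verifies that an isotropic RBF-style kernel value is unchanged when both inputs are translated by the same vector or rotated about the origin.

```python
import numpy as np

def rbf(x, x_prime, length_scale=1.0):
    """Isotropic RBF kernel between two points (illustrative helper)."""
    r2 = np.sum((x - x_prime) ** 2)
    return np.exp(-r2 / (2 * length_scale**2))

rng = np.random.default_rng(0)
x, x_prime = rng.normal(size=2), rng.normal(size=2)

# Translation invariance (stationarity): shift both points by the same vector
shift = np.array([3.7, -1.2])
print(np.isclose(rbf(x, x_prime), rbf(x + shift, x_prime + shift)))  # True

# Rotation invariance (isotropy): rotate both points about the origin
theta = 0.8
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
print(np.isclose(rbf(x, x_prime), rbf(R @ x, R @ x_prime)))  # True
```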
Stationary kernels encode the prior belief that the underlying function has the same statistical properties everywhere in the input domain. This is appropriate when there's no reason to believe that correlations or smoothness should change across the input space. For example, temperature variations across a homogeneous field, or signal processing where the noise characteristics don't change over time.
Connection to Stochastic Processes
The term 'stationary' in kernel theory directly connects to stationarity in stochastic processes. A Gaussian Process with a stationary kernel is a wide-sense stationary stochastic process, meaning its mean function is constant and the covariance between $f(\mathbf{x})$ and $f(\mathbf{x} + \boldsymbol{\tau})$ depends only on the lag $\boldsymbol{\tau}$.
This connection provides powerful tools for analysis. In particular, Bochner's theorem gives us a complete characterization: a continuous, stationary kernel is positive definite if and only if it is the Fourier transform of a finite, non-negative measure (the spectral measure). This spectral representation is the foundation of spectral analysis of GPs, which we'll explore in a later page.
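As a small numerical illustration of this spectral view (a sketch added for intuition; the 1-D RBF kernel, grid, and frequencies below are arbitrary choices), the code compares a numerical Fourier transform of the RBF covariance $k(\tau) = \sigma^2 \exp(-\tau^2 / 2\ell^2)$ with its known Gaussian spectral density $S(\omega) = \sigma^2 \ell \sqrt{2\pi} \exp(-\ell^2 \omega^2 / 2)$. The two agree and are non-negative everywhere, exactly as Bochner's theorem requires.

```python
import numpy as np

sigma_sq, ell = 1.0, 0.7

# Stationary covariance as a function of the lag tau
tau = np.linspace(-30, 30, 20001)
dtau = tau[1] - tau[0]
k_tau = sigma_sq * np.exp(-tau**2 / (2 * ell**2))

# Numerical spectral density: S(w) = integral of k(tau) * cos(w * tau) dtau
omegas = np.linspace(0.0, 5.0, 50)
S_numeric = np.array([np.sum(k_tau * np.cos(w * tau)) * dtau for w in omegas])

# Closed-form spectral density of the 1-D RBF kernel (Gaussian in frequency)
S_analytic = sigma_sq * ell * np.sqrt(2 * np.pi) * np.exp(-(ell**2) * omegas**2 / 2)

print(np.max(np.abs(S_numeric - S_analytic)))  # tiny: numerical and analytic agree
print(np.all(S_numeric >= -1e-10))             # non-negative, as Bochner requires
```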
The Positive Definiteness Requirement
Not every function of $\|\mathbf{x} - \mathbf{x}'\|$ is a valid kernel. A kernel must be positive semi-definite (PSD): for any set of points $\{\mathbf{x}_1, \dots, \mathbf{x}_n\}$, the resulting covariance matrix $\mathbf{K}$ with entries $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$ must be positive semi-definite (all eigenvalues $\geq 0$).
This constraint ensures that the GP defines a valid probability distribution. Violating it leads to nonsensical 'negative variances.' Fortunately, the kernels we discuss here are all provably PSD.
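A quick numerical sanity check of this property (my own illustration; the point set and hyperparameters are arbitrary): build the covariance matrix on some random inputs and confirm that its eigenvalues are non-negative up to floating-point round-off.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))  # 50 random points in 2-D

# RBF covariance matrix on these points (length-scale 1.5)
K = np.exp(-cdist(X, X, metric='sqeuclidean') / (2 * 1.5**2))

# Eigenvalues of a PSD matrix are >= 0 (up to numerical round-off)
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min())             # tiny negative values (~-1e-15) are round-off only
print(np.all(eigvals > -1e-10))  # True: the kernel matrix is PSD
```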
The Radial Basis Function (RBF) kernel, also known as the Squared Exponential or Gaussian kernel, is the most widely used kernel in GP practice. Its ubiquity stems from its simplicity, smoothness properties, and well-understood behavior.
Mathematical Definition
$$k_{\text{RBF}}(\mathbf{x}, \mathbf{x}') = \sigma^2 \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\ell^2}\right)$$
Where:
- $\sigma^2$ is the signal variance (output scale), setting the overall amplitude of function variation
- $\ell$ is the length-scale, controlling how quickly correlation decays with distance
Alternatively, using the squared distance $r^2 = \|\mathbf{x} - \mathbf{x}'\|^2$:
$$k_{\text{RBF}}(r) = \sigma^2 \exp\left(-\frac{r^2}{2\ell^2}\right)$$
The RBF kernel produces functions that are infinitely differentiable (C^∞). This is both its greatest strength and its main weakness: it is ideal for modeling smooth phenomena, but it imposes unrealistically smooth functions on data with rough or irregular behavior. The mean-squared differentiability of a GP is determined by how smooth its kernel is at the origin ($r = 0$).
Understanding the Length-Scale Parameter
The length-scale $\ell$ is perhaps the most important hyperparameter in GP modeling. It answers the question: over what distance do function values remain correlated?
Rule of Thumb: If $\|\mathbf{x} - \mathbf{x}'\| \gg \ell$, then $k(\mathbf{x}, \mathbf{x}') \approx 0$. Points separated by more than a few length-scales are essentially uncorrelated.
Quantitative Relationship:
| Distance | Correlation (as fraction of $\sigma^2$) |
|---|---|
| $r = 0$ | 1.000 (perfect correlation) |
| $r = \ell$ | 0.606 |
| $r = 2\ell$ | 0.135 |
| $r = 3\ell$ | 0.011 |
This exponential decay is characteristic of the RBF kernel.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist

def rbf_kernel(X1, X2, sigma_sq=1.0, length_scale=1.0):
    """
    Compute the RBF (Squared Exponential) kernel matrix.

    Parameters:
    -----------
    X1 : ndarray of shape (n1, d)
        First set of input points
    X2 : ndarray of shape (n2, d)
        Second set of input points
    sigma_sq : float
        Signal variance (output scale)
    length_scale : float
        Length-scale (determines correlation decay rate)

    Returns:
    --------
    K : ndarray of shape (n1, n2)
        Kernel matrix where K[i,j] = k(X1[i], X2[j])

    Mathematical Definition:
        k(x, x') = σ² · exp(-||x - x'||² / (2ℓ²))
    """
    # Compute pairwise squared Euclidean distances
    sq_dists = cdist(X1, X2, metric='sqeuclidean')

    # Apply the RBF kernel formula
    K = sigma_sq * np.exp(-sq_dists / (2 * length_scale**2))

    return K

def visualize_rbf_samples(length_scales=[0.5, 1.0, 2.0], n_samples=5):
    """
    Visualize GP samples with different length-scales to build intuition.
    """
    X = np.linspace(0, 10, 200).reshape(-1, 1)
    fig, axes = plt.subplots(1, len(length_scales), figsize=(15, 4))

    for ax, ell in zip(axes, length_scales):
        K = rbf_kernel(X, X, sigma_sq=1.0, length_scale=ell)
        # Add small jitter for numerical stability
        K += 1e-8 * np.eye(len(X))

        # Sample from the GP prior
        samples = np.random.multivariate_normal(
            mean=np.zeros(len(X)), cov=K, size=n_samples
        )

        for sample in samples:
            ax.plot(X, sample, alpha=0.7)

        ax.set_title(f'Length-scale ℓ = {ell}')
        ax.set_xlabel('x')
        ax.set_ylabel('f(x)')
        ax.grid(True, alpha=0.3)

    plt.suptitle('RBF Kernel: Effect of Length-Scale on GP Samples')
    plt.tight_layout()
    return fig

# Example: Compare correlation decay
distances = np.array([0, 0.5, 1.0, 1.5, 2.0, 3.0])
ell = 1.0
correlations = np.exp(-distances**2 / (2 * ell**2))

print("Distance vs Correlation (ℓ = 1.0):")
for d, c in zip(distances, correlations):
    print(f"  r = {d:.1f}: correlation = {c:.4f}")
```
When to Use the RBF Kernel
✅ Appropriate for:
- Smooth, slowly varying functions (e.g., well-behaved physical quantities)
- Interpolation problems where the underlying surface is believed to be very regular
- A simple, well-understood default when smoothness is a reasonable assumption
❌ Inappropriate for:
- Rough or irregular functions with sharp local variation
- Data with abrupt transitions, kinks, or discontinuities
- Applications where honest uncertainty between observations matters more than smooth interpolation
The Over-Smoothing Problem
A notorious issue with the RBF kernel is its tendency toward over-smoothing or over-confidence. Because it assumes infinite differentiability, the posterior predictions can be overly certain in regions between data points, failing to capture the true uncertainty when the underlying function is rough.
The Matérn kernel family addresses the RBF kernel's over-smoothness problem by providing explicit control over function differentiability. This makes it arguably the most versatile family of stationary kernels for practical applications.
Mathematical Definition
The general Matérn kernel is:
$$k_{\text{Matérn}}(r) = \sigma^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\, r}{\ell}\right)^\nu K_\nu\left(\frac{\sqrt{2\nu}\, r}{\ell}\right)$$
Where:
- $r = \|\mathbf{x} - \mathbf{x}'\|$ is the Euclidean distance between inputs
- $\sigma^2$ is the signal variance and $\ell$ is the length-scale
- $\nu > 0$ is the smoothness parameter
- $\Gamma(\cdot)$ is the gamma function and $K_\nu(\cdot)$ is the modified Bessel function of the second kind
The Smoothness Parameter ν
The parameter $\nu$ directly controls the mean-squared differentiability of the GP samples:
| ν Value | Common Name | Differentiability | Formula (Simplified) | Typical Use Case |
|---|---|---|---|---|
| ν = 1/2 | Exponential / Matérn-1/2 | C^0 (continuous, not differentiable) | $\sigma^2 \exp(-r/\ell)$ | Rough, non-smooth functions (e.g., the Ornstein-Uhlenbeck process) |
| ν = 3/2 | Matérn-3/2 | C^1 (once differentiable) | $\sigma^2 (1 + \frac{\sqrt{3}r}{\ell}) \exp(-\frac{\sqrt{3}r}{\ell})$ | Moderately smooth functions, physical processes |
| ν = 5/2 | Matérn-5/2 | C^2 (twice differentiable) | $\sigma^2 (1 + \frac{\sqrt{5}r}{\ell} + \frac{5r^2}{3\ell^2}) \exp(-\frac{\sqrt{5}r}{\ell})$ | Smooth functions, most common default |
| ν → ∞ | RBF / Squared Exponential | C^∞ (infinitely differentiable) | $\sigma^2 \exp(-\frac{r^2}{2\ell^2})$ | Very smooth, analytic functions |
In practice, Matérn-5/2 has emerged as the default recommendation for many applications. It's smooth enough for typical physical processes (twice differentiable) while avoiding the over-confidence issues of the RBF kernel. Many practitioners and libraries (like GPyTorch, GPflow) use Matérn-5/2 as their default kernel.
```python
import numpy as np
from scipy.special import gamma, kv as bessel_k
from scipy.spatial.distance import cdist

def matern_kernel(X1, X2, sigma_sq=1.0, length_scale=1.0, nu=2.5):
    """
    Compute the Matérn kernel matrix.

    Parameters:
    -----------
    X1 : ndarray of shape (n1, d)
    X2 : ndarray of shape (n2, d)
    sigma_sq : float
        Signal variance
    length_scale : float
        Length-scale parameter
    nu : float
        Smoothness parameter (common: 0.5, 1.5, 2.5)

    Returns:
    --------
    K : ndarray of shape (n1, n2)
    """
    # Compute pairwise Euclidean distances
    dists = cdist(X1, X2, metric='euclidean')

    # Handle special cases efficiently
    if nu == 0.5:
        # Matérn-1/2: Exponential kernel
        return sigma_sq * np.exp(-dists / length_scale)
    elif nu == 1.5:
        # Matérn-3/2
        sqrt3_r_over_ell = np.sqrt(3) * dists / length_scale
        return sigma_sq * (1 + sqrt3_r_over_ell) * np.exp(-sqrt3_r_over_ell)
    elif nu == 2.5:
        # Matérn-5/2
        sqrt5_r_over_ell = np.sqrt(5) * dists / length_scale
        return sigma_sq * (1 + sqrt5_r_over_ell + (sqrt5_r_over_ell**2) / 3) * np.exp(-sqrt5_r_over_ell)
    else:
        # General case using Bessel function
        scaled = np.sqrt(2 * nu) * dists / length_scale
        # Handle r=0 separately (limit is 1)
        K = np.ones_like(dists) * sigma_sq
        nonzero = dists > 1e-10
        K[nonzero] = sigma_sq * (
            (2**(1 - nu)) / gamma(nu)
            * (scaled[nonzero]**nu)
            * bessel_k(nu, scaled[nonzero])
        )
        return K

def compare_matern_smoothness():
    """
    Visualize how the smoothness parameter affects GP samples.
    This demonstrates the practical difference between ν values.
    """
    import matplotlib.pyplot as plt

    X = np.linspace(0, 10, 300).reshape(-1, 1)
    nu_values = [0.5, 1.5, 2.5, 5.0]

    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    axes = axes.flatten()

    np.random.seed(42)  # For reproducibility

    for ax, nu in zip(axes, nu_values):
        K = matern_kernel(X, X, sigma_sq=1.0, length_scale=1.0, nu=nu)
        K += 1e-6 * np.eye(len(X))  # Numerical stability

        # Sample from GP prior
        L = np.linalg.cholesky(K)
        samples = L @ np.random.randn(len(X), 5)

        for i in range(samples.shape[1]):
            ax.plot(X, samples[:, i], alpha=0.7)

        diff_order = int(np.ceil(nu) - 1) if nu < 10 else '∞'
        ax.set_title(f'Matérn-{nu} (C^{diff_order} smoothness)')
        ax.set_xlabel('x')
        ax.set_ylabel('f(x)')
        ax.grid(True, alpha=0.3)

    plt.suptitle('Matérn Kernel: Effect of Smoothness Parameter ν', fontsize=14)
    plt.tight_layout()
    return fig
```
Choosing the Smoothness Parameter
Unlike the length-scale, which is typically learned from data, the smoothness parameter $\nu$ often requires domain knowledge or careful consideration:
Physical constraints: If derivatives of the modeled function have physical meaning (velocity, acceleration), choose $\nu$ to ensure they exist
Visual inspection: Examine your data—does it appear smooth or rough? Sharp transitions suggest lower $\nu$
Model comparison: Use marginal likelihood to compare different $\nu$ values
Default to Matérn-5/2: When uncertain, this provides a good balance of smoothness and flexibility
Why Not Learn ν?
While $\nu$ can be treated as a hyperparameter to optimize, this is often problematic:
- The marginal likelihood is frequently nearly flat in $\nu$, so finite datasets rarely identify it well
- The general-$\nu$ kernel requires modified Bessel function evaluations, which are slower and numerically more delicate than the closed-form special cases
- Gradients with respect to $\nu$ are awkward to compute, making gradient-based optimization unreliable
For these reasons, practitioners usually fix $\nu$ to one of the special cases (0.5, 1.5, 2.5) and learn only $\sigma^2$ and $\ell$.
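To make the model-comparison route concrete, here is a sketch of scoring the candidate values of $\nu$ by log marginal likelihood. This is my own illustration: it reuses the matern_kernel function defined above, uses a made-up synthetic dataset, and keeps the other hyperparameters fixed for brevity (in practice they would be optimized separately for each $\nu$).

```python
import numpy as np

def log_marginal_likelihood(X, y, nu, sigma_sq=1.0, length_scale=1.0, noise_var=0.1):
    """Log evidence log p(y | X, theta) of a GP with Gaussian noise, for a fixed nu."""
    K = matern_kernel(X, X, sigma_sq=sigma_sq, length_scale=length_scale, nu=nu)
    K += noise_var * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # -0.5 y^T K^{-1} y - 0.5 log|K| - n/2 log(2*pi)
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))
            - 0.5 * len(y) * np.log(2 * np.pi))

# Synthetic example: a moderately rough function observed with noise
rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 10, 40)).reshape(-1, 1)
y = np.sin(X[:, 0]) + 0.3 * np.abs(np.sin(3 * X[:, 0])) + 0.1 * rng.standard_normal(40)

for nu in [0.5, 1.5, 2.5]:
    print(f"nu = {nu}: log marginal likelihood = {log_marginal_likelihood(X, y, nu):.2f}")
# Keep the nu with the highest value rather than optimizing nu directly.
```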
Many real-world phenomena exhibit periodicity: daily temperature cycles, seasonal patterns in sales data, weekly traffic patterns, or annual biological rhythms. The periodic kernel encodes this structure directly into the GP prior.
Mathematical Definition
$$k_{\text{Periodic}}(\mathbf{x}, \mathbf{x}') = \sigma^2 \exp\left(-\frac{2\sin^2\left(\pi \|\mathbf{x} - \mathbf{x}'\| / p\right)}{\ell^2}\right)$$
Where:
- $\sigma^2$ is the signal variance
- $p$ is the period, the distance after which the pattern repeats
- $\ell$ is the length-scale, controlling how smoothly the function varies within one period
Key Properties
Exact periodicity: $k(x, x') = k(x+p, x'+p)$ — the covariance structure repeats exactly every $p$ units
Cross-period correlation: Points separated by exactly one period ($|x - x'| = p$) have the same correlation as points at the same location ($|x - x'| = 0$)
Infinite repetition: The kernel assumes the periodic pattern continues forever in both directions
The standard periodic kernel assumes perfect, unchanging periodicity. Real data often has periods that evolve over time (e.g., lengthening days across months) or decay in amplitude. For such cases, consider multiplying the periodic kernel by an RBF kernel (creating a 'locally periodic' kernel) or using specialized quasi-periodic kernels.
```python
import numpy as np
from scipy.spatial.distance import cdist

def periodic_kernel(X1, X2, sigma_sq=1.0, period=1.0, length_scale=1.0):
    """
    Compute the Periodic (ExpSineSquared) kernel matrix.

    Parameters:
    -----------
    X1 : ndarray of shape (n1, d)
    X2 : ndarray of shape (n2, d)
    sigma_sq : float
        Signal variance
    period : float
        Period of the repeating pattern
    length_scale : float
        Length-scale (controls smoothness within period)

    Returns:
    --------
    K : ndarray of shape (n1, n2)
    """
    # For 1D data, compute distance directly
    if X1.ndim == 1:
        X1 = X1.reshape(-1, 1)
    if X2.ndim == 1:
        X2 = X2.reshape(-1, 1)

    # Compute pairwise distances (for 1D, this is just |x - x'|)
    dists = cdist(X1, X2, metric='euclidean')

    # Apply periodic kernel formula
    sin_term = np.sin(np.pi * dists / period)
    K = sigma_sq * np.exp(-2 * (sin_term / length_scale)**2)

    return K

def locally_periodic_kernel(X1, X2, sigma_sq=1.0, period=1.0,
                            length_scale_periodic=1.0, length_scale_decay=10.0):
    """
    Locally Periodic Kernel: Periodic × RBF

    This models patterns that are approximately periodic but can change
    smoothly over time (e.g., amplitude decay, period drift).

    k(x,x') = k_periodic(x,x') × k_rbf(x,x')
    """
    # Periodic component
    dists = cdist(X1, X2, metric='euclidean')
    sin_term = np.sin(np.pi * dists / period)
    K_periodic = np.exp(-2 * (sin_term / length_scale_periodic)**2)

    # RBF decay component
    sq_dists = cdist(X1, X2, metric='sqeuclidean')
    K_rbf = np.exp(-sq_dists / (2 * length_scale_decay**2))

    # Combined locally-periodic kernel
    return sigma_sq * K_periodic * K_rbf

# Demonstration: Modeling daily temperature patterns
def demo_periodic_modeling():
    import matplotlib.pyplot as plt

    # Generate synthetic temperature data with daily periodicity
    np.random.seed(42)
    hours = np.linspace(0, 72, 73)  # 3 days, hourly

    # True underlying function: daily cycle with some noise
    true_temp = 20 + 8 * np.sin(2 * np.pi * hours / 24 - np.pi/2)  # Peak around noon
    observed = true_temp + np.random.randn(len(hours)) * 1.5

    # Sample observed points (missing some hours)
    train_idx = np.random.choice(len(hours), 40, replace=False)
    train_idx.sort()
    X_train = hours[train_idx].reshape(-1, 1)
    y_train = observed[train_idx]

    # Predict on full grid
    X_test = np.linspace(0, 72, 200).reshape(-1, 1)

    # Build kernel with known period (24 hours)
    K_train = periodic_kernel(X_train, X_train, sigma_sq=64.0, period=24.0, length_scale=3.0)
    K_train += 2.25 * np.eye(len(X_train))  # Noise variance
    K_test_train = periodic_kernel(X_test, X_train, sigma_sq=64.0, period=24.0, length_scale=3.0)
    K_test = periodic_kernel(X_test, X_test, sigma_sq=64.0, period=24.0, length_scale=3.0)

    # GP posterior
    L = np.linalg.cholesky(K_train)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_test_train @ alpha
    v = np.linalg.solve(L, K_test_train.T)
    var = np.diag(K_test) - np.sum(v**2, axis=0)
    std = np.sqrt(var)

    # Plot
    fig, ax = plt.subplots(figsize=(12, 5))
    ax.fill_between(X_test.flatten(), mean - 2*std, mean + 2*std,
                    alpha=0.3, label='95% CI')
    ax.plot(X_test, mean, 'b-', label='GP Posterior Mean')
    ax.plot(hours, true_temp, 'g--', alpha=0.7, label='True Function')
    ax.scatter(X_train, y_train, c='red', s=30, zorder=5, label='Observations')
    ax.set_xlabel('Hours')
    ax.set_ylabel('Temperature (°C)')
    ax.set_title('Periodic Kernel: Modeling Daily Temperature Cycles')
    ax.legend()
    ax.grid(True, alpha=0.3)
    return fig
```
Understanding the Period and Length-Scale Interaction
The periodic kernel has a subtle interplay between its parameters:
Period $p$: Determines where the pattern repeats. This is often known from domain knowledge (24 hours for daily, 365.25 days for yearly)
Length-scale $\ell$: Controls how smoothly the pattern varies within each period. Smaller $\ell$ allows more wiggly variations within the period; larger $\ell$ produces smoother within-period behavior
When $\ell \ll p$: The function can vary rapidly within each period.
When $\ell \gg p$: The function becomes nearly sinusoidal (only the fundamental frequency matters).
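A quick way to see this interplay is to draw prior samples with $\ell$ much smaller and much larger than the period and compare how wiggly one cycle looks. The sketch below is an illustration reusing the periodic_kernel function defined above; the particular values are arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt

X = np.linspace(0, 3, 400).reshape(-1, 1)    # three periods with p = 1
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

for ax, ell in zip(axes, [0.3, 5.0]):        # ell << p  vs  ell >> p
    K = periodic_kernel(X, X, sigma_sq=1.0, period=1.0, length_scale=ell)
    K += 1e-8 * np.eye(len(X))               # jitter for the Cholesky factor
    samples = np.linalg.cholesky(K) @ np.random.default_rng(0).standard_normal((len(X), 3))
    ax.plot(X, samples, alpha=0.8)
    ax.set_title(f'period p = 1, length-scale ℓ = {ell}')
    ax.set_xlabel('x')

plt.suptitle('Small ℓ: wiggly cycles; large ℓ: nearly sinusoidal cycles')
plt.tight_layout()
```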
Extensions and Variations
Locally Periodic: $k_{\text{LP}} = k_{\text{Periodic}} \times k_{\text{RBF}}$
Quasi-Periodic: Period itself varies smoothly
Multiple Periods: Sum of periodic kernels with different periods
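For instance, a multiple-periods kernel is simply a sum of periodic kernels. The sketch below is an illustration reusing periodic_kernel from the code above, with hypothetical daily and weekly components for hourly inputs; the variances and length-scales are placeholders, not fitted values.

```python
import numpy as np

def daily_plus_weekly_kernel(X1, X2):
    """Sum of two periodic kernels: a daily cycle plus a weekly cycle (inputs in hours)."""
    K_daily = periodic_kernel(X1, X2, sigma_sq=4.0, period=24.0, length_scale=2.0)
    K_weekly = periodic_kernel(X1, X2, sigma_sq=1.0, period=168.0, length_scale=10.0)
    return K_daily + K_weekly   # a sum of PSD kernels is itself PSD

hours = np.linspace(0, 24 * 14, 200).reshape(-1, 1)   # two weeks of hourly inputs
K = daily_plus_weekly_kernel(hours, hours)
print(K.shape)   # (200, 200)
```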
Beyond the RBF, Matérn, and Periodic kernels, several other stationary kernels serve specific purposes. Understanding their characteristics helps you select the right tool for specialized applications.
| Kernel | Smoothness | Parameters | Key Feature | Computational Cost |
|---|---|---|---|---|
| RBF | C^∞ | σ², ℓ | Infinite smoothness | Low |
| Matérn-5/2 | C^2 | σ², ℓ | Controllable smoothness | Low |
| Matérn-3/2 | C^1 | σ², ℓ | Once differentiable | Low |
| Matérn-1/2 | C^0 | σ², ℓ | Continuous only (rough) | Low |
| Periodic | C^∞ | σ², ℓ, p | Exact periodicity | Low |
| Rational Quadratic | C^∞ | σ², ℓ, α | Multi-scale smoothness | Low |
| Spectral Mixture | Varies | Many | Extreme flexibility | Medium-High |
The Rational Quadratic Kernel
The Rational Quadratic (RQ) kernel deserves special attention due to its unique interpretation:
$$k_{\text{RQ}}(r) = \sigma^2 \left(1 + \frac{r^2}{2\alpha\ell^2}\right)^{-\alpha}$$
The parameter $\alpha > 0$ controls how the different length-scales are mixed:
- Small $\alpha$ mixes a broad range of length-scales, allowing variation at many scales simultaneously
- As $\alpha \to \infty$, the RQ kernel converges to the RBF kernel with length-scale $\ell$
Mathematical Insight: The RQ kernel can be derived as a scale mixture of RBF kernels where the length-scale follows an inverse-gamma distribution. This makes it ideal for functions that exhibit multi-scale behavior—smooth in some regions, with finer variations in others.
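A minimal implementation of the RQ kernel, plus a numerical check of the $\alpha \to \infty$ limit, is sketched below (my own illustration, following the formula above; the grid and parameter values are arbitrary).

```python
import numpy as np
from scipy.spatial.distance import cdist

def rational_quadratic_kernel(X1, X2, sigma_sq=1.0, length_scale=1.0, alpha=1.0):
    """Rational Quadratic kernel: a scale mixture of RBF kernels over length-scales."""
    sq_dists = cdist(X1, X2, metric='sqeuclidean')
    return sigma_sq * (1 + sq_dists / (2 * alpha * length_scale**2)) ** (-alpha)

# As alpha grows, the RQ kernel approaches the RBF kernel with the same length-scale
X = np.linspace(0, 5, 50).reshape(-1, 1)
K_rbf = np.exp(-cdist(X, X, metric='sqeuclidean') / (2 * 1.0**2))
for alpha in [0.5, 2.0, 100.0]:
    K_rq = rational_quadratic_kernel(X, X, length_scale=1.0, alpha=alpha)
    print(f"alpha = {alpha:>6}: max |K_RQ - K_RBF| = {np.abs(K_rq - K_rbf).max():.4f}")
```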
Practical Guidance:
- Try the RQ kernel when residuals from an RBF or Matérn fit suggest structure at more than one length-scale
- If the learned $\alpha$ grows very large during optimization, the kernel has effectively collapsed to an RBF, and the simpler kernel can be substituted
Selecting the right kernel is both an art and a science. While hyperparameters can be learned from data, the kernel family encodes structural assumptions that must often come from domain knowledge or exploratory analysis.
A Systematic Selection Process
Stationary kernels are a good fit for:
✓ Homogeneous domains (no spatial variation in properties)
✓ Temporal data without trend or drift
✓ Functions where smoothness is constant
✓ As building blocks for more complex kernels

They are a poor fit for:
✗ Functions with varying smoothness
✗ Data with trends or drifts
✗ Boundaries with different behavior
✗ Multi-fidelity or hierarchical data
Common Pitfalls and How to Avoid Them
Over-Reliance on RBF: The RBF kernel is the 'Gaussian distribution of kernels'—simple and often good enough, but its infinite smoothness assumption is rarely accurate. Default to Matérn-5/2 instead.
Ignoring Periodicity: Failing to model known periodic structure leads to poor predictions. If you know the period, encode it!
Too Many Parameters: Complex kernels with many hyperparameters can overfit or be hard to optimize. Start simple and add complexity incrementally.
Forgetting Scaling: Input features should typically be normalized. A length-scale of 1.0 means different things for features ranging [0,1] vs [0,1000].
Not Visualizing Samples: Before fitting, sample from the prior to check if the kernel produces reasonable-looking functions. If prior samples don't resemble plausible data, reconsider your kernel choice.
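The last two pitfalls are cheap to guard against in code. Below is a small sketch (my own illustration; the feature range is made up) that standardizes inputs so a length-scale of 1.0 means "one standard deviation of the feature", then draws a few prior samples as a sanity check before any fitting.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)

# Hypothetical raw feature on a wide scale, e.g. values in [0, 1000]
X_raw = np.sort(rng.uniform(0, 1000, 150)).reshape(-1, 1)

# Standardize so length-scales are expressed in units of one standard deviation
X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

# Prior sanity check: do samples from the chosen kernel look like plausible data?
K = np.exp(-cdist(X, X, metric='sqeuclidean') / (2 * 1.0**2))  # RBF with ell = 1.0
K += 1e-8 * np.eye(len(X))                                     # jitter for stability
samples = np.linalg.cholesky(K) @ rng.standard_normal((len(X), 5))

plt.plot(X_raw, samples, alpha=0.7)
plt.xlabel('raw feature value')
plt.ylabel('prior sample f(x)')
plt.title('Prior samples after input standardization (RBF, ℓ = 1)')
```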
Stationary kernels form the foundation of most Gaussian Process applications. Their shift-invariance property—that correlations depend only on distances, not absolute positions—makes them versatile and interpretable.
What's Next:
Stationary kernels assume uniform properties across the input space. In the next page, we'll explore non-stationary kernels that can model functions whose characteristics vary with location—essential for handling trends, boundaries, and heterogeneous domains.
You now understand stationary kernels: their mathematical foundations, the major examples (RBF, Matérn, Periodic), and when to apply each. This knowledge forms the vocabulary for discussing GP priors and prepares you for the greater flexibility of non-stationary and composite kernels.