In the previous page, we established that a Gaussian Process is fully specified by two functions: a mean function $m(x)$ and a kernel function $k(x, x')$. These functions encode our prior beliefs about the unknown function $f$ before observing any data.
But how do we choose these functions? The answer lies at the intersection of mathematical rigor and domain expertise. A well-specified prior can dramatically improve predictions with limited data; a poorly specified prior can mislead us despite abundant observations.
This page explores the rich landscape of mean and kernel function design. We'll examine how different choices express different beliefs about function behavior—smoothness, periodicity, trends, and locality. By the end, you'll have the vocabulary and tools to construct GP priors tailored to specific problems.
By the end of this page, you will understand how to design mean functions that encode trend beliefs, how kernel functions control smoothness and correlation structure, the major families of kernels and their properties, and how to combine kernels to express complex prior beliefs. You will develop the judgment needed to select appropriate priors for real problems.
The mean function $m(x)$ specifies our prior expectation of the function value at each input:
$$\mathbb{E}[f(x)] = m(x)$$
This captures our "best guess" for $f(x)$ before seeing any data. Common choices include:
Zero Mean: $m(x) = 0$
The most common choice, and often reasonable. A zero mean function says: "Before seeing data, I expect the function to fluctuate around zero, with the kernel controlling the fluctuations."
This choice is justified when:

- the data have been centered (sample mean subtracted) before modeling,
- we have no prior information about the function's overall level, or
- the kernel is flexible enough that the posterior mean can move away from zero wherever the data demand it.
Constant Mean: $m(x) = c$
A simple generalization: $c$ becomes a hyperparameter to estimate. This shifts the function vertically: $$f(x) \sim \mathcal{GP}(c, k)$$
Linear Mean: $m(x) = \beta_0 + \beta_1 x$
When we believe the function has an underlying linear trend, we can encode it directly: $$f(x) \sim \mathcal{GP}(\beta_0 + \beta_1 x,\; k)$$
The GP then models deviations from this linear trend.
There's a philosophical question: should we model trend in the mean function or in the kernel? A linear mean explicitly separates trend from fluctuations. Alternatively, a kernel that doesn't decay to zero at infinity (like polynomial kernels) can capture trend. The choice affects interpretability and hyperparameter meaning, but with enough data, both approaches can fit the same functions.
```python
import numpy as np
import matplotlib.pyplot as plt

# Mean function implementations
def zero_mean(X):
    """Zero mean function: m(x) = 0"""
    return np.zeros(len(X))

def constant_mean(X, c=2.0):
    """Constant mean function: m(x) = c"""
    return c * np.ones(len(X))

def linear_mean(X, beta0=0.5, beta1=0.3):
    """Linear mean function: m(x) = β₀ + β₁x"""
    return beta0 + beta1 * X.ravel()

def quadratic_mean(X, beta0=0.0, beta1=0.0, beta2=0.1):
    """Quadratic mean function: m(x) = β₀ + β₁x + β₂x²"""
    x = X.ravel()
    return beta0 + beta1 * x + beta2 * x**2

# Kernel function for generating samples
def se_kernel_matrix(X, sigma=1.0, length_scale=1.0):
    """Squared Exponential kernel matrix"""
    n = len(X)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            sqdist = np.sum((X[i] - X[j]) ** 2)
            K[i, j] = sigma**2 * np.exp(-sqdist / (2 * length_scale**2))
    return K + 1e-8 * np.eye(n)

# Visualization
np.random.seed(42)
X = np.linspace(-5, 5, 200).reshape(-1, 1)

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
mean_functions = [
    ('Zero Mean: m(x) = 0', zero_mean),
    ('Constant Mean: m(x) = 2', lambda X: constant_mean(X, c=2.0)),
    ('Linear Mean: m(x) = 0.5 + 0.3x', lambda X: linear_mean(X, 0.5, 0.3)),
    ('Quadratic Mean: m(x) = 0.1x²', lambda X: quadratic_mean(X, 0, 0, 0.1))
]

K = se_kernel_matrix(X, sigma=0.8, length_scale=1.5)
L = np.linalg.cholesky(K)

for ax, (title, mean_func) in zip(axes.ravel(), mean_functions):
    mean = mean_func(X)
    std = np.sqrt(np.diag(K))

    # Sample functions
    samples = mean[:, np.newaxis] + L @ np.random.randn(len(X), 5)

    # Plot
    ax.fill_between(X.ravel(), mean - 2*std, mean + 2*std, alpha=0.15, color='C0')
    ax.plot(X, mean, 'C3-', linewidth=2, label='Mean function')
    for i in range(5):
        ax.plot(X, samples[:, i], alpha=0.6, linewidth=1)
    ax.set_title(title)
    ax.set_xlabel('x')
    ax.set_ylabel('f(x)')
    ax.legend()
    ax.grid(True, alpha=0.3)
    ax.set_ylim(-5, 5)

plt.suptitle('GP Priors with Different Mean Functions', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
```

Practical Guidelines for Mean Functions:
Start Simple: Zero mean is often sufficient. Only add complexity if you have strong prior beliefs or the problem demands it.
Consider Data Preprocessing: Often, it's cleaner to center and scale data, then use a zero-mean GP, rather than trying to capture scale through the mean function.
Mean Functions Add Parameters: A linear mean adds $d+1$ hyperparameters in $d$ dimensions. More complex means increase the hyperparameter burden.
Posterior Updates Mean: After conditioning on data, the posterior mean will deviate from the prior mean. With sufficient data, the prior mean matters less.
Physical Interpretation: In scientific applications, mean functions can encode known physics (e.g., a baseline behavior, an expected trend from theory).
The kernel function $k(x, x')$ is the heart of a Gaussian Process. While the mean function sets where we expect the function to be, the kernel determines how values at different inputs relate to each other—the correlation structure of the function.
Formal Properties:
A valid kernel must produce positive semi-definite (PSD) covariance matrices for any finite collection of inputs. That is, for any inputs $\{x_1, \ldots, x_n\}$, the matrix $K$ with entries $K_{ij} = k(x_i, x_j)$ must satisfy:
$$\mathbf{v}^T K \mathbf{v} \geq 0 \quad \forall \mathbf{v} \in \mathbb{R}^n$$
This ensures that the covariance matrix of any finite collection of function values is a valid covariance matrix. Mercer's theorem then tells us that any continuous PSD kernel can be expanded in terms of eigenfunctions, connecting kernels to (possibly infinite-dimensional) feature maps.
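To make the PSD requirement concrete, here is a minimal sketch (assuming NumPy; the SE kernel and the input grid are illustrative) that checks the eigenvalues of a kernel matrix — they should all be non-negative up to numerical round-off:

```python
import numpy as np

def se_kernel_matrix(X, sigma=1.0, length_scale=1.0):
    """Squared Exponential kernel matrix on a set of 1-D inputs."""
    diffs = X[:, None] - X[None, :]
    return sigma**2 * np.exp(-diffs**2 / (2 * length_scale**2))

X = np.linspace(-5, 5, 50)
K = se_kernel_matrix(X)

# Symmetric matrix: use eigvalsh. PSD means all eigenvalues >= 0
# (tiny negative values of order 1e-12 are numerical round-off).
eigenvalues = np.linalg.eigvalsh(K)
print("Smallest eigenvalue:", eigenvalues.min())
print("PSD (up to round-off):", eigenvalues.min() > -1e-10)
```

Running the same check on a hand-crafted "similarity" function that is not actually PSD is a quick way to catch an invalid kernel before it causes a Cholesky failure downstream.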
What the Kernel Controls: through the correlation structure, the kernel determines the smoothness of sample functions, the scale over which they vary, their typical amplitude, and structural properties such as periodicity or trends. The table below summarizes the key formal properties.
The kernel implicitly defines a space of functions—the Reproducing Kernel Hilbert Space (RKHS). Functions with small RKHS norm are 'typical' under the GP prior; functions with large norm are 'atypical.' This connection to functional analysis provides theoretical grounding for kernel methods.
| Property | Mathematical Condition | Effect on Functions | Example |
|---|---|---|---|
| Stationarity | $k(x, x') = k(x - x')$ | Translation invariance | SE, Matérn, Periodic |
| Isotropy | $k(x, x') = k(\|x - x'\|)$ | Rotation invariance | SE, Matérn (isotropic) |
| Infinite differentiability | $k$ infinitely differentiable at $0$ | Infinitely smooth sample functions | SE kernel |
| Finite differentiability | $k$ only finitely differentiable at $0$ | Finitely smooth sample functions | Matérn-$\nu$ kernel |
| Compact support | $k(x,x') = 0$ for $\|x-x'\| > r$ | Sparse kernel matrices | Wendland kernels |
Let's examine the most important kernel families in detail, understanding their properties and use cases.
1. Squared Exponential (SE) / Radial Basis Function (RBF):
$$k_{SE}(x, x') = \sigma^2 \exp\left( -\frac{|x - x'|^2}{2\ell^2} \right)$$
Properties:

- Sample functions are infinitely differentiable (extremely smooth).
- Stationary and isotropic: correlation depends only on the distance $|x - x'|$.
- Two hyperparameters: signal variance $\sigma^2$ and length scale $\ell$.
The SE kernel is the "default" choice but often too smooth for real data.
2. Matérn Family:
$$k_{\text{Matérn}}(r) = \sigma^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu} r}{\ell} \right)^{\nu} K_{\nu}\left( \frac{\sqrt{2\nu} r}{\ell} \right)$$
where $r = |x - x'|$ and $K_\nu$ is the modified Bessel function.
Properties:

- The smoothness parameter $\nu$ controls differentiability: sample functions are $\lceil \nu \rceil - 1$ times differentiable.
- $\nu = 1/2$ gives the exponential (Ornstein–Uhlenbeck) kernel with rough, non-differentiable samples; as $\nu \to \infty$ the Matérn kernel converges to the SE kernel.
- The common choices $\nu = 3/2$ and $\nu = 5/2$ have simple closed forms that avoid Bessel functions.
The Matérn-5/2 kernel is often a good default—smooth but not unrealistically so.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import gamma, kv  # For Matérn kernel

def se_kernel(x1, x2, sigma=1.0, length_scale=1.0):
    """Squared Exponential kernel"""
    r = np.linalg.norm(x1 - x2)
    return sigma**2 * np.exp(-r**2 / (2 * length_scale**2))

def matern_kernel(x1, x2, sigma=1.0, length_scale=1.0, nu=2.5):
    """Matérn kernel with smoothness parameter nu"""
    r = np.linalg.norm(x1 - x2)
    if r == 0:
        return sigma**2
    sqrt_2nu_r_l = np.sqrt(2 * nu) * r / length_scale
    coeff = sigma**2 * (2**(1-nu)) / gamma(nu)
    return coeff * (sqrt_2nu_r_l**nu) * kv(nu, sqrt_2nu_r_l)

def matern_12(x1, x2, sigma=1.0, length_scale=1.0):
    """Matérn-1/2 (Ornstein-Uhlenbeck / Exponential)"""
    r = np.linalg.norm(x1 - x2)
    return sigma**2 * np.exp(-r / length_scale)

def matern_32(x1, x2, sigma=1.0, length_scale=1.0):
    """Matérn-3/2 (once differentiable)"""
    r = np.linalg.norm(x1 - x2)
    sqrt3_r_l = np.sqrt(3) * r / length_scale
    return sigma**2 * (1 + sqrt3_r_l) * np.exp(-sqrt3_r_l)

def matern_52(x1, x2, sigma=1.0, length_scale=1.0):
    """Matérn-5/2 (twice differentiable)"""
    r = np.linalg.norm(x1 - x2)
    sqrt5_r_l = np.sqrt(5) * r / length_scale
    return sigma**2 * (1 + sqrt5_r_l + sqrt5_r_l**2/3) * np.exp(-sqrt5_r_l)

def periodic_kernel(x1, x2, sigma=1.0, length_scale=1.0, period=1.0):
    """Periodic kernel for repeating patterns"""
    r = np.linalg.norm(x1 - x2)
    return sigma**2 * np.exp(-2 * np.sin(np.pi * r / period)**2 / length_scale**2)

def build_kernel_matrix(X, kernel_func, **params):
    """Build kernel matrix from kernel function"""
    n = len(X)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel_func(X[i], X[j], **params)
    return K + 1e-8 * np.eye(n)

# Visualization
np.random.seed(42)
X = np.linspace(-5, 5, 200).reshape(-1, 1)

fig, axes = plt.subplots(3, 2, figsize=(14, 12))

kernels = [
    ('Squared Exponential', se_kernel, {'length_scale': 1.0}),
    ('Matérn-1/2 (Rough)', matern_12, {'length_scale': 1.0}),
    ('Matérn-3/2 (Once Diff.)', matern_32, {'length_scale': 1.0}),
    ('Matérn-5/2 (Twice Diff.)', matern_52, {'length_scale': 1.0}),
    ('Periodic (p=2)', periodic_kernel, {'length_scale': 0.5, 'period': 2.0}),
    ('SE Short Length (ℓ=0.3)', se_kernel, {'length_scale': 0.3}),
]

for ax, (name, kernel_func, params) in zip(axes.ravel(), kernels):
    K = build_kernel_matrix(X, kernel_func, sigma=1.0, **params)
    L = np.linalg.cholesky(K)
    samples = L @ np.random.randn(len(X), 5)
    for s in samples.T:
        ax.plot(X, s, alpha=0.7)
    ax.set_title(name)
    ax.set_xlabel('x')
    ax.set_ylabel('f(x)')
    ax.grid(True, alpha=0.3)
    ax.set_ylim(-3, 3)

plt.suptitle('GP Samples with Different Kernels', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
```

3. Periodic Kernel:
$$k_{Per}(x, x') = \sigma^2 \exp\left( -\frac{2\sin^2(\pi|x-x'|/p)}{\ell^2} \right)$$
Properties:

- Exactly periodic: $f(x)$ and $f(x + p)$ are perfectly correlated, so samples repeat with period $p$.
- The length scale $\ell$ controls how smooth the function is within each period.
- Useful for seasonal or otherwise repeating phenomena; often multiplied by an SE kernel so the pattern can drift over time.
4. Rational Quadratic (RQ):
$$k_{RQ}(x, x') = \sigma^2 \left( 1 + \frac{|x-x'|^2}{2\alpha\ell^2} \right)^{-\alpha}$$
Properties:

- Equivalent to an infinite scale mixture of SE kernels with different length scales.
- The parameter $\alpha$ controls the mixture: small $\alpha$ allows variation on many scales; as $\alpha \to \infty$ the RQ kernel converges to the SE kernel.
- Useful when the function varies on multiple length scales simultaneously.
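As a quick numerical check of the scale-mixture view (a minimal sketch assuming NumPy; the distance grid and hyperparameters are illustrative), the RQ kernel approaches the SE kernel as $\alpha$ grows:

```python
import numpy as np

def rq_kernel(r, sigma=1.0, length_scale=1.0, alpha=1.0):
    """Rational Quadratic kernel as a function of distance r."""
    return sigma**2 * (1 + r**2 / (2 * alpha * length_scale**2)) ** (-alpha)

def se_kernel(r, sigma=1.0, length_scale=1.0):
    """Squared Exponential kernel as a function of distance r."""
    return sigma**2 * np.exp(-r**2 / (2 * length_scale**2))

r = np.linspace(0, 5, 200)
for alpha in [0.5, 2.0, 10.0, 100.0]:
    # The maximum gap shrinks toward zero as alpha increases.
    max_gap = np.max(np.abs(rq_kernel(r, alpha=alpha) - se_kernel(r)))
    print(f"alpha = {alpha:6.1f}   max |k_RQ - k_SE| = {max_gap:.4f}")
```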
5. Linear (Polynomial) Kernels:
$$k_{Lin}(x, x') = \sigma_b^2 + \sigma_v^2 (x - c)(x' - c)$$
Properties:

- Non-stationary: the covariance depends on the inputs themselves, not just their separation, and grows with distance from the offset $c$.
- Sample functions are straight lines with random slope and intercept (GP regression with this kernel is Bayesian linear regression).
- Products of linear kernels yield polynomial kernels of higher degree.
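To illustrate the last point, here is a minimal sketch (assuming NumPy/Matplotlib; hyperparameter values are illustrative and a small jitter is added for numerical stability): squaring the linear kernel gives a quadratic kernel whose samples are roughly parabolic.

```python
import numpy as np
import matplotlib.pyplot as plt

def linear_kernel_matrix(X, sigma_b=1.0, sigma_v=0.5, c=0.0):
    """Linear kernel matrix on 1-D inputs."""
    x = X.ravel()
    return sigma_b**2 + sigma_v**2 * np.outer(x - c, x - c)

np.random.seed(0)
X = np.linspace(-5, 5, 100).reshape(-1, 1)

K_lin = linear_kernel_matrix(X)
K_quad = K_lin * K_lin            # elementwise product of kernels is a valid kernel

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
for ax, K, title in [(axes[0], K_lin, 'Linear kernel: line samples'),
                     (axes[1], K_quad, 'Linear × Linear: quadratic samples')]:
    L = np.linalg.cholesky(K + 1e-6 * np.eye(len(X)))
    samples = L @ np.random.randn(len(X), 5)
    ax.plot(X.ravel(), samples, alpha=0.7)
    ax.set_title(title)
    ax.set_xlabel('x')
plt.tight_layout()
plt.show()
```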
Every kernel family has hyperparameters that control its behavior. Understanding these parameters is essential for both specifying priors and interpreting learned models.
Signal Variance $\sigma^2$ (or $\sigma_f^2$):
The signal variance controls the amplitude of function variation. It appears as a multiplicative factor: $$k(x, x') = \sigma^2 \cdot \tilde{k}(x, x')$$
Length Scale $\ell$:
The length scale controls how quickly correlation decays with distance: a small $\ell$ means correlation drops off rapidly, producing wiggly, fast-varying functions, while a large $\ell$ keeps distant points correlated, producing smooth, slowly varying functions.
Intuitively, $\ell$ answers: "How far do I need to move in input space before the function value becomes essentially uncorrelated?"
For the SE kernel, the correlation between $f(x)$ and $f(x')$ equals $e^{-1/2} \approx 0.61$ when $|x - x'| = \ell$. This provides a concrete interpretation: $\ell$ is roughly the distance over which the function can change significantly. When choosing priors, ask: 'Over what input distance do I expect the function to vary meaningfully?'
```python
import numpy as np
import matplotlib.pyplot as plt

def se_kernel_matrix(X, sigma=1.0, length_scale=1.0):
    """SE kernel matrix construction"""
    n = len(X)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            r2 = np.sum((X[i] - X[j])**2)
            K[i, j] = sigma**2 * np.exp(-r2 / (2 * length_scale**2))
    return K + 1e-8 * np.eye(n)

np.random.seed(42)
X = np.linspace(-5, 5, 150).reshape(-1, 1)

# Effect of hyperparameters
fig, axes = plt.subplots(2, 3, figsize=(15, 8))

# Row 1: Varying length scale (fixed sigma=1.0)
length_scales = [0.3, 1.0, 3.0]
for ax, ls in zip(axes[0], length_scales):
    K = se_kernel_matrix(X, sigma=1.0, length_scale=ls)
    L = np.linalg.cholesky(K)
    samples = L @ np.random.randn(len(X), 5)
    for s in samples.T:
        ax.plot(X, s, alpha=0.7)
    ax.set_title(f'Length Scale ℓ = {ls}')
    ax.set_xlabel('x')
    ax.set_ylabel('f(x)')
    ax.set_ylim(-4, 4)
    ax.grid(True, alpha=0.3)

# Row 2: Varying sigma (fixed length_scale=1.0)
sigmas = [0.5, 1.0, 2.0]
for ax, sig in zip(axes[1], sigmas):
    K = se_kernel_matrix(X, sigma=sig, length_scale=1.0)
    L = np.linalg.cholesky(K)
    samples = L @ np.random.randn(len(X), 5)
    for s in samples.T:
        ax.plot(X, s, alpha=0.7)
    ax.set_title(f'Signal Std. Dev. σ = {sig}')
    ax.set_xlabel('x')
    ax.set_ylabel('f(x)')
    ax.set_ylim(-6, 6)
    ax.grid(True, alpha=0.3)

plt.suptitle('Effect of Hyperparameters on GP Samples', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Visualization of correlation decay
fig, ax = plt.subplots(figsize=(10, 6))
distances = np.linspace(0, 5, 100)

for ls in [0.5, 1.0, 2.0]:
    correlation = np.exp(-distances**2 / (2 * ls**2))
    ax.plot(distances, correlation, linewidth=2, label=f'ℓ = {ls}')
    # Mark length scale
    ax.axhline(y=np.exp(-0.5), color='gray', linestyle='--', alpha=0.3)
    ax.axvline(x=ls, color='gray', linestyle=':', alpha=0.3)

ax.set_xlabel("Distance |x - x'|")
ax.set_ylabel('Correlation')
ax.set_title('Correlation Decay for Different Length Scales')
ax.legend()
ax.grid(True, alpha=0.3)
ax.annotate('Correlation = 0.61 (at ℓ)', xy=(1.0, np.exp(-0.5)), xytext=(2.5, 0.5),
            arrowprops=dict(arrowstyle='->', color='gray'))
plt.show()
```

Anisotropic Length Scales (ARD):
In multi-dimensional inputs, we can assign different length scales to each dimension:
$$k(\mathbf{x}, \mathbf{x}') = \sigma^2 \exp\left( -\frac{1}{2} \sum_{d=1}^{D} \frac{(x_d - x_d')^2}{\ell_d^2} \right)$$
This is called Automatic Relevance Determination (ARD):
ARD provides automatic feature selection: irrelevant dimensions will have their length scales pushed to infinity during optimization, effectively removing them from the model.
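A minimal sketch of the ARD form (assuming NumPy; the toy inputs and length-scale values are illustrative): with one length scale per dimension, differences along a dimension with a very long length scale are effectively ignored.

```python
import numpy as np

def ard_se_kernel(X1, X2, sigma=1.0, length_scales=(1.0, 1.0)):
    """ARD Squared Exponential kernel between two sets of D-dimensional inputs."""
    ls = np.asarray(length_scales)  # one length scale per dimension
    scaled_sq_dists = np.sum(
        ((X1[:, None, :] - X2[None, :, :]) / ls) ** 2, axis=-1
    )
    return sigma**2 * np.exp(-0.5 * scaled_sq_dists)

# Two 2-D inputs that differ only along dimension 1
x_a = np.array([[0.0, 0.0]])
x_b = np.array([[0.0, 3.0]])

# Short length scale in dim 1 -> the difference matters -> correlation near 0
print(ard_se_kernel(x_a, x_b, length_scales=(1.0, 0.5)))
# Huge length scale in dim 1 -> dimension effectively ignored -> correlation near 1
print(ard_se_kernel(x_a, x_b, length_scales=(1.0, 100.0)))
```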
| Hyperparameter | Symbol | Range | Effect When Increased |
|---|---|---|---|
| Signal Variance | $\sigma^2$ | $(0, \infty)$ | Larger function magnitude; wider confidence bands |
| Length Scale | $\ell$ | $(0, \infty)$ | Smoother functions; slower variation |
| Noise Variance | $\sigma_n^2$ | $(0, \infty)$ | More smoothing; the fit passes less exactly through the data |
| Matérn Smoothness | $\nu$ | $(0, \infty)$ | More differentiable functions |
| Period | $p$ | $(0, \infty)$ | Longer period for periodic patterns |
One of the most powerful aspects of kernel methods is that we can combine kernels to express complex prior beliefs. Just as we can build complex probability distributions from simpler ones, we can construct expressive kernels from basic building blocks.
Kernel Closure Properties:
If $k_1$ and $k_2$ are valid (PSD) kernels, so are:
1. Sum: $k(x, x') = k_1(x, x') + k_2(x, x')$
2. Product: $k(x, x') = k_1(x, x') \cdot k_2(x, x')$
3. Scaling: $k(x, x') = c \cdot k_1(x, x')$ for constant $c > 0$
4. Composition with function: $k(x, x') = k_1(g(x), g(x'))$ for any function $g$
Addition says: 'The function is a sum of independent components with different characteristics.' Multiplication says: 'The function must satisfy constraints from both kernels simultaneously.' Addition tends to create more complex functions; multiplication creates more constrained ones. A product of kernels is only as smooth as its least smooth factor.
```python
import numpy as np
import matplotlib.pyplot as plt

def se_kernel(x1, x2, sigma=1.0, length_scale=1.0):
    r2 = np.sum((x1 - x2)**2)
    return sigma**2 * np.exp(-r2 / (2 * length_scale**2))

def periodic_kernel(x1, x2, sigma=1.0, length_scale=1.0, period=1.0):
    r = np.abs(x1 - x2).sum()
    return sigma**2 * np.exp(-2 * np.sin(np.pi * r / period)**2 / length_scale**2)

def linear_kernel(x1, x2, sigma_b=1.0, sigma_v=1.0, c=0.0):
    return sigma_b**2 + sigma_v**2 * (x1 - c) * (x2 - c)

def build_kernel_matrix(X, kernel_func, **params):
    n = len(X)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel_func(X[i].item(), X[j].item(), **params)
    return K + 1e-8 * np.eye(n)

# Combined kernels
def linear_plus_se(x1, x2):
    """Linear trend + smooth deviations"""
    return linear_kernel(x1, x2, sigma_v=0.3) + se_kernel(x1, x2, sigma=1.0, length_scale=1.0)

def se_times_periodic(x1, x2):
    """Locally periodic: periodic structure that can evolve"""
    return se_kernel(x1, x2, sigma=1.0, length_scale=3.0) * periodic_kernel(x1, x2, length_scale=0.5, period=2.0)

def se_plus_periodic(x1, x2):
    """Smooth + periodic: long-term trend with seasonal component"""
    return se_kernel(x1, x2, sigma=1.0, length_scale=2.0) + periodic_kernel(x1, x2, sigma=0.5, length_scale=0.5, period=1.5)

def se_plus_se(x1, x2):
    """Multi-scale: both local and global structure"""
    return se_kernel(x1, x2, sigma=0.5, length_scale=0.5) + se_kernel(x1, x2, sigma=1.0, length_scale=3.0)

# Visualization
np.random.seed(42)
X = np.linspace(-5, 5, 200).reshape(-1, 1)

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

combos = [
    ('Linear + SE (trend + deviations)', linear_plus_se),
    ('SE × Periodic (evolving periodicity)', se_times_periodic),
    ('SE + Periodic (trend + seasonal)', se_plus_periodic),
    ('SE(short) + SE(long) (multi-scale)', se_plus_se),
]

for ax, (name, kernel_func) in zip(axes.ravel(), combos):
    K = build_kernel_matrix(X, kernel_func)
    L = np.linalg.cholesky(K)
    samples = L @ np.random.randn(len(X), 5)
    for s in samples.T:
        ax.plot(X, s, alpha=0.7, linewidth=1.5)
    ax.set_title(name)
    ax.set_xlabel('x')
    ax.set_ylabel('f(x)')
    ax.grid(True, alpha=0.3)

plt.suptitle('GP Samples with Composite Kernels', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
```

Common Composite Kernels:
Long-term + Short-term (Multi-scale): $$k = k_{SE}(\ell_{long}) + k_{SE}(\ell_{short})$$ Captures both global trends and local wiggles.
Trend + Seasonality: $$k = k_{Linear} + k_{Periodic}$$ Common in time series: linear growth with periodic fluctuations.
Locally Periodic: $$k = k_{SE} \times k_{Periodic}$$ A periodic pattern whose shape and amplitude can slowly evolve over time.
Non-stationary via Warping: $$k(x, x') = k_{SE}(g(x), g(x'))$$ Where $g$ is a warping function (e.g., log, sigmoid).
Additive Structure: $$k(\mathbf{x}, \mathbf{x}') = \sum_d k_d(x_d, x_d')$$ For high-dimensional inputs where the function is a sum of univariate functions.
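To make the warping idea concrete, here is a minimal sketch (assuming NumPy/Matplotlib; the choice of $g(x) = \log(x)$ and all hyperparameters are illustrative): applying the SE kernel to log-transformed inputs produces functions that vary quickly for small $x$ and slowly for large $x$.

```python
import numpy as np
import matplotlib.pyplot as plt

def se_kernel_matrix(Z, sigma=1.0, length_scale=1.0):
    """SE kernel matrix on (possibly warped) 1-D inputs."""
    diffs = Z[:, None] - Z[None, :]
    return sigma**2 * np.exp(-diffs**2 / (2 * length_scale**2)) + 1e-8 * np.eye(len(Z))

np.random.seed(0)
X = np.linspace(0.1, 10, 200)          # strictly positive inputs for the log warp

K_plain = se_kernel_matrix(X, length_scale=1.0)            # stationary SE
K_warped = se_kernel_matrix(np.log(X), length_scale=0.3)   # SE on g(x) = log(x)

fig, axes = plt.subplots(1, 2, figsize=(12, 4), sharey=True)
for ax, K, title in [(axes[0], K_plain, 'Stationary SE'),
                     (axes[1], K_warped, 'SE on log-warped inputs')]:
    L = np.linalg.cholesky(K)
    samples = L @ np.random.randn(len(X), 4)
    ax.plot(X, samples, alpha=0.7)
    ax.set_title(title)
    ax.set_xlabel('x')
plt.tight_layout()
plt.show()
```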
The true power of GP prior specification lies in encoding domain knowledge. The kernel is a language for expressing what we believe about the unknown function before seeing data.
Smoothness Beliefs: How many times do you expect the function to be differentiable? Very smooth behavior suggests the SE kernel; rough or only finitely differentiable behavior suggests a Matérn kernel with small $\nu$.
Scale Beliefs: What is the typical amplitude of the function, and over what input distance does it change meaningfully? These beliefs translate directly into the signal variance $\sigma^2$ and the length scale $\ell$.
Structural Beliefs: Does the function have trends, periodicity, or additive structure across inputs? These call for linear, periodic, or sum kernels respectively.
When specifying priors, resist the temptation to choose kernels that will fit the data well. The prior should encode beliefs you hold before seeing data. If you peek at the data and then choose a kernel 'because it looks like it will work,' you're doing informal model selection, not prior specification. This distinction matters for principled Bayesian inference and for honest uncertainty quantification.
| Domain Belief | Kernel Choice | Hyperparameter Setting |
|---|---|---|
| Physical processes are continuous | Matérn-5/2 or SE | Standard length scale from physics |
| Time series with daily seasonality | Periodic × SE | Period = 24 hours; SE for drift |
| Stock prices (random walk-like) | Matérn-1/2 + noise | Short length scale; variance matching historical volatility |
| Sensor noise (rough, continuous) | Matérn-3/2 | Short length scale for rapid variation |
| Spatial interpolation | SE with ARD | Separate length scales per coordinate |
| Multi-fidelity experiments | Sum/product of kernels | Different kernels for different fidelities |
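As one concrete way to express the "daily seasonality" row in code, here is a sketch assuming scikit-learn is available; all hyperparameter values are illustrative, not recommendations:

```python
# Trend + daily seasonality prior expressed with scikit-learn kernel objects.
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ExpSineSquared, WhiteKernel, ConstantKernel

# Slow drift: SE (RBF) with a long length scale, scaled by a signal variance.
drift = ConstantKernel(1.0) * RBF(length_scale=100.0)

# Daily seasonality: periodic kernel with period 24 (hours),
# multiplied by a wide RBF so the seasonal pattern can evolve slowly.
seasonal = (ConstantKernel(1.0) * ExpSineSquared(length_scale=1.0, periodicity=24.0)
            * RBF(length_scale=500.0))

# Observation noise.
noise = WhiteKernel(noise_level=1.0)

kernel = drift + seasonal + noise
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
print(kernel)  # inspect the composite prior before fitting
```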
A Principled Workflow for Prior Specification:
List your beliefs: Before looking at data, write down what you expect. Is the function smooth? Does it have trends? Seasonality? What's the typical magnitude?
Translate to kernels: Map each belief to a kernel component. Combine with sums (additive structure) or products (interactive structure).
Set hyperparameter ranges: Based on domain knowledge, determine plausible ranges for each hyperparameter.
Visualize prior samples: Draw samples from your GP prior. Do they look reasonable? If not, revise.
Document your choices: For reproducibility and scientific rigor, record why you made each choice.
Plan for learning: Decide which hyperparameters to estimate from data (optimize or marginalize) vs. fix from domain knowledge.
One of the most valuable practices in GP modeling is prior predictive checking: sampling functions from your prior and examining whether they look plausible before seeing any data.
This practice catches several common issues:
1. Unreasonable Smoothness: samples are far smoother, or far rougher, than any function you would expect in the domain.
2. Wrong Scale: function values, or the distances over which they vary, are orders of magnitude too large or too small for the quantity being modeled.
3. Missing Structure: expected features such as periodicity or trends are absent from the samples.
4. Wrong Complexity: the prior allows far more flexibility than the problem warrants, or is too rigid to express plausible functions.
```python
import numpy as np
import matplotlib.pyplot as plt

def comprehensive_prior_check(X, kernel_func, n_samples=20, title="Prior Check"):
    """
    Comprehensive prior predictive check for a GP.
    Plots samples, mean, and credible intervals.
    """
    n = len(X)

    # Build kernel matrix
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel_func(X[i].item(), X[j].item())
    K += 1e-8 * np.eye(n)

    # Sample from prior
    L = np.linalg.cholesky(K)
    samples = L @ np.random.randn(n, n_samples)

    # Compute statistics
    prior_mean = np.zeros(n)  # Zero mean GP
    prior_std = np.sqrt(np.diag(K))
    sample_percentiles = np.percentile(samples, [5, 25, 50, 75, 95], axis=1)

    # Create figure
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))

    # Plot 1: All samples
    ax = axes[0, 0]
    for i in range(min(10, n_samples)):
        ax.plot(X, samples[:, i], alpha=0.5)
    ax.axhline(0, color='k', linestyle='--', alpha=0.3)
    ax.set_title('Prior Samples (10 shown)')
    ax.set_xlabel('x')
    ax.set_ylabel('f(x)')
    ax.grid(True, alpha=0.3)

    # Plot 2: Mean and credible bands (from samples)
    ax = axes[0, 1]
    ax.fill_between(X.ravel(), sample_percentiles[0], sample_percentiles[4],
                    alpha=0.15, label='90% CI', color='C0')
    ax.fill_between(X.ravel(), sample_percentiles[1], sample_percentiles[3],
                    alpha=0.25, label='50% CI', color='C0')
    ax.plot(X, sample_percentiles[2], 'C0-', linewidth=2, label='Median')
    ax.plot(X, prior_mean, 'k--', linewidth=1, label='Prior Mean')
    ax.set_title('Prior Predictive Distribution')
    ax.legend()
    ax.set_xlabel('x')
    ax.set_ylabel('f(x)')
    ax.grid(True, alpha=0.3)

    # Plot 3: Histogram of function values at a single point
    ax = axes[1, 0]
    mid_idx = n // 2
    ax.hist(samples[mid_idx, :], bins=20, density=True, alpha=0.7, color='C1')
    x_vals = np.linspace(samples[mid_idx, :].min(), samples[mid_idx, :].max(), 100)
    ax.plot(x_vals,
            np.exp(-x_vals**2 / (2 * prior_std[mid_idx]**2)) / (prior_std[mid_idx] * np.sqrt(2*np.pi)),
            'k-', linewidth=2)
    ax.set_title(f'Marginal at x = {X[mid_idx].item():.2f}')
    ax.set_xlabel('f(x)')
    ax.set_ylabel('Density')
    ax.grid(True, alpha=0.3)

    # Plot 4: Kernel matrix visualization
    ax = axes[1, 1]
    im = ax.imshow(K, cmap='viridis', aspect='auto')
    plt.colorbar(im, ax=ax)
    ax.set_title('Kernel Matrix K')
    ax.set_xlabel('Input Index i')
    ax.set_ylabel('Input Index j')

    plt.suptitle(title, fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()

    # Print summary statistics
    print(f"\n{title} - Summary Statistics:")
    print(f"  Prior variance at each point: {prior_std[0]:.4f}²")
    print(f"  Sample range: [{samples.min():.3f}, {samples.max():.3f}]")
    print(f"  90% of samples between: [{sample_percentiles[0].min():.3f}, {sample_percentiles[4].max():.3f}]")

# Example: Check different priors
np.random.seed(42)
X = np.linspace(-5, 5, 100).reshape(-1, 1)

# Prior 1: Good for smooth, bounded functions
def kernel_smooth(x1, x2):
    return np.exp(-(x1 - x2)**2 / (2 * 2.0**2))

# Prior 2: Good for rough, rapidly varying functions
def kernel_rough(x1, x2):
    r = np.abs(x1 - x2)
    return np.exp(-r / 0.5)  # Matérn-1/2

# Prior 3: Good for periodic phenomena
def kernel_seasonal(x1, x2):
    r = np.abs(x1 - x2)
    se = np.exp(-r**2 / (2 * 5**2))
    per = np.exp(-2 * np.sin(np.pi * r / 2)**2 / 0.5**2)
    return se * per

comprehensive_prior_check(X, kernel_smooth, title="Smooth Prior (SE, ℓ=2)")
comprehensive_prior_check(X, kernel_rough, title="Rough Prior (Matérn-1/2, ℓ=0.5)")
comprehensive_prior_check(X, kernel_seasonal, title="Seasonal Prior (SE × Periodic)")
```

If you don't like the prior samples, you won't like the posterior either. The prior shapes the posterior, especially with limited data.
Always visualize prior samples before fitting. If they look unreasonable, revise the prior before proceeding.
We've explored the rich landscape of GP prior specification, learning how to encode domain knowledge through mean functions and kernels.
What's Next:
With prior specification mastered, we're ready for the heart of GP inference: posterior inference. Given data, how do we update our beliefs about the unknown function? The next page derives the closed-form posterior distribution and shows how conditioning on observations transforms our prior into predictions.
You now understand how to specify GP priors through careful choice of mean functions and kernels. This skill—translating domain knowledge into prior distributions—is fundamental to effective Bayesian modeling. Next, we'll learn how to update these priors with data through posterior inference.