In the previous page, we established that a Gaussian Process is fully specified by two functions: a mean function $m(x)$ and a kernel function $k(x, x')$. These functions encode our prior beliefs about the unknown function $f$ before observing any data.
But how do we choose these functions? The answer lies at the intersection of mathematical rigor and domain expertise. A well-specified prior can dramatically improve predictions with limited data; a poorly specified prior can mislead us despite abundant observations.
This page explores the rich landscape of mean and kernel function design. We'll examine how different choices express different beliefs about function behavior—smoothness, periodicity, trends, and locality. By the end, you'll have the vocabulary and tools to construct GP priors tailored to specific problems.
By the end of this page, you will understand how to design mean functions that encode trend beliefs, how kernel functions control smoothness and correlation structure, the major families of kernels and their properties, and how to combine kernels to express complex prior beliefs. You will develop the judgment needed to select appropriate priors for real problems.
The mean function $m(x)$ specifies our prior expectation of the function value at each input:
$$\mathbb{E}[f(x)] = m(x)$$
This captures our "best guess" for $f(x)$ before seeing any data. Common choices include:
Zero Mean: $m(x) = 0$
The most common choice, and often reasonable. A zero mean function says: "Before seeing data, I expect the function to fluctuate around zero, with the kernel controlling the fluctuations."
This choice is justified when:

- the data have been centered (sample mean subtracted) before modeling,
- we have no prior information about the function's overall level, or
- the kernel is flexible enough that the posterior mean can move away from zero wherever the data demand it.
Constant Mean: $m(x) = c$
A simple generalization: $c$ becomes a hyperparameter to estimate. This shifts the function vertically: $$f(x) \sim \mathcal{GP}(c, k)$$
Linear Mean: $m(x) = \beta_0 + \beta_1 x$
When we believe the function has an underlying linear trend, we can encode it directly: $$f(x) \sim \mathcal{GP}(\beta_0 + \beta_1 x,\; k)$$
The GP then models deviations from this linear trend.
There's a philosophical question: should we model trend in the mean function or in the kernel? A linear mean explicitly separates trend from fluctuations. Alternatively, a kernel that doesn't decay to zero at infinity (like polynomial kernels) can capture trend. The choice affects interpretability and hyperparameter meaning, but with enough data, both approaches can fit the same functions.
```python
import numpy as np
import matplotlib.pyplot as plt

# Mean function implementations
def zero_mean(X):
    """Zero mean function: m(x) = 0"""
    return np.zeros(len(X))

def constant_mean(X, c=2.0):
    """Constant mean function: m(x) = c"""
    return c * np.ones(len(X))

def linear_mean(X, beta0=0.5, beta1=0.3):
    """Linear mean function: m(x) = β₀ + β₁x"""
    return beta0 + beta1 * X.ravel()

def quadratic_mean(X, beta0=0.0, beta1=0.0, beta2=0.1):
    """Quadratic mean function: m(x) = β₀ + β₁x + β₂x²"""
    x = X.ravel()
    return beta0 + beta1 * x + beta2 * x**2

# Kernel function for generating samples
def se_kernel_matrix(X, sigma=1.0, length_scale=1.0):
    """Squared Exponential kernel matrix"""
    n = len(X)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            sqdist = np.sum((X[i] - X[j]) ** 2)
            K[i, j] = sigma**2 * np.exp(-sqdist / (2 * length_scale**2))
    return K + 1e-8 * np.eye(n)

# Visualization
np.random.seed(42)
X = np.linspace(-5, 5, 200).reshape(-1, 1)

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
mean_functions = [
    ('Zero Mean: m(x) = 0', zero_mean),
    ('Constant Mean: m(x) = 2', lambda X: constant_mean(X, c=2.0)),
    ('Linear Mean: m(x) = 0.5 + 0.3x', lambda X: linear_mean(X, 0.5, 0.3)),
    ('Quadratic Mean: m(x) = 0.1x²', lambda X: quadratic_mean(X, 0, 0, 0.1))
]

K = se_kernel_matrix(X, sigma=0.8, length_scale=1.5)
L = np.linalg.cholesky(K)

for ax, (title, mean_func) in zip(axes.ravel(), mean_functions):
    mean = mean_func(X)
    std = np.sqrt(np.diag(K))

    # Sample functions
    samples = mean[:, np.newaxis] + L @ np.random.randn(len(X), 5)

    # Plot
    ax.fill_between(X.ravel(), mean - 2*std, mean + 2*std, alpha=0.15, color='C0')
    ax.plot(X, mean, 'C3-', linewidth=2, label='Mean function')
    for i in range(5):
        ax.plot(X, samples[:, i], alpha=0.6, linewidth=1)
    ax.set_title(title)
    ax.set_xlabel('x')
    ax.set_ylabel('f(x)')
    ax.legend()
    ax.grid(True, alpha=0.3)
    ax.set_ylim(-5, 5)

plt.suptitle('GP Priors with Different Mean Functions', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
```

Practical Guidelines for Mean Functions:
Start Simple: Zero mean is often sufficient. Only add complexity if you have strong prior beliefs or the problem demands it.
Consider Data Preprocessing: Often, it's cleaner to center and scale data, then use a zero-mean GP, rather than trying to capture scale through the mean function.
Mean Functions Add Parameters: A linear mean adds $d+1$ hyperparameters in $d$ dimensions. More complex means increase the hyperparameter burden.
Posterior Updates Mean: After conditioning on data, the posterior mean will deviate from the prior mean. With sufficient data, the prior mean matters less.
Physical Interpretation: In scientific applications, mean functions can encode known physics (e.g., a baseline behavior, an expected trend from theory).
The kernel function $k(x, x')$ is the heart of a Gaussian Process. While the mean function sets where we expect the function to be, the kernel determines how values at different inputs relate to each other—the correlation structure of the function.
Formal Properties:
A valid kernel must produce positive semi-definite (PSD) covariance matrices for any finite collection of inputs. That is, for any inputs $\{x_1, \ldots, x_n\}$, the matrix $K$ with entries $K_{ij} = k(x_i, x_j)$ must satisfy:
$$\mathbf{v}^T K \mathbf{v} \geq 0 \quad \forall \mathbf{v} \in \mathbb{R}^n$$
This ensures that the covariance matrix of any finite collection of function values is a valid covariance matrix. Mercer's theorem then tells us that any continuous PSD kernel can be expanded in terms of eigenfunctions, connecting kernels to (possibly infinite-dimensional) feature maps.
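To make the PSD requirement concrete, here is a minimal sketch (assuming NumPy; the SE kernel and the input grid are illustrative) that checks the eigenvalues of a kernel matrix — they should all be non-negative up to numerical round-off:

```python
import numpy as np

def se_kernel_matrix(X, sigma=1.0, length_scale=1.0):
    """Squared Exponential kernel matrix on a set of 1-D inputs."""
    diffs = X[:, None] - X[None, :]
    return sigma**2 * np.exp(-diffs**2 / (2 * length_scale**2))

X = np.linspace(-5, 5, 50)
K = se_kernel_matrix(X)

# Symmetric matrix: use eigvalsh. PSD means all eigenvalues >= 0
# (tiny negative values of order 1e-12 are numerical round-off).
eigenvalues = np.linalg.eigvalsh(K)
print("Smallest eigenvalue:", eigenvalues.min())
print("PSD (up to round-off):", eigenvalues.min() > -1e-10)
```

Running the same check on a hand-crafted "similarity" function that is not actually PSD is a quick way to catch an invalid kernel before it causes a Cholesky failure downstream.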
What the Kernel Controls: through the correlation structure, the kernel determines the smoothness of sample functions, the scale over which they vary, their typical amplitude, and structural properties such as periodicity or trends. The table below summarizes the key formal properties.
The kernel implicitly defines a space of functions—the Reproducing Kernel Hilbert Space (RKHS). Functions with small RKHS norm are 'typical' under the GP prior; functions with large norm are 'atypical.' This connection to functional analysis provides theoretical grounding for kernel methods.
| Property | Mathematical Condition | Effect on Functions | Example |
|---|---|---|---|
| Stationarity | $k(x, x') = k(x - x')$ | Translation invariance | SE, Matérn, Periodic |
| Isotropy | $k(x, x') = k(\|x - x'\|)$ | Rotation invariance | SE, Matérn (isotropic) |
| Infinite differentiability | $k$ infinitely differentiable at $0$ | Infinitely smooth sample functions | SE kernel |
| Finite differentiability | $k$ only finitely differentiable at $0$ | Finitely smooth sample functions | Matérn-$\nu$ kernel |
| Compact support | $k(x,x') = 0$ for $\|x-x'\| > r$ | Sparse kernel matrices | Wendland kernels |
Let's examine the most important kernel families in detail, understanding their properties and use cases.
1. Squared Exponential (SE) / Radial Basis Function (RBF):
$$k_{SE}(x, x') = \sigma^2 \exp\left( -\frac{|x - x'|^2}{2\ell^2} \right)$$
Properties:

- Sample functions are infinitely differentiable (extremely smooth).
- Stationary and isotropic: correlation depends only on the distance $|x - x'|$.
- Two hyperparameters: signal variance $\sigma^2$ and length scale $\ell$.
The SE kernel is the "default" choice but often too smooth for real data.
2. Matérn Family:
$$k_{\text{Matérn}}(r) = \sigma^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu} r}{\ell} \right)^{\nu} K_{\nu}\left( \frac{\sqrt{2\nu} r}{\ell} \right)$$
where $r = |x - x'|$ and $K_\nu$ is the modified Bessel function.
Properties:

- The smoothness parameter $\nu$ controls differentiability: sample functions are $\lceil \nu \rceil - 1$ times differentiable.
- $\nu = 1/2$ gives the exponential (Ornstein–Uhlenbeck) kernel with rough, non-differentiable samples; as $\nu \to \infty$ the Matérn kernel converges to the SE kernel.
- The common choices $\nu = 3/2$ and $\nu = 5/2$ have simple closed forms that avoid Bessel functions.
The Matérn-5/2 kernel is often a good default—smooth but not unrealistically so.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import gamma, kv  # For Matérn kernel

def se_kernel(x1, x2, sigma=1.0, length_scale=1.0):
    """Squared Exponential kernel"""
    r = np.linalg.norm(x1 - x2)
    return sigma**2 * np.exp(-r**2 / (2 * length_scale**2))

def matern_kernel(x1, x2, sigma=1.0, length_scale=1.0, nu=2.5):
    """Matérn kernel with smoothness parameter nu"""
    r = np.linalg.norm(x1 - x2)
    if r == 0:
        return sigma**2
    sqrt_2nu_r_l = np.sqrt(2 * nu) * r / length_scale
    coeff = sigma**2 * (2**(1-nu)) / gamma(nu)
    return coeff * (sqrt_2nu_r_l**nu) * kv(nu, sqrt_2nu_r_l)

def matern_12(x1, x2, sigma=1.0, length_scale=1.0):
    """Matérn-1/2 (Ornstein-Uhlenbeck / Exponential)"""
    r = np.linalg.norm(x1 - x2)
    return sigma**2 * np.exp(-r / length_scale)

def matern_32(x1, x2, sigma=1.0, length_scale=1.0):
    """Matérn-3/2 (once differentiable)"""
    r = np.linalg.norm(x1 - x2)
    sqrt3_r_l = np.sqrt(3) * r / length_scale
    return sigma**2 * (1 + sqrt3_r_l) * np.exp(-sqrt3_r_l)

def matern_52(x1, x2, sigma=1.0, length_scale=1.0):
    """Matérn-5/2 (twice differentiable)"""
    r = np.linalg.norm(x1 - x2)
    sqrt5_r_l = np.sqrt(5) * r / length_scale
    return sigma**2 * (1 + sqrt5_r_l + sqrt5_r_l**2/3) * np.exp(-sqrt5_r_l)

def periodic_kernel(x1, x2, sigma=1.0, length_scale=1.0, period=1.0):
    """Periodic kernel for repeating patterns"""
    r = np.linalg.norm(x1 - x2)
    return sigma**2 * np.exp(-2 * np.sin(np.pi * r / period)**2 / length_scale**2)

def build_kernel_matrix(X, kernel_func, **params):
    """Build kernel matrix from kernel function"""
    n = len(X)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel_func(X[i], X[j], **params)
    return K + 1e-8 * np.eye(n)

# Visualization
np.random.seed(42)
X = np.linspace(-5, 5, 200).reshape(-1, 1)

fig, axes = plt.subplots(3, 2, figsize=(14, 12))

kernels = [
    ('Squared Exponential', se_kernel, {'length_scale': 1.0}),
    ('Matérn-1/2 (Rough)', matern_12, {'length_scale': 1.0}),
    ('Matérn-3/2 (Once Diff.)', matern_32, {'length_scale': 1.0}),
    ('Matérn-5/2 (Twice Diff.)', matern_52, {'length_scale': 1.0}),
    ('Periodic (p=2)', periodic_kernel, {'length_scale': 0.5, 'period': 2.0}),
    ('SE Short Length (ℓ=0.3)', se_kernel, {'length_scale': 0.3}),
]

for ax, (name, kernel_func, params) in zip(axes.ravel(), kernels):
    K = build_kernel_matrix(X, kernel_func, sigma=1.0, **params)
    L = np.linalg.cholesky(K)
    samples = L @ np.random.randn(len(X), 5)
    for s in samples.T:
        ax.plot(X, s, alpha=0.7)
    ax.set_title(name)
    ax.set_xlabel('x')
    ax.set_ylabel('f(x)')
    ax.grid(True, alpha=0.3)
    ax.set_ylim(-3, 3)

plt.suptitle('GP Samples with Different Kernels', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
```

3. Periodic Kernel:
$$k_{Per}(x, x') = \sigma^2 \exp\left( -\frac{2\sin^2(\pi|x-x'|/p)}{\ell^2} \right)$$
Properties:

- Exactly periodic: $f(x)$ and $f(x + p)$ are perfectly correlated, so samples repeat with period $p$.
- The length scale $\ell$ controls how smooth the function is within each period.
- Useful for seasonal or otherwise repeating phenomena; often multiplied by an SE kernel so the pattern can drift over time.
4. Rational Quadratic (RQ):
$$k_{RQ}(x, x') = \sigma^2 \left( 1 + \frac{|x-x'|^2}{2\alpha\ell^2} \right)^{-\alpha}$$
Properties:

- Equivalent to an infinite scale mixture of SE kernels with different length scales.
- The parameter $\alpha$ controls the mixture: small $\alpha$ allows variation on many scales; as $\alpha \to \infty$ the RQ kernel converges to the SE kernel.
- Useful when the function varies on multiple length scales simultaneously.
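As a quick numerical check of the scale-mixture view (a minimal sketch assuming NumPy; the distance grid and hyperparameters are illustrative), the RQ kernel approaches the SE kernel as $\alpha$ grows:

```python
import numpy as np

def rq_kernel(r, sigma=1.0, length_scale=1.0, alpha=1.0):
    """Rational Quadratic kernel as a function of distance r."""
    return sigma**2 * (1 + r**2 / (2 * alpha * length_scale**2)) ** (-alpha)

def se_kernel(r, sigma=1.0, length_scale=1.0):
    """Squared Exponential kernel as a function of distance r."""
    return sigma**2 * np.exp(-r**2 / (2 * length_scale**2))

r = np.linspace(0, 5, 200)
for alpha in [0.5, 2.0, 10.0, 100.0]:
    # The maximum gap shrinks toward zero as alpha increases.
    max_gap = np.max(np.abs(rq_kernel(r, alpha=alpha) - se_kernel(r)))
    print(f"alpha = {alpha:6.1f}   max |k_RQ - k_SE| = {max_gap:.4f}")
```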
5. Linear (Polynomial) Kernels:
$$k_{Lin}(x, x') = \sigma_b^2 + \sigma_v^2 (x - c)(x' - c)$$
Properties:

- Non-stationary: the covariance depends on the inputs themselves, not just their separation, and grows with distance from the offset $c$.
- Sample functions are straight lines with random slope and intercept (GP regression with this kernel is Bayesian linear regression).
- Products of linear kernels yield polynomial kernels of higher degree.
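To illustrate the last point, here is a minimal sketch (assuming NumPy/Matplotlib; hyperparameter values are illustrative and a small jitter is added for numerical stability): squaring the linear kernel gives a quadratic kernel whose samples are roughly parabolic.

```python
import numpy as np
import matplotlib.pyplot as plt

def linear_kernel_matrix(X, sigma_b=1.0, sigma_v=0.5, c=0.0):
    """Linear kernel matrix on 1-D inputs."""
    x = X.ravel()
    return sigma_b**2 + sigma_v**2 * np.outer(x - c, x - c)

np.random.seed(0)
X = np.linspace(-5, 5, 100).reshape(-1, 1)

K_lin = linear_kernel_matrix(X)
K_quad = K_lin * K_lin            # elementwise product of kernels is a valid kernel

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
for ax, K, title in [(axes[0], K_lin, 'Linear kernel: line samples'),
                     (axes[1], K_quad, 'Linear × Linear: quadratic samples')]:
    L = np.linalg.cholesky(K + 1e-6 * np.eye(len(X)))
    samples = L @ np.random.randn(len(X), 5)
    ax.plot(X.ravel(), samples, alpha=0.7)
    ax.set_title(title)
    ax.set_xlabel('x')
plt.tight_layout()
plt.show()
```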
Every kernel family has hyperparameters that control its behavior. Understanding these parameters is essential for both specifying priors and interpreting learned models.
Signal Variance $\sigma^2$ (or $\sigma_f^2$):
The signal variance controls the amplitude of function variation. It appears as a multiplicative factor: $$k(x, x') = \sigma^2 \cdot \tilde{k}(x, x')$$
Length Scale $\ell$:
The length scale controls how quickly correlation decays with distance: a small $\ell$ means correlation drops off rapidly, producing wiggly, fast-varying functions, while a large $\ell$ keeps distant points correlated, producing smooth, slowly varying functions.
Intuitively, $\ell$ answers: "How far do I need to move in input space before the function value becomes essentially uncorrelated?"
For the SE kernel, the correlation between $f(x)$ and $f(x')$ equals $e^{-1/2} \approx 0.61$ when $|x - x'| = \ell$. This provides a concrete interpretation: $\ell$ is roughly the distance over which the function can change significantly. When choosing priors, ask: 'Over what input distance do I expect the function to vary meaningfully?'
```python
import numpy as np
import matplotlib.pyplot as plt

def se_kernel_matrix(X, sigma=1.0, length_scale=1.0):
    """SE kernel matrix construction"""
    n = len(X)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            r2 = np.sum((X[i] - X[j])**2)
            K[i, j] = sigma**2 * np.exp(-r2 / (2 * length_scale**2))
    return K + 1e-8 * np.eye(n)

np.random.seed(42)
X = np.linspace(-5, 5, 150).reshape(-1, 1)

# Effect of hyperparameters
fig, axes = plt.subplots(2, 3, figsize=(15, 8))

# Row 1: Varying length scale (fixed sigma=1.0)
length_scales = [0.3, 1.0, 3.0]
for ax, ls in zip(axes[0], length_scales):
    K = se_kernel_matrix(X, sigma=1.0, length_scale=ls)
    L = np.linalg.cholesky(K)
    samples = L @ np.random.randn(len(X), 5)
    for s in samples.T:
        ax.plot(X, s, alpha=0.7)
    ax.set_title(f'Length Scale ℓ = {ls}')
    ax.set_xlabel('x')
    ax.set_ylabel('f(x)')
    ax.set_ylim(-4, 4)
    ax.grid(True, alpha=0.3)

# Row 2: Varying sigma (fixed length_scale=1.0)
sigmas = [0.5, 1.0, 2.0]
for ax, sig in zip(axes[1], sigmas):
    K = se_kernel_matrix(X, sigma=sig, length_scale=1.0)
    L = np.linalg.cholesky(K)
    samples = L @ np.random.randn(len(X), 5)
    for s in samples.T:
        ax.plot(X, s, alpha=0.7)
    ax.set_title(f'Signal Std. Dev. σ = {sig}')
    ax.set_xlabel('x')
    ax.set_ylabel('f(x)')
    ax.set_ylim(-6, 6)
    ax.grid(True, alpha=0.3)

plt.suptitle('Effect of Hyperparameters on GP Samples', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Visualization of correlation decay
fig, ax = plt.subplots(figsize=(10, 6))
distances = np.linspace(0, 5, 100)

for ls in [0.5, 1.0, 2.0]:
    correlation = np.exp(-distances**2 / (2 * ls**2))
    ax.plot(distances, correlation, linewidth=2, label=f'ℓ = {ls}')
    # Mark length scale
    ax.axhline(y=np.exp(-0.5), color='gray', linestyle='--', alpha=0.3)
    ax.axvline(x=ls, color='gray', linestyle=':', alpha=0.3)

ax.set_xlabel("Distance |x - x'|")
ax.set_ylabel('Correlation')
ax.set_title('Correlation Decay for Different Length Scales')
ax.legend()
ax.grid(True, alpha=0.3)
ax.annotate('Correlation = 0.61 (at ℓ)', xy=(1.0, np.exp(-0.5)), xytext=(2.5, 0.5),
            arrowprops=dict(arrowstyle='->', color='gray'))
plt.show()
```

Anisotropic Length Scales (ARD):
In multi-dimensional inputs, we can assign different length scales to each dimension:
$$k(\mathbf{x}, \mathbf{x}') = \sigma^2 \exp\left( -\frac{1}{2} \sum_{d=1}^{D} \frac{(x_d - x_d')^2}{\ell_d^2} \right)$$
This is called Automatic Relevance Determination (ARD):
ARD provides automatic feature selection: irrelevant dimensions will have their length scales pushed to infinity during optimization, effectively removing them from the model.
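A minimal sketch of the ARD form (assuming NumPy; the toy inputs and length-scale values are illustrative): with one length scale per dimension, differences along a dimension with a very long length scale are effectively ignored.

```python
import numpy as np

def ard_se_kernel(X1, X2, sigma=1.0, length_scales=(1.0, 1.0)):
    """ARD Squared Exponential kernel between two sets of D-dimensional inputs."""
    ls = np.asarray(length_scales)  # one length scale per dimension
    scaled_sq_dists = np.sum(
        ((X1[:, None, :] - X2[None, :, :]) / ls) ** 2, axis=-1
    )
    return sigma**2 * np.exp(-0.5 * scaled_sq_dists)

# Two 2-D inputs that differ only along dimension 1
x_a = np.array([[0.0, 0.0]])
x_b = np.array([[0.0, 3.0]])

# Short length scale in dim 1 -> the difference matters -> correlation near 0
print(ard_se_kernel(x_a, x_b, length_scales=(1.0, 0.5)))
# Huge length scale in dim 1 -> dimension effectively ignored -> correlation near 1
print(ard_se_kernel(x_a, x_b, length_scales=(1.0, 100.0)))
```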
| Hyperparameter | Symbol | Range | Effect When Increased |
|---|---|---|---|
| Signal Variance | $\sigma^2$ | $(0, \infty)$ | Larger function magnitude; wider confidence bands |
| Length Scale | $\ell$ | $(0, \infty)$ | Smoother functions; slower variation |
| Noise Variance | $\sigma_n^2$ | $(0, \infty)$ | More smoothing; the fit passes less exactly through the data |
| Matérn Smoothness | $\nu$ | $(0, \infty)$ | More differentiable functions |
| Period | $p$ | $(0, \infty)$ | Longer period for periodic patterns |
One of the most powerful aspects of kernel methods is that we can combine kernels to express complex prior beliefs. Just as we can build complex probability distributions from simpler ones, we can construct expressive kernels from basic building blocks.
Kernel Closure Properties:
If $k_1$ and $k_2$ are valid (PSD) kernels, so are:
1. Sum: $k(x, x') = k_1(x, x') + k_2(x, x')$
2. Product: $k(x, x') = k_1(x, x') \cdot k_2(x, x')$
3. Scaling: $k(x, x') = c \cdot k_1(x, x')$ for constant $c > 0$
4. Composition with function: $k(x, x') = k_1(g(x), g(x'))$ for any function $g$
Addition says: 'The function is a sum of independent components with different characteristics.' Multiplication says: 'The function must satisfy constraints from both kernels simultaneously.' Addition tends to create more complex functions; multiplication creates more constrained ones. A product of kernels is only as smooth as its least smooth factor.
```python
import numpy as np
import matplotlib.pyplot as plt

def se_kernel(x1, x2, sigma=1.0, length_scale=1.0):
    r2 = np.sum((x1 - x2)**2)
    return sigma**2 * np.exp(-r2 / (2 * length_scale**2))

def periodic_kernel(x1, x2, sigma=1.0, length_scale=1.0, period=1.0):
    r = np.abs(x1 - x2).sum()
    return sigma**2 * np.exp(-2 * np.sin(np.pi * r / period)**2 / length_scale**2)

def linear_kernel(x1, x2, sigma_b=1.0, sigma_v=1.0, c=0.0):
    return sigma_b**2 + sigma_v**2 * (x1 - c) * (x2 - c)

def build_kernel_matrix(X, kernel_func, **params):
    n = len(X)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel_func(X[i].item(), X[j].item(), **params)
    return K + 1e-8 * np.eye(n)

# Combined kernels
def linear_plus_se(x1, x2):
    """Linear trend + smooth deviations"""
    return linear_kernel(x1, x2, sigma_v=0.3) + se_kernel(x1, x2, sigma=1.0, length_scale=1.0)

def se_times_periodic(x1, x2):
    """Locally periodic: periodic structure that can evolve"""
    return se_kernel(x1, x2, sigma=1.0, length_scale=3.0) * periodic_kernel(x1, x2, length_scale=0.5, period=2.0)

def se_plus_periodic(x1, x2):
    """Smooth + periodic: long-term trend with seasonal component"""
    return se_kernel(x1, x2, sigma=1.0, length_scale=2.0) + periodic_kernel(x1, x2, sigma=0.5, length_scale=0.5, period=1.5)

def se_plus_se(x1, x2):
    """Multi-scale: both local and global structure"""
    return se_kernel(x1, x2, sigma=0.5, length_scale=0.5) + se_kernel(x1, x2, sigma=1.0, length_scale=3.0)

# Visualization
np.random.seed(42)
X = np.linspace(-5, 5, 200).reshape(-1, 1)

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

combos = [
    ('Linear + SE (trend + deviations)', linear_plus_se),
    ('SE × Periodic (evolving periodicity)', se_times_periodic),
    ('SE + Periodic (trend + seasonal)', se_plus_periodic),
    ('SE(short) + SE(long) (multi-scale)', se_plus_se),
]

for ax, (name, kernel_func) in zip(axes.ravel(), combos):
    K = build_kernel_matrix(X, kernel_func)
    L = np.linalg.cholesky(K)
    samples = L @ np.random.randn(len(X), 5)
    for s in samples.T:
        ax.plot(X, s, alpha=0.7, linewidth=1.5)
    ax.set_title(name)
    ax.set_xlabel('x')
    ax.set_ylabel('f(x)')
    ax.grid(True, alpha=0.3)

plt.suptitle('GP Samples with Composite Kernels', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
```

Common Composite Kernels:
Long-term + Short-term (Multi-scale): $$k = k_{SE}(\ell_{long}) + k_{SE}(\ell_{short})$$ Captures both global trends and local wiggles.
Trend + Seasonality: $$k = k_{Linear} + k_{Periodic}$$ Common in time series: linear growth with periodic fluctuations.
Locally Periodic: $$k = k_{SE} \times k_{Periodic}$$ A periodic pattern whose shape and amplitude can slowly evolve over time.
Non-stationary via Warping: $$k(x, x') = k_{SE}(g(x), g(x'))$$ Where $g$ is a warping function (e.g., log, sigmoid).
Additive Structure: $$k(\mathbf{x}, \mathbf{x}') = \sum_d k_d(x_d, x_d')$$ For high-dimensional inputs where the function is a sum of univariate functions.
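To make the warping idea concrete, here is a minimal sketch (assuming NumPy/Matplotlib; the choice of $g(x) = \log(x)$ and all hyperparameters are illustrative): applying the SE kernel to log-transformed inputs produces functions that vary quickly for small $x$ and slowly for large $x$.

```python
import numpy as np
import matplotlib.pyplot as plt

def se_kernel_matrix(Z, sigma=1.0, length_scale=1.0):
    """SE kernel matrix on (possibly warped) 1-D inputs."""
    diffs = Z[:, None] - Z[None, :]
    return sigma**2 * np.exp(-diffs**2 / (2 * length_scale**2)) + 1e-8 * np.eye(len(Z))

np.random.seed(0)
X = np.linspace(0.1, 10, 200)          # strictly positive inputs for the log warp

K_plain = se_kernel_matrix(X, length_scale=1.0)            # stationary SE
K_warped = se_kernel_matrix(np.log(X), length_scale=0.3)   # SE on g(x) = log(x)

fig, axes = plt.subplots(1, 2, figsize=(12, 4), sharey=True)
for ax, K, title in [(axes[0], K_plain, 'Stationary SE'),
                     (axes[1], K_warped, 'SE on log-warped inputs')]:
    L = np.linalg.cholesky(K)
    samples = L @ np.random.randn(len(X), 4)
    ax.plot(X, samples, alpha=0.7)
    ax.set_title(title)
    ax.set_xlabel('x')
plt.tight_layout()
plt.show()
```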
The true power of GP prior specification lies in encoding domain knowledge. The kernel is a language for expressing what we believe about the unknown function before seeing data.
Smoothness Beliefs: How many times do you expect the function to be differentiable? Very smooth behavior suggests the SE kernel; rough or only finitely differentiable behavior suggests a Matérn kernel with small $\nu$.
Scale Beliefs: What is the typical amplitude of the function, and over what input distance does it change meaningfully? These beliefs translate directly into the signal variance $\sigma^2$ and the length scale $\ell$.
Structural Beliefs: Does the function have trends, periodicity, or additive structure across inputs? These call for linear, periodic, or sum kernels respectively.
When specifying priors, resist the temptation to choose kernels that will fit the data well. The prior should encode beliefs you hold before seeing data. If you peek at the data and then choose a kernel 'because it looks like it will work,' you're doing informal model selection, not prior specification. This distinction matters for principled Bayesian inference and for honest uncertainty quantification.
| Domain Belief | Kernel Choice | Hyperparameter Setting |
|---|---|---|
| Physical processes are continuous | Matérn-5/2 or SE | Standard length scale from physics |
| Time series with daily seasonality | Periodic × SE | Period = 24 hours; SE for drift |
| Stock prices (random walk-like) | Matérn-1/2 + noise | Short length scale; variance matching historical volatility |
| Sensor noise (rough, continuous) | Matérn-3/2 | Short length scale for rapid variation |
| Spatial interpolation | SE with ARD | Separate length scales per coordinate |
| Multi-fidelity experiments | Sum/product of kernels | Different kernels for different fidelities |
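As one concrete way to express the "daily seasonality" row in code, here is a sketch assuming scikit-learn is available; all hyperparameter values are illustrative, not recommendations:

```python
# Trend + daily seasonality prior expressed with scikit-learn kernel objects.
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ExpSineSquared, WhiteKernel, ConstantKernel

# Slow drift: SE (RBF) with a long length scale, scaled by a signal variance.
drift = ConstantKernel(1.0) * RBF(length_scale=100.0)

# Daily seasonality: periodic kernel with period 24 (hours),
# multiplied by a wide RBF so the seasonal pattern can evolve slowly.
seasonal = (ConstantKernel(1.0) * ExpSineSquared(length_scale=1.0, periodicity=24.0)
            * RBF(length_scale=500.0))

# Observation noise.
noise = WhiteKernel(noise_level=1.0)

kernel = drift + seasonal + noise
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
print(kernel)  # inspect the composite prior before fitting
```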
A Principled Workflow for Prior Specification:
List your beliefs: Before looking at data, write down what you expect. Is the function smooth? Does it have trends? Seasonality? What's the typical magnitude?
Translate to kernels: Map each belief to a kernel component. Combine with sums (additive structure) or products (interactive structure).
Set hyperparameter ranges: Based on domain knowledge, determine plausible ranges for each hyperparameter.
Visualize prior samples: Draw samples from your GP prior. Do they look reasonable? If not, revise.
Document your choices: For reproducibility and scientific rigor, record why you made each choice.
Plan for learning: Decide which hyperparameters to estimate from data (optimize or marginalize) vs. fix from domain knowledge.
One of the most valuable practices in GP modeling is prior predictive checking: sampling functions from your prior and examining whether they look plausible before seeing any data.
This practice catches several common issues:
1. Unreasonable Smoothness: samples are far smoother, or far rougher, than any function you would expect in the domain.
2. Wrong Scale: function values, or the distances over which they vary, are orders of magnitude too large or too small for the quantity being modeled.
3. Missing Structure: expected features such as periodicity or trends are absent from the samples.
4. Wrong Complexity: the prior allows far more flexibility than the problem warrants, or is too rigid to express plausible functions.
```python
import numpy as np
import matplotlib.pyplot as plt

def comprehensive_prior_check(X, kernel_func, n_samples=20, title="Prior Check"):
    """
    Comprehensive prior predictive check for a GP.
    Plots samples, mean, and credible intervals.
    """
    n = len(X)

    # Build kernel matrix
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel_func(X[i].item(), X[j].item())
    K += 1e-8 * np.eye(n)

    # Sample from prior
    L = np.linalg.cholesky(K)
    samples = L @ np.random.randn(n, n_samples)

    # Compute statistics
    prior_mean = np.zeros(n)  # Zero mean GP
    prior_std = np.sqrt(np.diag(K))
    sample_percentiles = np.percentile(samples, [5, 25, 50, 75, 95], axis=1)

    # Create figure
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))

    # Plot 1: All samples
    ax = axes[0, 0]
    for i in range(min(10, n_samples)):
        ax.plot(X, samples[:, i], alpha=0.5)
    ax.axhline(0, color='k', linestyle='--', alpha=0.3)
    ax.set_title('Prior Samples (10 shown)')
    ax.set_xlabel('x')
    ax.set_ylabel('f(x)')
    ax.grid(True, alpha=0.3)

    # Plot 2: Mean and credible bands (from samples)
    ax = axes[0, 1]
    ax.fill_between(X.ravel(), sample_percentiles[0], sample_percentiles[4],
                    alpha=0.15, label='90% CI', color='C0')
    ax.fill_between(X.ravel(), sample_percentiles[1], sample_percentiles[3],
                    alpha=0.25, label='50% CI', color='C0')
    ax.plot(X, sample_percentiles[2], 'C0-', linewidth=2, label='Median')
    ax.plot(X, prior_mean, 'k--', linewidth=1, label='Prior Mean')
    ax.set_title('Prior Predictive Distribution')
    ax.legend()
    ax.set_xlabel('x')
    ax.set_ylabel('f(x)')
    ax.grid(True, alpha=0.3)

    # Plot 3: Histogram of function values at a single point
    ax = axes[1, 0]
    mid_idx = n // 2
    ax.hist(samples[mid_idx, :], bins=20, density=True, alpha=0.7, color='C1')
    x_vals = np.linspace(samples[mid_idx, :].min(), samples[mid_idx, :].max(), 100)
    ax.plot(x_vals,
            np.exp(-x_vals**2 / (2 * prior_std[mid_idx]**2)) / (prior_std[mid_idx] * np.sqrt(2*np.pi)),
            'k-', linewidth=2)
    ax.set_title(f'Marginal at x = {X[mid_idx].item():.2f}')
    ax.set_xlabel('f(x)')
    ax.set_ylabel('Density')
    ax.grid(True, alpha=0.3)

    # Plot 4: Kernel matrix visualization
    ax = axes[1, 1]
    im = ax.imshow(K, cmap='viridis', aspect='auto')
    plt.colorbar(im, ax=ax)
    ax.set_title('Kernel Matrix K')
    ax.set_xlabel('Input Index i')
    ax.set_ylabel('Input Index j')

    plt.suptitle(title, fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()

    # Print summary statistics
    print(f"\n{title} - Summary Statistics:")
    print(f"  Prior variance at each point: {prior_std[0]:.4f}²")
    print(f"  Sample range: [{samples.min():.3f}, {samples.max():.3f}]")
    print(f"  90% of samples between: [{sample_percentiles[0].min():.3f}, {sample_percentiles[4].max():.3f}]")

# Example: Check different priors
np.random.seed(42)
X = np.linspace(-5, 5, 100).reshape(-1, 1)

# Prior 1: Good for smooth, bounded functions
def kernel_smooth(x1, x2):
    return np.exp(-(x1 - x2)**2 / (2 * 2.0**2))

# Prior 2: Good for rough, rapidly varying functions
def kernel_rough(x1, x2):
    r = np.abs(x1 - x2)
    return np.exp(-r / 0.5)  # Matérn-1/2

# Prior 3: Good for periodic phenomena
def kernel_seasonal(x1, x2):
    r = np.abs(x1 - x2)
    se = np.exp(-r**2 / (2 * 5**2))
    per = np.exp(-2 * np.sin(np.pi * r / 2)**2 / 0.5**2)
    return se * per

comprehensive_prior_check(X, kernel_smooth, title="Smooth Prior (SE, ℓ=2)")
comprehensive_prior_check(X, kernel_rough, title="Rough Prior (Matérn-1/2, ℓ=0.5)")
comprehensive_prior_check(X, kernel_seasonal, title="Seasonal Prior (SE × Periodic)")
```

If you don't like the prior samples, you won't like the posterior either. The prior shapes the posterior, especially with limited data.
Always visualize prior samples before fitting. If they look unreasonable, revise the prior before proceeding.
We've explored the rich landscape of GP prior specification, learning how to encode domain knowledge through mean functions and kernels.
What's Next:
With prior specification mastered, we're ready for the heart of GP inference: posterior inference. Given data, how do we update our beliefs about the unknown function? The next page derives the closed-form posterior distribution and shows how conditioning on observations transforms our prior into predictions.
You now understand how to specify GP priors through careful choice of mean functions and kernels. This skill—translating domain knowledge into prior distributions—is fundamental to effective Bayesian modeling. Next, we'll learn how to update these priors with data through posterior inference.