If Gaussian Processes are the engine of non-parametric Bayesian inference, then kernels are the soul that determines what kinds of functions we believe in. Every choice we make about the kernel directly translates into assumptions about function smoothness, periodicity, correlation length, and ultimate behavior.
This page focuses on stationary kernels—also called shift-invariant kernels—which represent the most widely-used class of covariance functions. Stationary kernels encode the idea that the statistical relationship between function values depends only on the distance between input points, not their absolute location. This property makes them extraordinarily versatile for modeling phenomena where we expect translation invariance.
By the end of this page, you will understand: (1) the formal definition and properties of stationary kernels, (2) the mathematical structure of the RBF, Matérn, and Periodic kernels, (3) how each kernel encodes different prior beliefs about function smoothness, (4) the spectral interpretation of stationarity, and (5) practical guidelines for selecting stationary kernels for your applications.
Before diving into specific kernels, we must establish precisely what stationarity means in the context of covariance functions. This concept is foundational to understanding why certain kernels are appropriate for certain problems.
Definition: Stationary Kernel
A kernel (covariance function) $k(\mathbf{x}, \mathbf{x}')$ is stationary if it can be expressed as a function of only the difference $\mathbf{x} - \mathbf{x}'$ rather than the absolute positions:
$$k(\mathbf{x}, \mathbf{x}') = k(\mathbf{x} - \mathbf{x}')$$
This means the covariance between any two function values $f(\mathbf{x})$ and $f(\mathbf{x}')$ depends only on the vector separating them, not where they are in input space. If we translate both points by the same amount, the covariance remains unchanged.
Definition: Isotropic Kernel
A stationary kernel is isotropic (or radial) if it depends only on the Euclidean distance $r = \|\mathbf{x} - \mathbf{x}'\|$:
$$k(\mathbf{x}, \mathbf{x}') = k(r) \quad \text{where } r = \|\mathbf{x} - \mathbf{x}'\|$$
Isotropic kernels are rotationally invariant—they treat all directions in input space identically. Most common stationary kernels are isotropic.
| Kernel Type | Depends On | Invariance Property | Example |
|---|---|---|---|
| Stationary & Isotropic | Distance $r = \|\mathbf{x} - \mathbf{x}'\|$ | Translation + Rotation | RBF, Matérn |
| Stationary but Anisotropic | Vector $\mathbf{x} - \mathbf{x}'$ | Translation only | ARD kernels |
| Non-stationary | Absolute positions $\mathbf{x}, \mathbf{x}'$ | Neither | Neural Network kernel, Polynomial kernel |
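The invariance properties in the table above are easy to check numerically. The sketch below is an illustration added for intuition (the kernel, points, shift, and rotation angle are arbitrary choices, not from the original text): it verifies that an isotropic RBF-style kernel value is unchanged when both inputs are translated by the same vector or rotated about the origin.

```python
import numpy as np

def rbf(x, x_prime, length_scale=1.0):
    """Isotropic RBF kernel between two points (illustrative helper)."""
    r2 = np.sum((x - x_prime) ** 2)
    return np.exp(-r2 / (2 * length_scale**2))

rng = np.random.default_rng(0)
x, x_prime = rng.normal(size=2), rng.normal(size=2)

# Translation invariance (stationarity): shift both points by the same vector
shift = np.array([3.7, -1.2])
print(np.isclose(rbf(x, x_prime), rbf(x + shift, x_prime + shift)))  # True

# Rotation invariance (isotropy): rotate both points about the origin
theta = 0.8
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
print(np.isclose(rbf(x, x_prime), rbf(R @ x, R @ x_prime)))  # True
```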
Stationary kernels encode the prior belief that the underlying function has the same statistical properties everywhere in the input domain. This is appropriate when there's no reason to believe that correlations or smoothness should change across the input space. For example, temperature variations across a homogeneous field, or signal processing where the noise characteristics don't change over time.
Connection to Stochastic Processes
The term 'stationary' in kernel theory directly connects to stationarity in stochastic processes. A Gaussian Process with a stationary kernel is a wide-sense stationary stochastic process, meaning its mean function is constant and the covariance between $f(\mathbf{x})$ and $f(\mathbf{x} + \boldsymbol{\tau})$ depends only on the lag $\boldsymbol{\tau}$.
This connection provides powerful tools for analysis. In particular, Bochner's theorem gives us a complete characterization: a continuous, stationary kernel is positive definite if and only if it is the Fourier transform of a finite, non-negative measure (the spectral measure). This spectral representation is the foundation of spectral analysis of GPs, which we'll explore in a later page.
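As a small numerical illustration of this spectral view (a sketch added for intuition; the 1-D RBF kernel, grid, and frequencies below are arbitrary choices), the code compares a numerical Fourier transform of the RBF covariance $k(\tau) = \sigma^2 \exp(-\tau^2 / 2\ell^2)$ with its known Gaussian spectral density $S(\omega) = \sigma^2 \ell \sqrt{2\pi} \exp(-\ell^2 \omega^2 / 2)$. The two agree and are non-negative everywhere, exactly as Bochner's theorem requires.

```python
import numpy as np

sigma_sq, ell = 1.0, 0.7

# Stationary covariance as a function of the lag tau
tau = np.linspace(-30, 30, 20001)
dtau = tau[1] - tau[0]
k_tau = sigma_sq * np.exp(-tau**2 / (2 * ell**2))

# Numerical spectral density: S(w) = integral of k(tau) * cos(w * tau) dtau
omegas = np.linspace(0.0, 5.0, 50)
S_numeric = np.array([np.sum(k_tau * np.cos(w * tau)) * dtau for w in omegas])

# Closed-form spectral density of the 1-D RBF kernel (Gaussian in frequency)
S_analytic = sigma_sq * ell * np.sqrt(2 * np.pi) * np.exp(-(ell**2) * omegas**2 / 2)

print(np.max(np.abs(S_numeric - S_analytic)))  # tiny: numerical and analytic agree
print(np.all(S_numeric >= -1e-10))             # non-negative, as Bochner requires
```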
The Positive Definiteness Requirement
Not every function of $\|\mathbf{x} - \mathbf{x}'\|$ is a valid kernel. A kernel must be positive semi-definite (PSD): for any set of points $\{\mathbf{x}_1, \dots, \mathbf{x}_n\}$, the resulting covariance matrix $\mathbf{K}$ with entries $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$ must be positive semi-definite (all eigenvalues $\geq 0$).
This constraint ensures that the GP defines a valid probability distribution. Violating it leads to nonsensical 'negative variances.' Fortunately, the kernels we discuss here are all provably PSD.
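A quick numerical sanity check of this property (my own illustration; the point set and hyperparameters are arbitrary): build the covariance matrix on some random inputs and confirm that its eigenvalues are non-negative up to floating-point round-off.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))  # 50 random points in 2-D

# RBF covariance matrix on these points (length-scale 1.5)
K = np.exp(-cdist(X, X, metric='sqeuclidean') / (2 * 1.5**2))

# Eigenvalues of a PSD matrix are >= 0 (up to numerical round-off)
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min())             # tiny negative values (~-1e-15) are round-off only
print(np.all(eigvals > -1e-10))  # True: the kernel matrix is PSD
```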
The Radial Basis Function (RBF) kernel, also known as the Squared Exponential or Gaussian kernel, is the most widely used kernel in GP practice. Its ubiquity stems from its simplicity, smoothness properties, and well-understood behavior.
Mathematical Definition
$$k_{\text{RBF}}(\mathbf{x}, \mathbf{x}') = \sigma^2 \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\ell^2}\right)$$
Where:
- $\sigma^2$ is the signal variance (output scale), setting the overall amplitude of function variation
- $\ell$ is the length-scale, controlling how quickly correlation decays with distance
Alternatively, using the squared distance $r^2 = \|\mathbf{x} - \mathbf{x}'\|^2$:
$$k_{\text{RBF}}(r) = \sigma^2 \exp\left(-\frac{r^2}{2\ell^2}\right)$$
The RBF kernel produces functions that are infinitely differentiable (C^∞). This is both its greatest strength and its main weakness: it is ideal for modeling smooth phenomena, but it imposes unrealistically smooth functions on data with rough or irregular behavior. The mean-squared differentiability of a GP is determined by how smooth its kernel is at the origin ($r = 0$).
Understanding the Length-Scale Parameter
The length-scale $\ell$ is perhaps the most important hyperparameter in GP modeling. It answers the question: over what distance do function values remain correlated?
Rule of Thumb: If $\|\mathbf{x} - \mathbf{x}'\| \gg \ell$, then $k(\mathbf{x}, \mathbf{x}') \approx 0$. Points separated by more than a few length-scales are essentially uncorrelated.
Quantitative Relationship:
| Distance | Correlation (as fraction of $\sigma^2$) |
|---|---|
| $r = 0$ | 1.000 (perfect correlation) |
| $r = \ell$ | 0.606 |
| $r = 2\ell$ | 0.135 |
| $r = 3\ell$ | 0.011 |
This exponential decay is characteristic of the RBF kernel.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist

def rbf_kernel(X1, X2, sigma_sq=1.0, length_scale=1.0):
    """
    Compute the RBF (Squared Exponential) kernel matrix.

    Parameters:
    -----------
    X1 : ndarray of shape (n1, d)
        First set of input points
    X2 : ndarray of shape (n2, d)
        Second set of input points
    sigma_sq : float
        Signal variance (output scale)
    length_scale : float
        Length-scale (determines correlation decay rate)

    Returns:
    --------
    K : ndarray of shape (n1, n2)
        Kernel matrix where K[i,j] = k(X1[i], X2[j])

    Mathematical Definition:
        k(x, x') = σ² · exp(-||x - x'||² / (2ℓ²))
    """
    # Compute pairwise squared Euclidean distances
    sq_dists = cdist(X1, X2, metric='sqeuclidean')

    # Apply the RBF kernel formula
    K = sigma_sq * np.exp(-sq_dists / (2 * length_scale**2))

    return K

def visualize_rbf_samples(length_scales=[0.5, 1.0, 2.0], n_samples=5):
    """
    Visualize GP samples with different length-scales to build intuition.
    """
    X = np.linspace(0, 10, 200).reshape(-1, 1)
    fig, axes = plt.subplots(1, len(length_scales), figsize=(15, 4))

    for ax, ell in zip(axes, length_scales):
        K = rbf_kernel(X, X, sigma_sq=1.0, length_scale=ell)
        # Add small jitter for numerical stability
        K += 1e-8 * np.eye(len(X))

        # Sample from the GP prior
        samples = np.random.multivariate_normal(
            mean=np.zeros(len(X)), cov=K, size=n_samples
        )

        for sample in samples:
            ax.plot(X, sample, alpha=0.7)

        ax.set_title(f'Length-scale ℓ = {ell}')
        ax.set_xlabel('x')
        ax.set_ylabel('f(x)')
        ax.grid(True, alpha=0.3)

    plt.suptitle('RBF Kernel: Effect of Length-Scale on GP Samples')
    plt.tight_layout()
    return fig

# Example: Compare correlation decay
distances = np.array([0, 0.5, 1.0, 1.5, 2.0, 3.0])
ell = 1.0
correlations = np.exp(-distances**2 / (2 * ell**2))

print("Distance vs Correlation (ℓ = 1.0):")
for d, c in zip(distances, correlations):
    print(f"  r = {d:.1f}: correlation = {c:.4f}")
```
When to Use the RBF Kernel
✅ Appropriate for:
- Smooth, slowly varying functions (e.g., well-behaved physical quantities)
- Interpolation problems where the underlying surface is believed to be very regular
- A simple, well-understood default when smoothness is a reasonable assumption
❌ Inappropriate for:
- Rough or irregular functions with sharp local variation
- Data with abrupt transitions, kinks, or discontinuities
- Applications where honest uncertainty between observations matters more than smooth interpolation
The Over-Smoothing Problem
A notorious issue with the RBF kernel is its tendency toward over-smoothing or over-confidence. Because it assumes infinite differentiability, the posterior predictions can be overly certain in regions between data points, failing to capture the true uncertainty when the underlying function is rough.
The Matérn kernel family addresses the RBF kernel's over-smoothness problem by providing explicit control over function differentiability. This makes it arguably the most versatile family of stationary kernels for practical applications.
Mathematical Definition
The general Matérn kernel is:
$$k_{\text{Matérn}}(r) = \sigma^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\, r}{\ell}\right)^\nu K_\nu\left(\frac{\sqrt{2\nu}\, r}{\ell}\right)$$
Where:
- $r = \|\mathbf{x} - \mathbf{x}'\|$ is the Euclidean distance between inputs
- $\sigma^2$ is the signal variance and $\ell$ is the length-scale
- $\nu > 0$ is the smoothness parameter
- $\Gamma(\cdot)$ is the gamma function and $K_\nu(\cdot)$ is the modified Bessel function of the second kind
The Smoothness Parameter ν
The parameter $\nu$ directly controls the mean-squared differentiability of the GP samples:
| ν Value | Common Name | Differentiability | Formula (Simplified) | Typical Use Case |
|---|---|---|---|---|
| ν = 1/2 | Exponential / Matérn-1/2 | C^0 (continuous, not differentiable) | $\sigma^2 \exp(-r/\ell)$ | Rough, non-smooth functions (e.g., the Ornstein-Uhlenbeck process) |
| ν = 3/2 | Matérn-3/2 | C^1 (once differentiable) | $\sigma^2 (1 + \frac{\sqrt{3}r}{\ell}) \exp(-\frac{\sqrt{3}r}{\ell})$ | Moderately smooth functions, physical processes |
| ν = 5/2 | Matérn-5/2 | C^2 (twice differentiable) | $\sigma^2 (1 + \frac{\sqrt{5}r}{\ell} + \frac{5r^2}{3\ell^2}) \exp(-\frac{\sqrt{5}r}{\ell})$ | Smooth functions, most common default |
| ν → ∞ | RBF / Squared Exponential | C^∞ (infinitely differentiable) | $\sigma^2 \exp(-\frac{r^2}{2\ell^2})$ | Very smooth, analytic functions |
In practice, Matérn-5/2 has emerged as the default recommendation for many applications. It's smooth enough for typical physical processes (twice differentiable) while avoiding the over-confidence issues of the RBF kernel. Many practitioners and libraries (like GPyTorch, GPflow) use Matérn-5/2 as their default kernel.
```python
import numpy as np
from scipy.special import gamma, kv as bessel_k
from scipy.spatial.distance import cdist

def matern_kernel(X1, X2, sigma_sq=1.0, length_scale=1.0, nu=2.5):
    """
    Compute the Matérn kernel matrix.

    Parameters:
    -----------
    X1 : ndarray of shape (n1, d)
    X2 : ndarray of shape (n2, d)
    sigma_sq : float
        Signal variance
    length_scale : float
        Length-scale parameter
    nu : float
        Smoothness parameter (common: 0.5, 1.5, 2.5)

    Returns:
    --------
    K : ndarray of shape (n1, n2)
    """
    # Compute pairwise Euclidean distances
    dists = cdist(X1, X2, metric='euclidean')

    # Handle special cases efficiently
    if nu == 0.5:
        # Matérn-1/2: Exponential kernel
        return sigma_sq * np.exp(-dists / length_scale)
    elif nu == 1.5:
        # Matérn-3/2
        sqrt3_r_over_ell = np.sqrt(3) * dists / length_scale
        return sigma_sq * (1 + sqrt3_r_over_ell) * np.exp(-sqrt3_r_over_ell)
    elif nu == 2.5:
        # Matérn-5/2
        sqrt5_r_over_ell = np.sqrt(5) * dists / length_scale
        return sigma_sq * (1 + sqrt5_r_over_ell + (sqrt5_r_over_ell**2) / 3) * np.exp(-sqrt5_r_over_ell)
    else:
        # General case using Bessel function
        scaled = np.sqrt(2 * nu) * dists / length_scale
        # Handle r=0 separately (limit is 1)
        K = np.ones_like(dists) * sigma_sq
        nonzero = dists > 1e-10
        K[nonzero] = sigma_sq * (
            (2**(1 - nu)) / gamma(nu)
            * (scaled[nonzero]**nu)
            * bessel_k(nu, scaled[nonzero])
        )
        return K

def compare_matern_smoothness():
    """
    Visualize how the smoothness parameter affects GP samples.
    This demonstrates the practical difference between ν values.
    """
    import matplotlib.pyplot as plt

    X = np.linspace(0, 10, 300).reshape(-1, 1)
    nu_values = [0.5, 1.5, 2.5, 5.0]

    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    axes = axes.flatten()

    np.random.seed(42)  # For reproducibility

    for ax, nu in zip(axes, nu_values):
        K = matern_kernel(X, X, sigma_sq=1.0, length_scale=1.0, nu=nu)
        K += 1e-6 * np.eye(len(X))  # Numerical stability

        # Sample from GP prior
        L = np.linalg.cholesky(K)
        samples = L @ np.random.randn(len(X), 5)

        for i in range(samples.shape[1]):
            ax.plot(X, samples[:, i], alpha=0.7)

        diff_order = int(np.ceil(nu) - 1) if nu < 10 else '∞'
        ax.set_title(f'Matérn-{nu} (C^{diff_order} smoothness)')
        ax.set_xlabel('x')
        ax.set_ylabel('f(x)')
        ax.grid(True, alpha=0.3)

    plt.suptitle('Matérn Kernel: Effect of Smoothness Parameter ν', fontsize=14)
    plt.tight_layout()
    return fig
```
Choosing the Smoothness Parameter
Unlike the length-scale, which is typically learned from data, the smoothness parameter $\nu$ often requires domain knowledge or careful consideration:
Physical constraints: If derivatives of the modeled function have physical meaning (velocity, acceleration), choose $\nu$ to ensure they exist
Visual inspection: Examine your data—does it appear smooth or rough? Sharp transitions suggest lower $\nu$
Model comparison: Use marginal likelihood to compare different $\nu$ values
Default to Matérn-5/2: When uncertain, this provides a good balance of smoothness and flexibility
Why Not Learn ν?
While $\nu$ can be treated as a hyperparameter to optimize, this is often problematic:
- The marginal likelihood is frequently nearly flat in $\nu$, so finite datasets rarely identify it well
- The general-$\nu$ kernel requires modified Bessel function evaluations, which are slower and numerically more delicate than the closed-form special cases
- Gradients with respect to $\nu$ are awkward to compute, making gradient-based optimization unreliable
For these reasons, practitioners usually fix $\nu$ to one of the special cases (0.5, 1.5, 2.5) and learn only $\sigma^2$ and $\ell$.
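To make the model-comparison route concrete, here is a sketch of scoring the candidate values of $\nu$ by log marginal likelihood. This is my own illustration: it reuses the matern_kernel function defined above, uses a made-up synthetic dataset, and keeps the other hyperparameters fixed for brevity (in practice they would be optimized separately for each $\nu$).

```python
import numpy as np

def log_marginal_likelihood(X, y, nu, sigma_sq=1.0, length_scale=1.0, noise_var=0.1):
    """Log evidence log p(y | X, theta) of a GP with Gaussian noise, for a fixed nu."""
    K = matern_kernel(X, X, sigma_sq=sigma_sq, length_scale=length_scale, nu=nu)
    K += noise_var * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # -0.5 y^T K^{-1} y - 0.5 log|K| - n/2 log(2*pi)
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))
            - 0.5 * len(y) * np.log(2 * np.pi))

# Synthetic example: a moderately rough function observed with noise
rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 10, 40)).reshape(-1, 1)
y = np.sin(X[:, 0]) + 0.3 * np.abs(np.sin(3 * X[:, 0])) + 0.1 * rng.standard_normal(40)

for nu in [0.5, 1.5, 2.5]:
    print(f"nu = {nu}: log marginal likelihood = {log_marginal_likelihood(X, y, nu):.2f}")
# Keep the nu with the highest value rather than optimizing nu directly.
```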
Many real-world phenomena exhibit periodicity: daily temperature cycles, seasonal patterns in sales data, weekly traffic patterns, or annual biological rhythms. The periodic kernel encodes this structure directly into the GP prior.
Mathematical Definition
$$k_{\text{Periodic}}(\mathbf{x}, \mathbf{x}') = \sigma^2 \exp\left(-\frac{2\sin^2\left(\pi \|\mathbf{x} - \mathbf{x}'\| / p\right)}{\ell^2}\right)$$
Where:
- $\sigma^2$ is the signal variance
- $p$ is the period, the distance after which the pattern repeats
- $\ell$ is the length-scale, controlling how smoothly the function varies within one period
Key Properties
Exact periodicity: $k(x, x') = k(x+p, x'+p)$ — the covariance structure repeats exactly every $p$ units
Cross-period correlation: Points separated by exactly one period ($|x - x'| = p$) have the same correlation as points at the same location ($|x - x'| = 0$)
Infinite repetition: The kernel assumes the periodic pattern continues forever in both directions
The standard periodic kernel assumes perfect, unchanging periodicity. Real data often has periods that evolve over time (e.g., lengthening days across months) or decay in amplitude. For such cases, consider multiplying the periodic kernel by an RBF kernel (creating a 'locally periodic' kernel) or using specialized quasi-periodic kernels.
```python
import numpy as np
from scipy.spatial.distance import cdist

def periodic_kernel(X1, X2, sigma_sq=1.0, period=1.0, length_scale=1.0):
    """
    Compute the Periodic (ExpSineSquared) kernel matrix.

    Parameters:
    -----------
    X1 : ndarray of shape (n1, d)
    X2 : ndarray of shape (n2, d)
    sigma_sq : float
        Signal variance
    period : float
        Period of the repeating pattern
    length_scale : float
        Length-scale (controls smoothness within period)

    Returns:
    --------
    K : ndarray of shape (n1, n2)
    """
    # For 1D data, compute distance directly
    if X1.ndim == 1:
        X1 = X1.reshape(-1, 1)
    if X2.ndim == 1:
        X2 = X2.reshape(-1, 1)

    # Compute pairwise distances (for 1D, this is just |x - x'|)
    dists = cdist(X1, X2, metric='euclidean')

    # Apply periodic kernel formula
    sin_term = np.sin(np.pi * dists / period)
    K = sigma_sq * np.exp(-2 * (sin_term / length_scale)**2)

    return K

def locally_periodic_kernel(X1, X2, sigma_sq=1.0, period=1.0,
                            length_scale_periodic=1.0, length_scale_decay=10.0):
    """
    Locally Periodic Kernel: Periodic × RBF

    This models patterns that are approximately periodic but can change
    smoothly over time (e.g., amplitude decay, period drift).

    k(x,x') = k_periodic(x,x') × k_rbf(x,x')
    """
    # Periodic component
    dists = cdist(X1, X2, metric='euclidean')
    sin_term = np.sin(np.pi * dists / period)
    K_periodic = np.exp(-2 * (sin_term / length_scale_periodic)**2)

    # RBF decay component
    sq_dists = cdist(X1, X2, metric='sqeuclidean')
    K_rbf = np.exp(-sq_dists / (2 * length_scale_decay**2))

    # Combined locally-periodic kernel
    return sigma_sq * K_periodic * K_rbf

# Demonstration: Modeling daily temperature patterns
def demo_periodic_modeling():
    import matplotlib.pyplot as plt

    # Generate synthetic temperature data with daily periodicity
    np.random.seed(42)
    hours = np.linspace(0, 72, 73)  # 3 days, hourly

    # True underlying function: daily cycle with some noise
    true_temp = 20 + 8 * np.sin(2 * np.pi * hours / 24 - np.pi/2)  # Peak around noon
    observed = true_temp + np.random.randn(len(hours)) * 1.5

    # Sample observed points (missing some hours)
    train_idx = np.random.choice(len(hours), 40, replace=False)
    train_idx.sort()
    X_train = hours[train_idx].reshape(-1, 1)
    y_train = observed[train_idx]

    # Predict on full grid
    X_test = np.linspace(0, 72, 200).reshape(-1, 1)

    # Build kernel with known period (24 hours)
    K_train = periodic_kernel(X_train, X_train, sigma_sq=64.0, period=24.0, length_scale=3.0)
    K_train += 2.25 * np.eye(len(X_train))  # Noise variance
    K_test_train = periodic_kernel(X_test, X_train, sigma_sq=64.0, period=24.0, length_scale=3.0)
    K_test = periodic_kernel(X_test, X_test, sigma_sq=64.0, period=24.0, length_scale=3.0)

    # GP posterior
    L = np.linalg.cholesky(K_train)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_test_train @ alpha
    v = np.linalg.solve(L, K_test_train.T)
    var = np.diag(K_test) - np.sum(v**2, axis=0)
    std = np.sqrt(var)

    # Plot
    fig, ax = plt.subplots(figsize=(12, 5))
    ax.fill_between(X_test.flatten(), mean - 2*std, mean + 2*std,
                    alpha=0.3, label='95% CI')
    ax.plot(X_test, mean, 'b-', label='GP Posterior Mean')
    ax.plot(hours, true_temp, 'g--', alpha=0.7, label='True Function')
    ax.scatter(X_train, y_train, c='red', s=30, zorder=5, label='Observations')
    ax.set_xlabel('Hours')
    ax.set_ylabel('Temperature (°C)')
    ax.set_title('Periodic Kernel: Modeling Daily Temperature Cycles')
    ax.legend()
    ax.grid(True, alpha=0.3)
    return fig
```
Understanding the Period and Length-Scale Interaction
The periodic kernel has a subtle interplay between its parameters:
Period $p$: Determines where the pattern repeats. This is often known from domain knowledge (24 hours for daily, 365.25 days for yearly)
Length-scale $\ell$: Controls how smoothly the pattern varies within each period. Smaller $\ell$ allows more wiggly variations within the period; larger $\ell$ produces smoother within-period behavior
When $\ell \ll p$: The function can vary rapidly within each period.
When $\ell \gg p$: The function becomes nearly sinusoidal (only the fundamental frequency matters).
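A quick way to see this interplay is to draw prior samples with $\ell$ much smaller and much larger than the period and compare how wiggly one cycle looks. The sketch below is an illustration reusing the periodic_kernel function defined above; the particular values are arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt

X = np.linspace(0, 3, 400).reshape(-1, 1)    # three periods with p = 1
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

for ax, ell in zip(axes, [0.3, 5.0]):        # ell << p  vs  ell >> p
    K = periodic_kernel(X, X, sigma_sq=1.0, period=1.0, length_scale=ell)
    K += 1e-8 * np.eye(len(X))               # jitter for the Cholesky factor
    samples = np.linalg.cholesky(K) @ np.random.default_rng(0).standard_normal((len(X), 3))
    ax.plot(X, samples, alpha=0.8)
    ax.set_title(f'period p = 1, length-scale ℓ = {ell}')
    ax.set_xlabel('x')

plt.suptitle('Small ℓ: wiggly cycles; large ℓ: nearly sinusoidal cycles')
plt.tight_layout()
```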
Extensions and Variations
Locally Periodic: $k_{\text{LP}} = k_{\text{Periodic}} \times k_{\text{RBF}}$
Quasi-Periodic: Period itself varies smoothly
Multiple Periods: Sum of periodic kernels with different periods
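For instance, a multiple-periods kernel is simply a sum of periodic kernels. The sketch below is an illustration reusing periodic_kernel from the code above, with hypothetical daily and weekly components for hourly inputs; the variances and length-scales are placeholders, not fitted values.

```python
import numpy as np

def daily_plus_weekly_kernel(X1, X2):
    """Sum of two periodic kernels: a daily cycle plus a weekly cycle (inputs in hours)."""
    K_daily = periodic_kernel(X1, X2, sigma_sq=4.0, period=24.0, length_scale=2.0)
    K_weekly = periodic_kernel(X1, X2, sigma_sq=1.0, period=168.0, length_scale=10.0)
    return K_daily + K_weekly   # a sum of PSD kernels is itself PSD

hours = np.linspace(0, 24 * 14, 200).reshape(-1, 1)   # two weeks of hourly inputs
K = daily_plus_weekly_kernel(hours, hours)
print(K.shape)   # (200, 200)
```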
Beyond the RBF, Matérn, and Periodic kernels, several other stationary kernels serve specific purposes. Understanding their characteristics helps you select the right tool for specialized applications.
| Kernel | Smoothness | Parameters | Key Feature | Computational Cost |
|---|---|---|---|---|
| RBF | C^∞ | σ², ℓ | Infinite smoothness | Low |
| Matérn-5/2 | C^2 | σ², ℓ | Controllable smoothness | Low |
| Matérn-3/2 | C^1 | σ², ℓ | Once differentiable | Low |
| Matérn-1/2 | C^0 | σ², ℓ | Continuous only (rough) | Low |
| Periodic | C^∞ | σ², ℓ, p | Exact periodicity | Low |
| Rational Quadratic | C^∞ | σ², ℓ, α | Multi-scale smoothness | Low |
| Spectral Mixture | Varies | Many | Extreme flexibility | Medium-High |
The Rational Quadratic Kernel
The Rational Quadratic (RQ) kernel deserves special attention due to its unique interpretation:
$$k_{\text{RQ}}(r) = \sigma^2 \left(1 + \frac{r^2}{2\alpha\ell^2}\right)^{-\alpha}$$
The parameter $\alpha > 0$ controls how the different length-scales are mixed:
- Small $\alpha$ mixes a broad range of length-scales, allowing variation at many scales simultaneously
- As $\alpha \to \infty$, the RQ kernel converges to the RBF kernel with length-scale $\ell$
Mathematical Insight: The RQ kernel can be derived as a scale mixture of RBF kernels where the length-scale follows an inverse-gamma distribution. This makes it ideal for functions that exhibit multi-scale behavior—smooth in some regions, with finer variations in others.
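A minimal implementation of the RQ kernel, plus a numerical check of the $\alpha \to \infty$ limit, is sketched below (my own illustration, following the formula above; the grid and parameter values are arbitrary).

```python
import numpy as np
from scipy.spatial.distance import cdist

def rational_quadratic_kernel(X1, X2, sigma_sq=1.0, length_scale=1.0, alpha=1.0):
    """Rational Quadratic kernel: a scale mixture of RBF kernels over length-scales."""
    sq_dists = cdist(X1, X2, metric='sqeuclidean')
    return sigma_sq * (1 + sq_dists / (2 * alpha * length_scale**2)) ** (-alpha)

# As alpha grows, the RQ kernel approaches the RBF kernel with the same length-scale
X = np.linspace(0, 5, 50).reshape(-1, 1)
K_rbf = np.exp(-cdist(X, X, metric='sqeuclidean') / (2 * 1.0**2))
for alpha in [0.5, 2.0, 100.0]:
    K_rq = rational_quadratic_kernel(X, X, length_scale=1.0, alpha=alpha)
    print(f"alpha = {alpha:>6}: max |K_RQ - K_RBF| = {np.abs(K_rq - K_rbf).max():.4f}")
```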
Practical Guidance:
- Try the RQ kernel when residuals from an RBF or Matérn fit suggest structure at more than one length-scale
- If the learned $\alpha$ grows very large during optimization, the kernel has effectively collapsed to an RBF, and the simpler kernel can be substituted
Selecting the right kernel is both an art and a science. While hyperparameters can be learned from data, the kernel family encodes structural assumptions that must often come from domain knowledge or exploratory analysis.
A Systematic Selection Process
Stationary kernels are a good fit for:
✓ Homogeneous domains (no spatial variation in properties)
✓ Temporal data without trend or drift
✓ Functions where smoothness is constant
✓ As building blocks for more complex kernels

They are a poor fit for:
✗ Functions with varying smoothness
✗ Data with trends or drifts
✗ Boundaries with different behavior
✗ Multi-fidelity or hierarchical data
Common Pitfalls and How to Avoid Them
Over-Reliance on RBF: The RBF kernel is the 'Gaussian distribution of kernels'—simple and often good enough, but its infinite smoothness assumption is rarely accurate. Default to Matérn-5/2 instead.
Ignoring Periodicity: Failing to model known periodic structure leads to poor predictions. If you know the period, encode it!
Too Many Parameters: Complex kernels with many hyperparameters can overfit or be hard to optimize. Start simple and add complexity incrementally.
Forgetting Scaling: Input features should typically be normalized. A length-scale of 1.0 means different things for features ranging [0,1] vs [0,1000].
Not Visualizing Samples: Before fitting, sample from the prior to check if the kernel produces reasonable-looking functions. If prior samples don't resemble plausible data, reconsider your kernel choice.
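The last two pitfalls are cheap to guard against in code. Below is a small sketch (my own illustration; the feature range is made up) that standardizes inputs so a length-scale of 1.0 means "one standard deviation of the feature", then draws a few prior samples as a sanity check before any fitting.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)

# Hypothetical raw feature on a wide scale, e.g. values in [0, 1000]
X_raw = np.sort(rng.uniform(0, 1000, 150)).reshape(-1, 1)

# Standardize so length-scales are expressed in units of one standard deviation
X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

# Prior sanity check: do samples from the chosen kernel look like plausible data?
K = np.exp(-cdist(X, X, metric='sqeuclidean') / (2 * 1.0**2))  # RBF with ell = 1.0
K += 1e-8 * np.eye(len(X))                                     # jitter for stability
samples = np.linalg.cholesky(K) @ rng.standard_normal((len(X), 5))

plt.plot(X_raw, samples, alpha=0.7)
plt.xlabel('raw feature value')
plt.ylabel('prior sample f(x)')
plt.title('Prior samples after input standardization (RBF, ℓ = 1)')
```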
Stationary kernels form the foundation of most Gaussian Process applications. Their shift-invariance property—that correlations depend only on distances, not absolute positions—makes them versatile and interpretable.
What's Next:
Stationary kernels assume uniform properties across the input space. In the next page, we'll explore non-stationary kernels that can model functions whose characteristics vary with location—essential for handling trends, boundaries, and heterogeneous domains.
You now understand stationary kernels: their mathematical foundations, the major examples (RBF, Matérn, Periodic), and when to apply each. This knowledge forms the vocabulary for discussing GP priors and prepares you for the greater flexibility of non-stationary and composite kernels.