In Gaussian Process regression, the kernel function (also called the covariance function) is arguably the single most important modeling decision you will make. While the GP framework provides the probabilistic machinery for inference and uncertainty quantification, it is the kernel that encodes your prior beliefs about the function you are trying to learn.
Think of the kernel as the language you use to communicate with the learning algorithm. Through the kernel, you tell the GP which functions are plausible a priori: how smooth they are, how quickly they vary, how large their fluctuations are, and whether they repeat.
By the end of this page, you will understand the mathematical requirements for valid kernels, the key properties that kernels encode, the most important kernel families used in practice, and the principles for designing kernels that capture your domain knowledge. You will see how a well-chosen kernel can make the difference between a GP that captures the true underlying function and one that fails entirely.
A kernel function $k: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ defines the covariance between function values at any two input points. For a Gaussian Process $f \sim \mathcal{GP}(m, k)$ with mean function $m$ and kernel $k$:
$$\text{Cov}(f(\mathbf{x}), f(\mathbf{x}')) = k(\mathbf{x}, \mathbf{x}')$$
This seemingly simple definition carries profound implications. The kernel completely determines the prior distribution over functions before any data is observed. But not every function $k(\mathbf{x}, \mathbf{x}')$ qualifies as a valid kernel.
For a function to be a valid kernel, it must be positive semi-definite (PSD). This means that for any finite set of points $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$, the resulting kernel matrix $K$ with entries $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$ must be positive semi-definite: $\mathbf{v}^T K \mathbf{v} \geq 0$ for all vectors $\mathbf{v} \in \mathbb{R}^n$.
Why is positive semi-definiteness required?
The fundamental reason is that a kernel matrix $K$ serves as the covariance matrix of a multivariate Gaussian distribution. For a distribution $\mathbf{f} \sim \mathcal{N}(\mathbf{m}, K)$ to be well-defined, the variance of every linear combination of function values must be non-negative: $\text{Var}(\mathbf{v}^T \mathbf{f}) = \mathbf{v}^T K \mathbf{v} \geq 0$ for every $\mathbf{v}$. This requirement is satisfied if and only if $K$ is positive semi-definite.
```python
import numpy as np
from numpy.linalg import eigvalsh


def is_positive_semidefinite(K: np.ndarray, tol: float = 1e-10) -> bool:
    """
    Check if a matrix K is positive semi-definite.

    A matrix is PSD if and only if all its eigenvalues are non-negative.
    We use a small tolerance for numerical stability.

    Parameters:
    -----------
    K : np.ndarray
        Square matrix to check (should be symmetric)
    tol : float
        Tolerance for considering eigenvalues as non-negative

    Returns:
    --------
    bool : True if K is positive semi-definite
    """
    # First check symmetry (kernel matrices should be symmetric)
    if not np.allclose(K, K.T, atol=tol):
        print("Warning: Matrix is not symmetric")
        return False

    # Compute eigenvalues (eigvalsh is for symmetric matrices)
    eigenvalues = eigvalsh(K)

    # Check if all eigenvalues are non-negative (within tolerance)
    min_eigenvalue = np.min(eigenvalues)

    if min_eigenvalue < -tol:
        print(f"Minimum eigenvalue: {min_eigenvalue:.2e} - NOT PSD")
        return False

    print(f"Minimum eigenvalue: {min_eigenvalue:.2e} - Matrix is PSD")
    return True


def demonstrate_psd_check():
    """Demonstrate PSD checking with various matrices."""
    # Example 1: Valid kernel matrix (RBF kernel)
    X = np.array([[0], [1], [2], [3]])
    length_scale = 1.0

    # RBF kernel: k(x, x') = exp(-||x - x'||^2 / (2 * l^2))
    K_rbf = np.exp(-0.5 * np.sum((X[:, None] - X[None, :]) ** 2, axis=2) / length_scale**2)

    print("RBF Kernel Matrix:")
    print(K_rbf.round(4))
    is_positive_semidefinite(K_rbf)  # Should be True

    print("\n" + "=" * 50 + "\n")

    # Example 2: Invalid "kernel" - not PSD
    # This is NOT a valid kernel function!
    K_invalid = np.array([
        [1.0, 0.9, 0.9],
        [0.9, 1.0, -0.9],  # Negative correlation
        [0.9, -0.9, 1.0]
    ])

    print("Invalid 'Kernel' Matrix (not actually a kernel):")
    print(K_invalid)
    is_positive_semidefinite(K_invalid)  # Should be False


if __name__ == "__main__":
    demonstrate_psd_check()
```

Mercer's Theorem: The Foundation of Kernels
Mercer's theorem provides the theoretical foundation for understanding kernels. It states that any continuous, symmetric, positive semi-definite function $k(\mathbf{x}, \mathbf{x}')$ can be expressed as an inner product in some (possibly infinite-dimensional) Hilbert space $\mathcal{H}$:
$$k(\mathbf{x}, \mathbf{x}') = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle_{\mathcal{H}}$$
where $\phi: \mathcal{X} \rightarrow \mathcal{H}$ is a feature mapping. This is profound: every valid kernel implicitly defines a feature space, even if that feature space is infinite-dimensional.
For Gaussian Processes, this means that GP regression with kernel $k$ is equivalent to Bayesian linear regression on the (possibly infinite-dimensional) features $\phi(\mathbf{x})$: the GP works in that feature space implicitly, without ever constructing $\phi$ explicitly.
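To make the feature-space view concrete, here is a small numerical check using the degree-2 polynomial kernel $k(x, x') = (1 + xx')^2$ as an illustrative example (the helper names below are written just for this demonstration): its explicit feature map is $\phi(x) = [1, \sqrt{2}\,x, x^2]$, and the inner product of the features reproduces the kernel exactly. For kernels like the RBF, the corresponding feature space is infinite-dimensional, so no such finite check is possible; the GP simply uses the kernel directly.

```python
import numpy as np


def poly2_kernel(x: float, y: float) -> float:
    """Degree-2 polynomial kernel k(x, y) = (1 + x*y)**2."""
    return (1.0 + x * y) ** 2


def poly2_features(x: float) -> np.ndarray:
    """Explicit (finite-dimensional) feature map for the same kernel."""
    return np.array([1.0, np.sqrt(2.0) * x, x ** 2])


# The kernel value equals the inner product of the explicit features,
# illustrating Mercer's theorem for a kernel with a finite feature map.
for x, y in [(0.5, 2.0), (-1.0, 3.0), (1.5, 1.5)]:
    k_val = poly2_kernel(x, y)
    phi_val = poly2_features(x) @ poly2_features(y)
    assert np.isclose(k_val, phi_val)
    print(f"k({x}, {y}) = {k_val:.4f},  <phi(x), phi(y)> = {phi_val:.4f}")
```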
Different kernels encode different assumptions about the functions they generate. Understanding these properties is essential for informed kernel selection.
| Property | Mathematical Characterization | Effect on Generated Functions |
|---|---|---|
| Stationarity | $k(\mathbf{x}, \mathbf{x}') = k(\mathbf{x} - \mathbf{x}')$ | Translation invariance; same behavior everywhere |
| Isotropy | $k(\mathbf{x}, \mathbf{x}') = k(\|\mathbf{x} - \mathbf{x}'\|)$ | Rotation invariance; no preferred direction |
| Smoothness | Differentiability of $k$ at origin | Differentiability of sample paths |
| Length Scale $\ell$ | Rate of decay of $k$ with $\|\mathbf{x} - \mathbf{x}'\|$ | Characteristic variation scale |
| Marginal Variance $\sigma^2$ | $k(\mathbf{x}, \mathbf{x}) = \sigma^2$ | Typical amplitude of deviations |
| Periodicity $p$ | $k(\mathbf{x} + p, \mathbf{x}' + p) = k(\mathbf{x}, \mathbf{x}')$ | Repeating patterns with period $p$ |
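As a quick numerical sanity check of the stationarity and periodicity rows in this table, the sketch below defines simple one-dimensional kernel functions inline (illustrative helpers written just for this check, not part of any implementation on this page) and verifies both invariances.

```python
import numpy as np

ell, sigma2, p = 1.5, 1.0, 2.0  # length scale, signal variance, period


def k_rbf(x, y):
    """Stationary RBF kernel in 1D."""
    return sigma2 * np.exp(-(x - y) ** 2 / (2 * ell ** 2))


def k_per(x, y):
    """Standard periodic kernel in 1D with period p."""
    return sigma2 * np.exp(-2 * np.sin(np.pi * np.abs(x - y) / p) ** 2 / ell ** 2)


x, y, shift = 0.7, 2.3, 5.0

# Stationarity: shifting both inputs by the same amount leaves k unchanged.
assert np.isclose(k_rbf(x, y), k_rbf(x + shift, y + shift))

# Periodicity: shifting both inputs by one period leaves k unchanged
# (shifting either input by p alone also leaves it unchanged).
assert np.isclose(k_per(x, y), k_per(x + p, y + p))
assert np.isclose(k_per(x, y), k_per(x + p, y))

print("Stationarity and periodicity checks passed.")
```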
The Smoothness-Flexibility Tradeoff
There is a fundamental tradeoff between smoothness and flexibility. Smoother kernels (like RBF) produce elegant, continuous functions but may miss sharp transitions in the data. Rougher kernels (like Matérn-1/2) can capture discontinuities but may overfit to noise.
The right choice depends on your domain knowledge: prefer smoother kernels (RBF, Matérn-5/2) when you believe the underlying process is genuinely smooth, and rougher kernels (Matérn-3/2, Matérn-1/2) when the data may contain kinks, sharp transitions, or fine-grained texture.
The Squared Exponential kernel, also known as the Radial Basis Function (RBF) kernel or the Gaussian kernel, is the most widely used kernel in Gaussian Process regression. Its popularity stems from its mathematical tractability and the smooth functions it generates.
$$k_{SE}(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\ell^2}\right)$$
where $\sigma_f^2$ is the signal variance, controlling the amplitude of function variations, and $\ell$ is the characteristic length scale, controlling how quickly the correlation decays with distance.
The length scale $\ell$ is measured in the same units as your input features. If $\ell = 10$ and your inputs are in meters, then points more than about 30 meters apart (three length scales) will have nearly independent function values. Points within one length scale are strongly correlated.
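Putting numbers on this rule of thumb, the snippet below simply evaluates the RBF correlation $\exp(-r^2/(2\ell^2))$ at separations of one, two, and three length scales; the full RBF kernel implementation follows afterwards.

```python
import numpy as np

ell = 10.0  # length scale in the same units as the inputs (e.g., meters)

for n in [1, 2, 3]:
    r = n * ell
    corr = np.exp(-r**2 / (2 * ell**2))
    print(f"separation = {n} length scale(s): correlation = {corr:.3f}")

# separation = 1 length scale(s): correlation ~ 0.607
# separation = 2 length scale(s): correlation ~ 0.135
# separation = 3 length scale(s): correlation ~ 0.011
```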
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist


class RBFKernel:
    """
    Squared Exponential (RBF) Kernel Implementation.

    The RBF kernel produces infinitely differentiable sample functions.
    It is the most commonly used kernel in Gaussian Process regression
    due to its mathematical simplicity and smooth function generation.

    Parameters:
    -----------
    length_scale : float
        The characteristic length scale. Larger values produce smoother,
        more slowly varying functions.
    signal_variance : float
        The signal variance σ². Controls the overall amplitude
        of the function variations.
    """

    def __init__(self, length_scale: float = 1.0, signal_variance: float = 1.0):
        self.length_scale = length_scale
        self.signal_variance = signal_variance

    def __call__(self, X1: np.ndarray, X2: np.ndarray) -> np.ndarray:
        """
        Compute the kernel matrix between X1 and X2.

        Parameters:
        -----------
        X1 : np.ndarray of shape (n1, d)
            First set of input points
        X2 : np.ndarray of shape (n2, d)
            Second set of input points

        Returns:
        --------
        K : np.ndarray of shape (n1, n2)
            Kernel matrix where K[i,j] = k(X1[i], X2[j])
        """
        # Compute squared Euclidean distances
        sq_dists = cdist(X1, X2, metric='sqeuclidean')

        # Apply RBF formula
        K = self.signal_variance * np.exp(-sq_dists / (2 * self.length_scale ** 2))

        return K

    def diagonal(self, X: np.ndarray) -> np.ndarray:
        """
        Compute only the diagonal elements k(x_i, x_i).

        For the RBF kernel, all diagonal elements equal signal_variance.
        This is useful for computing variances without full matrix computation.
        """
        return np.full(X.shape[0], self.signal_variance)

    def gradients(self, X1: np.ndarray, X2: np.ndarray) -> dict:
        """
        Compute gradients of the kernel with respect to hyperparameters.

        These gradients are essential for hyperparameter optimization
        via gradient descent on the log marginal likelihood.

        Returns:
        --------
        dict : Dictionary containing:
            - 'length_scale': ∂K/∂ℓ
            - 'signal_variance': ∂K/∂σ²
        """
        K = self(X1, X2)
        sq_dists = cdist(X1, X2, metric='sqeuclidean')

        # Gradient w.r.t. length scale
        # ∂K/∂ℓ = K * (||x - x'||² / ℓ³)
        dK_dl = K * sq_dists / (self.length_scale ** 3)

        # Gradient w.r.t. signal variance
        # ∂K/∂σ² = K / σ²
        dK_dsv = K / self.signal_variance

        return {
            'length_scale': dK_dl,
            'signal_variance': dK_dsv
        }


def visualize_rbf_effect():
    """Visualize the effect of different length scales on GP samples."""
    np.random.seed(42)

    # Test points
    X_test = np.linspace(0, 10, 200).reshape(-1, 1)

    # Different length scales
    length_scales = [0.5, 1.0, 2.0, 5.0]

    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    axes = axes.flatten()

    for ax, ls in zip(axes, length_scales):
        kernel = RBFKernel(length_scale=ls, signal_variance=1.0)
        K = kernel(X_test, X_test)

        # Add small jitter for numerical stability
        K += 1e-8 * np.eye(len(X_test))

        # Draw 3 samples from the GP prior
        L = np.linalg.cholesky(K)
        samples = L @ np.random.randn(len(X_test), 3)

        for i in range(3):
            ax.plot(X_test, samples[:, i], alpha=0.8, linewidth=1.5)

        ax.set_title(f'Length Scale ℓ = {ls}', fontsize=12)
        ax.set_xlabel('x')
        ax.set_ylabel('f(x)')
        ax.grid(True, alpha=0.3)
        ax.set_ylim(-3, 3)

    plt.suptitle('RBF Kernel: Effect of Length Scale on GP Samples', fontsize=14)
    plt.tight_layout()
    plt.savefig('rbf_length_scale_effect.png', dpi=150, bbox_inches='tight')
    plt.show()


if __name__ == "__main__":
    visualize_rbf_effect()
```

Mathematical Properties of the RBF Kernel
Infinite differentiability: Sample functions from an RBF-kernel GP are infinitely differentiable (smooth). This can be both a strength (no jagged edges) and a weakness (may not capture discontinuities).
Universal approximator: The RBF kernel is universal, meaning it can approximate any continuous function on a compact set to arbitrary precision (given enough data).
Spectral density: The Fourier transform of the RBF kernel is also Gaussian, with spectral density: $$S(\omega) \propto \exp\left(-\frac{\ell^2 |\omega|^2}{2}\right)$$ This means functions drawn from an RBF kernel have most of their power at low frequencies.
Eigenfunction decomposition: On a bounded domain with periodic boundary conditions, the eigenfunctions of the RBF kernel (as for any stationary kernel) are sine and cosine functions, and the corresponding eigenvalues decay extremely rapidly.
The infinite smoothness of the RBF kernel can be a liability. Real-world data often contains jumps, kinks, or rough textures that the RBF kernel cannot capture well. If your predictions look 'too smooth' compared to reality, consider the Matérn family instead.
The Matérn kernel family is arguably the most important kernel family for practical Gaussian Process applications. Unlike the RBF kernel, Matérn kernels allow you to control the smoothness of the generated functions through a single parameter $\nu$.
$$k_{Matérn}(r) = \sigma_f^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}r}{\ell}\right)^\nu K_\nu\left(\frac{\sqrt{2\nu}r}{\ell}\right)$$
where $r = \|\mathbf{x} - \mathbf{x}'\|$ is the Euclidean distance between inputs, $\nu > 0$ is the smoothness parameter, $\ell$ is the length scale, $\Gamma$ is the gamma function, and $K_\nu$ is the modified Bessel function of the second kind.
The Smoothness Parameter $\nu$
The parameter $\nu$ determines how many times the function is differentiable: samples from a Matérn-$\nu$ GP are $k$ times (mean-square) differentiable if and only if $\nu > k$. In particular, $\nu = 1/2$ gives continuous but non-differentiable samples, $\nu = 3/2$ gives once-differentiable samples, $\nu = 5/2$ gives twice-differentiable samples, and as $\nu \to \infty$ the Matérn kernel recovers the infinitely smooth RBF kernel.
In practice, three special cases are used almost exclusively because they have simple closed forms:
Matérn-1/2 (Exponential Kernel): $$k_{1/2}(r) = \sigma_f^2 \exp\left(-\frac{r}{\ell}\right)$$ Produces rough, non-differentiable functions (continuous but nowhere differentiable, like Brownian motion).
Matérn-3/2: $$k_{3/2}(r) = \sigma_f^2 \left(1 + \frac{\sqrt{3}r}{\ell}\right) \exp\left(-\frac{\sqrt{3}r}{\ell}\right)$$ Produces functions that are once differentiable but not twice. This is often recommended as a default.
Matérn-5/2: $$k_{5/2}(r) = \sigma_f^2 \left(1 + \frac{\sqrt{5}r}{\ell} + \frac{5r^2}{3\ell^2}\right) \exp\left(-\frac{\sqrt{5}r}{\ell}\right)$$ Produces functions that are twice differentiable. Smooth but not infinitely so.
```python
import numpy as np
from scipy.spatial.distance import cdist


class MaternKernel:
    """
    Matérn Kernel Family Implementation.

    The Matérn kernels provide a continuous parameterization of smoothness.
    In practice, we use ν ∈ {1/2, 3/2, 5/2} for computational efficiency.

    Parameters:
    -----------
    nu : float
        Smoothness parameter. Must be one of [0.5, 1.5, 2.5]
    length_scale : float
        Characteristic length scale
    signal_variance : float
        Signal variance σ²
    """

    def __init__(self, nu: float = 2.5, length_scale: float = 1.0,
                 signal_variance: float = 1.0):
        if nu not in [0.5, 1.5, 2.5]:
            raise ValueError("nu must be 0.5, 1.5, or 2.5 for efficient computation")
        self.nu = nu
        self.length_scale = length_scale
        self.signal_variance = signal_variance

    def __call__(self, X1: np.ndarray, X2: np.ndarray) -> np.ndarray:
        """Compute the Matérn kernel matrix."""
        # Compute Euclidean distances
        dists = cdist(X1, X2, metric='euclidean')

        if self.nu == 0.5:
            # Matérn-1/2 (Exponential kernel)
            K = self._matern_1_2(dists)
        elif self.nu == 1.5:
            # Matérn-3/2
            K = self._matern_3_2(dists)
        elif self.nu == 2.5:
            # Matérn-5/2
            K = self._matern_5_2(dists)

        return self.signal_variance * K

    def _matern_1_2(self, dists: np.ndarray) -> np.ndarray:
        """
        Matérn-1/2: k(r) = exp(-r/ℓ)

        This is equivalent to the Ornstein-Uhlenbeck process.
        Produces continuous but not differentiable functions.
        """
        return np.exp(-dists / self.length_scale)

    def _matern_3_2(self, dists: np.ndarray) -> np.ndarray:
        """
        Matérn-3/2: k(r) = (1 + √3r/ℓ) * exp(-√3r/ℓ)

        Produces once-differentiable functions.
        A good default choice for many applications.
        """
        sqrt3_r_over_l = np.sqrt(3) * dists / self.length_scale
        return (1 + sqrt3_r_over_l) * np.exp(-sqrt3_r_over_l)

    def _matern_5_2(self, dists: np.ndarray) -> np.ndarray:
        """
        Matérn-5/2: k(r) = (1 + √5r/ℓ + 5r²/(3ℓ²)) * exp(-√5r/ℓ)

        Produces twice-differentiable functions.
        Good for modeling physical systems.
        """
        sqrt5_r_over_l = np.sqrt(5) * dists / self.length_scale
        return (1 + sqrt5_r_over_l + 5 * dists**2 / (3 * self.length_scale**2)) * \
            np.exp(-sqrt5_r_over_l)


def compare_matern_smoothness():
    """Visualize the effect of smoothness parameter on function samples."""
    import matplotlib.pyplot as plt

    np.random.seed(42)
    X_test = np.linspace(0, 10, 200).reshape(-1, 1)

    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    for ax, nu in zip(axes, [0.5, 1.5, 2.5]):
        kernel = MaternKernel(nu=nu, length_scale=1.0, signal_variance=1.0)
        K = kernel(X_test, X_test)
        K += 1e-6 * np.eye(len(X_test))  # Numerical stability

        L = np.linalg.cholesky(K)
        samples = L @ np.random.randn(len(X_test), 3)

        for i in range(3):
            ax.plot(X_test, samples[:, i], alpha=0.8, linewidth=1.5)

        ax.set_title(f'Matérn-ν={nu}', fontsize=12)
        ax.set_xlabel('x')
        ax.set_ylabel('f(x)')
        ax.grid(True, alpha=0.3)

    plt.suptitle('Matérn Kernel: Effect of Smoothness Parameter ν', fontsize=14)
    plt.tight_layout()
    plt.savefig('matern_smoothness_comparison.png', dpi=150, bbox_inches='tight')
    plt.show()


if __name__ == "__main__":
    compare_matern_smoothness()
```

Many practitioners recommend Matérn-3/2 as the default kernel instead of RBF. It provides a good balance between smoothness and flexibility, can capture moderate roughness in data, and avoids the pathologically smooth behavior of the RBF kernel. Carl Rasmussen (author of the canonical GP textbook) often advocates for this choice.
Many real-world phenomena exhibit periodic patterns: daily temperature cycles, weekly sales patterns, annual seasonality, heartbeat rhythms. To capture these patterns, we need kernels that encode periodicity.
The Standard Periodic Kernel
$$k_{Per}(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\left(-\frac{2\sin^2(\pi\|\mathbf{x} - \mathbf{x}'\|/p)}{\ell^2}\right)$$
where $p$ is the period, $\ell$ is the length scale controlling smoothness within each period, and $\sigma_f^2$ is the signal variance.
This kernel produces functions that are exactly periodic with period $p$. The function value at $x$ equals the function value at $x + p$ (up to random variation).
```python
import numpy as np
from scipy.spatial.distance import cdist


class PeriodicKernel:
    """
    Periodic Kernel for modeling cyclical phenomena.

    This kernel produces functions that repeat with a specified period.
    Useful for: seasonal effects, daily/weekly patterns, physical oscillations.

    Parameters:
    -----------
    period : float
        The period of the function (in input units)
    length_scale : float
        Controls smoothness within each period
    signal_variance : float
        Controls amplitude of variations
    """

    def __init__(self, period: float = 1.0, length_scale: float = 1.0,
                 signal_variance: float = 1.0):
        self.period = period
        self.length_scale = length_scale
        self.signal_variance = signal_variance

    def __call__(self, X1: np.ndarray, X2: np.ndarray) -> np.ndarray:
        """Compute the periodic kernel matrix."""
        # Compute pairwise differences
        if X1.ndim == 1:
            X1 = X1.reshape(-1, 1)
        if X2.ndim == 1:
            X2 = X2.reshape(-1, 1)

        # For 1D inputs, compute |x - x'|
        diffs = cdist(X1, X2, metric='euclidean')

        # Periodic kernel formula
        arg = np.pi * diffs / self.period
        K = self.signal_variance * np.exp(
            -2 * np.sin(arg)**2 / self.length_scale**2
        )

        return K


class LocallyPeriodicKernel:
    """
    Locally Periodic Kernel: periodicity that decays with distance.

    This is a product of the Periodic kernel and the RBF kernel:
        k_lp = k_per × k_rbf

    Produces functions that are approximately periodic but allow the
    pattern to vary slowly over time (e.g., seasonal patterns in climate
    that drift over decades).
    """

    def __init__(self, period: float = 1.0, periodic_ls: float = 1.0,
                 decay_ls: float = 10.0, signal_variance: float = 1.0):
        self.period = period
        self.periodic_ls = periodic_ls
        self.decay_ls = decay_ls
        self.signal_variance = signal_variance

    def __call__(self, X1: np.ndarray, X2: np.ndarray) -> np.ndarray:
        """Compute the locally periodic kernel matrix."""
        if X1.ndim == 1:
            X1 = X1.reshape(-1, 1)
        if X2.ndim == 1:
            X2 = X2.reshape(-1, 1)

        diffs = cdist(X1, X2, metric='euclidean')
        sq_diffs = diffs ** 2

        # Periodic component
        arg = np.pi * diffs / self.period
        K_per = np.exp(-2 * np.sin(arg)**2 / self.periodic_ls**2)

        # RBF decay component
        K_rbf = np.exp(-sq_diffs / (2 * self.decay_ls**2))

        return self.signal_variance * K_per * K_rbf


def demonstrate_periodic_kernels():
    """Visualize samples from periodic kernels."""
    import matplotlib.pyplot as plt

    np.random.seed(42)
    X_test = np.linspace(0, 10, 300).reshape(-1, 1)

    fig, axes = plt.subplots(2, 2, figsize=(14, 10))

    # Pure periodic kernel
    kernel1 = PeriodicKernel(period=2.0, length_scale=0.5)
    K1 = kernel1(X_test, X_test) + 1e-6 * np.eye(len(X_test))
    L1 = np.linalg.cholesky(K1)
    samples1 = L1 @ np.random.randn(len(X_test), 2)

    axes[0, 0].plot(X_test, samples1, linewidth=1.5)
    axes[0, 0].set_title('Pure Periodic Kernel (period=2)', fontsize=12)
    axes[0, 0].axvline(x=2, color='gray', linestyle='--', alpha=0.5)
    axes[0, 0].axvline(x=4, color='gray', linestyle='--', alpha=0.5)
    axes[0, 0].axvline(x=6, color='gray', linestyle='--', alpha=0.5)
    axes[0, 0].axvline(x=8, color='gray', linestyle='--', alpha=0.5)
    axes[0, 0].set_xlabel('x')
    axes[0, 0].set_ylabel('f(x)')
    axes[0, 0].grid(True, alpha=0.3)

    # Locally periodic kernel
    kernel2 = LocallyPeriodicKernel(period=2.0, periodic_ls=0.5, decay_ls=4.0)
    K2 = kernel2(X_test, X_test) + 1e-6 * np.eye(len(X_test))
    L2 = np.linalg.cholesky(K2)
    samples2 = L2 @ np.random.randn(len(X_test), 2)

    axes[0, 1].plot(X_test, samples2, linewidth=1.5)
    axes[0, 1].set_title('Locally Periodic Kernel (pattern decays)', fontsize=12)
    axes[0, 1].set_xlabel('x')
    axes[0, 1].set_ylabel('f(x)')
    axes[0, 1].grid(True, alpha=0.3)

    # Effect of period parameter
    for period in [1.0, 3.0]:
        kernel = PeriodicKernel(period=period, length_scale=0.5)
        K = kernel(X_test, X_test) + 1e-6 * np.eye(len(X_test))
        L = np.linalg.cholesky(K)
        sample = (L @ np.random.randn(len(X_test), 1)).flatten()
        axes[1, 0].plot(X_test, sample, linewidth=1.5, label=f'period={period}')

    axes[1, 0].set_title('Effect of Period Parameter', fontsize=12)
    axes[1, 0].set_xlabel('x')
    axes[1, 0].set_ylabel('f(x)')
    axes[1, 0].legend()
    axes[1, 0].grid(True, alpha=0.3)

    # Effect of length scale in periodic kernel
    for ls in [0.3, 0.7, 1.5]:
        kernel = PeriodicKernel(period=2.0, length_scale=ls)
        K = kernel(X_test, X_test) + 1e-6 * np.eye(len(X_test))
        L = np.linalg.cholesky(K)
        sample = (L @ np.random.randn(len(X_test), 1)).flatten()
        axes[1, 1].plot(X_test, sample, linewidth=1.5, label=f'length_scale={ls}')

    axes[1, 1].set_title('Effect of Length Scale (within period)', fontsize=12)
    axes[1, 1].set_xlabel('x')
    axes[1, 1].set_ylabel('f(x)')
    axes[1, 1].legend()
    axes[1, 1].grid(True, alpha=0.3)

    plt.tight_layout()
    plt.savefig('periodic_kernels.png', dpi=150, bbox_inches='tight')
    plt.show()


if __name__ == "__main__":
    demonstrate_periodic_kernels()
```

Non-Stationary Kernels
All the kernels we've discussed so far are stationary: they depend only on the difference between inputs, not their absolute positions. But many real-world problems exhibit non-stationary behavior, where the function is smooth in some regions and rough in others, or where its amplitude and length scale change across the input space.
Common approaches to non-stationarity include warping the inputs through a fixed nonlinear transformation before applying a stationary kernel, letting the length scale itself vary with location, switching between kernels at change points, and using inherently non-stationary kernels such as the linear and polynomial kernels discussed next; the sketch below illustrates the input-warping idea.
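As a hedged illustration of the warping approach, the sketch below builds a non-stationary kernel by applying the RBF kernel to warped inputs. The warp $w(x) = \log(1 + x)$ and the helper name `warped_rbf` are hypothetical choices made purely for demonstration; because the result is just an RBF kernel evaluated at transformed points, it remains a valid PSD kernel, yet the same input separation produces different correlations in different parts of the input space.

```python
import numpy as np
from scipy.spatial.distance import cdist


def warped_rbf(X1: np.ndarray, X2: np.ndarray,
               length_scale: float = 0.5, signal_variance: float = 1.0) -> np.ndarray:
    """
    Non-stationary kernel built by warping inputs before a stationary RBF:
        k(x, x') = k_rbf(w(x), w(x'))
    The warp w(x) = log(1 + x) is an illustrative assumption (x >= 0).
    """
    W1 = np.log1p(X1)
    W2 = np.log1p(X2)
    sq_dists = cdist(W1, W2, metric='sqeuclidean')
    return signal_variance * np.exp(-sq_dists / (2 * length_scale ** 2))


# Two pairs of points, each separated by 1 unit, at different locations:
X_near_zero = np.array([[0.0], [1.0]])   # warp is steep here -> faster decay
X_far_out = np.array([[20.0], [21.0]])   # warp is nearly flat here -> slower decay

print(warped_rbf(X_near_zero, X_near_zero)[0, 1])  # lower correlation
print(warped_rbf(X_far_out, X_far_out)[0, 1])      # higher correlation
```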
While stationary kernels like RBF and Matérn are widely used, polynomial kernels offer a different perspective. They are non-stationary and can capture global trends and polynomial relationships in the data.
The Linear Kernel
$$k_{Lin}(\mathbf{x}, \mathbf{x}') = \sigma_b^2 + \sigma_v^2 (\mathbf{x} - c)(\mathbf{x}' - c)$$
where $\sigma_b^2$ is the bias variance (the uncertainty in the function value at $\mathbf{x} = c$), $\sigma_v^2$ is the slope variance, and $c$ is a reference offset (often 0).
The linear kernel produces linear functions (straight lines in 1D, hyperplanes in higher dimensions). It's equivalent to Bayesian linear regression.
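One way to see the equivalence: take a Bayesian linear model $f(x) = b + w(x - c)$ with independent priors $b \sim \mathcal{N}(0, \sigma_b^2)$ and $w \sim \mathcal{N}(0, \sigma_v^2)$. Its covariance function is exactly the linear kernel:

$$\text{Cov}(f(x), f(x')) = \mathbb{E}[b^2] + \mathbb{E}[w^2]\,(x - c)(x' - c) = \sigma_b^2 + \sigma_v^2 (x - c)(x' - c) = k_{Lin}(x, x')$$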
The Polynomial Kernel
$$k_{Poly}(\mathbf{x}, \mathbf{x}') = (\sigma^2 + \mathbf{x}^T\mathbf{x}')^d$$
where $d$ is the polynomial degree. This produces functions that are polynomials of degree $d$.
Many classical machine learning models can be viewed as GPs with specific kernels:

- Linear regression → GP with linear kernel
- Polynomial regression → GP with polynomial kernel
- Neural networks (infinite width) → GP with specific kernels (Neal, 1996)
This connection provides a unified framework for understanding different models.
```python
import numpy as np


class LinearKernel:
    """
    Linear Kernel: produces linear (straight-line) functions.

    This kernel is equivalent to Bayesian linear regression. Combining
    it with other kernels (e.g., RBF) produces functions with linear
    trends plus local variations.

    Parameters:
    -----------
    variance : float
        The signal variance controlling the slope magnitude
    offset : float
        Constant offset (c parameter, typically 0)
    """

    def __init__(self, variance: float = 1.0, offset: float = 0.0):
        self.variance = variance
        self.offset = offset

    def __call__(self, X1: np.ndarray, X2: np.ndarray) -> np.ndarray:
        """Compute the linear kernel matrix."""
        if X1.ndim == 1:
            X1 = X1.reshape(-1, 1)
        if X2.ndim == 1:
            X2 = X2.reshape(-1, 1)

        # Center the data
        X1_centered = X1 - self.offset
        X2_centered = X2 - self.offset

        # Linear kernel is just the dot product
        K = self.variance * (X1_centered @ X2_centered.T)

        return K


class PolynomialKernel:
    """
    Polynomial Kernel: produces polynomial functions of degree d.

    Parameters:
    -----------
    degree : int
        The polynomial degree
    scale : float
        Scale parameter σ²
    offset : float
        Offset parameter (often 1.0)
    """

    def __init__(self, degree: int = 2, scale: float = 1.0, offset: float = 1.0):
        self.degree = degree
        self.scale = scale
        self.offset = offset

    def __call__(self, X1: np.ndarray, X2: np.ndarray) -> np.ndarray:
        """Compute the polynomial kernel matrix."""
        if X1.ndim == 1:
            X1 = X1.reshape(-1, 1)
        if X2.ndim == 1:
            X2 = X2.reshape(-1, 1)

        # (σ² + x·x')^d
        dot_product = X1 @ X2.T
        K = (self.scale + dot_product) ** self.degree

        return K


def compare_polynomial_degrees():
    """Visualize samples from polynomial kernels of different degrees."""
    import matplotlib.pyplot as plt

    np.random.seed(42)
    X_test = np.linspace(-2, 2, 200).reshape(-1, 1)

    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    titles = ['Linear (d=1)', 'Quadratic (d=2)', 'Cubic (d=3)', 'Quartic (d=4)']

    for ax, degree, title in zip(axes.flatten(), [1, 2, 3, 4], titles):
        kernel = PolynomialKernel(degree=degree, scale=1.0)
        K = kernel(X_test, X_test)

        # Add jitter for stability
        K += 1e-4 * np.eye(len(X_test))

        # Cholesky may fail for high-degree polynomials due to numerical issues
        try:
            L = np.linalg.cholesky(K)
            samples = L @ np.random.randn(len(X_test), 3)
        except np.linalg.LinAlgError:
            # Use eigendecomposition instead
            eigvals, eigvecs = np.linalg.eigh(K)
            eigvals = np.maximum(eigvals, 1e-10)
            samples = eigvecs @ np.diag(np.sqrt(eigvals)) @ np.random.randn(len(X_test), 3)

        for i in range(3):
            ax.plot(X_test, samples[:, i], alpha=0.8, linewidth=1.5)

        ax.set_title(f'Polynomial Kernel ({title})', fontsize=12)
        ax.set_xlabel('x')
        ax.set_ylabel('f(x)')
        ax.grid(True, alpha=0.3)

    plt.suptitle('Polynomial Kernel: Effect of Degree on Function Shape', fontsize=14)
    plt.tight_layout()
    plt.savefig('polynomial_kernel_degrees.png', dpi=150, bbox_inches='tight')
    plt.show()


if __name__ == "__main__":
    compare_polynomial_degrees()
```

The Rational Quadratic (RQ) kernel can be viewed as a scale mixture of RBF kernels with different length scales. This makes it useful when you're uncertain about the correct length scale, or when the function exhibits variation at multiple scales simultaneously.
$$k_{RQ}(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \left(1 + \frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\alpha\ell^2}\right)^{-\alpha}$$
where $\alpha > 0$ controls the mixture of length scales, $\ell$ is the (central) length scale, and $\sigma_f^2$ is the signal variance.
The parameter $\alpha$ controls the relative weighting of different length scales:

- $\alpha \to \infty$: the RQ kernel converges to the RBF kernel
- Small $\alpha$: the kernel gives more weight to larger-scale correlations
- Large $\alpha$: the kernel behaves more like the standard RBF
This makes the RQ kernel more flexible than the pure RBF kernel while maintaining smoothness.
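The limiting behavior as $\alpha \to \infty$ is easy to verify numerically. The small sketch below uses standalone helper functions (`k_rbf`, `k_rq`, written just for this check, not part of any class above) to measure the maximum gap between the two kernels over a grid of distances.

```python
import numpy as np


def k_rbf(r, ell=1.0, sigma2=1.0):
    """RBF kernel as a function of distance r."""
    return sigma2 * np.exp(-r**2 / (2 * ell**2))


def k_rq(r, alpha, ell=1.0, sigma2=1.0):
    """Rational Quadratic kernel as a function of distance r."""
    return sigma2 * (1 + r**2 / (2 * alpha * ell**2)) ** (-alpha)


r = np.linspace(0, 5, 200)
for alpha in [0.5, 2.0, 10.0, 1000.0]:
    max_gap = np.max(np.abs(k_rq(r, alpha) - k_rbf(r)))
    print(f"alpha = {alpha:7.1f}: max |k_RQ - k_RBF| = {max_gap:.4f}")
# The gap shrinks toward zero as alpha grows, matching the alpha -> infinity limit.
```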
Mathematical Insight: The Scale Mixture Interpretation
The RQ kernel can be derived as an infinite mixture of RBF kernels:
$$k_{RQ}(r) = \int_0^\infty k_{RBF}(r; \ell') \, p(\ell' \mid \alpha, \ell) \, d\ell'$$
where the length scales $\ell'$ are drawn from an inverse Gamma distribution. This means the RQ kernel automatically considers a range of possible length scales, weighted according to the inverse Gamma prior.
This property is particularly useful when your data exhibits variation at several length scales simultaneously (for example, short-term fluctuations superimposed on long-term trends), or when you are genuinely uncertain which single length scale is appropriate.
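A small Monte Carlo sketch of the mixture interpretation follows. It re-parameterizes the mixture in terms of the precision $\tau = 1/\ell'^2$ (an equivalent, numerically convenient assumption: the inverse Gamma prior on $\ell'^2$ becomes a Gamma prior on $\tau$ with shape $\alpha$ and rate $\alpha\ell^2$), averages RBF kernels over sampled precisions, and compares the result with the closed-form RQ kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, ell, sigma2 = 2.0, 1.0, 1.0

# Sample precisions tau = 1/ell'^2 from Gamma(shape=alpha, rate=alpha*ell^2).
# (numpy parameterizes Gamma by shape and scale = 1/rate.)
tau = rng.gamma(shape=alpha, scale=1.0 / (alpha * ell**2), size=200_000)

r = np.linspace(0, 4, 9)

# Monte Carlo average of RBF kernels exp(-tau * r^2 / 2) over sampled precisions
k_mix = sigma2 * np.mean(np.exp(-np.outer(r**2, tau) / 2), axis=1)

# Closed-form Rational Quadratic kernel for comparison
k_rq = sigma2 * (1 + r**2 / (2 * alpha * ell**2)) ** (-alpha)

for ri, km, kq in zip(r, k_mix, k_rq):
    print(f"r = {ri:.1f}: mixture average = {km:.4f}, RQ closed form = {kq:.4f}")
```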
Designing the right kernel is both an art and a science. It requires balancing mathematical constraints with domain knowledge and empirical validation.
You now understand the foundational principles of kernel design for Gaussian Processes. You can recognize the mathematical requirements for valid kernels, understand what properties different kernels encode, and make informed choices based on domain knowledge. Next, we'll explore how to combine kernels to create more expressive models that capture complex patterns in data.