One of the most powerful aspects of Gaussian Process modeling is that valid kernels can be combined to create new valid kernels. This compositional property lets us build sophisticated covariance structures from simple building blocks, encoding rich prior knowledge about the function we're modeling.
Consider modeling temperature over a year: it has a smooth daily cycle, a longer seasonal cycle, and a gradual multi-year trend. Rather than searching for a single kernel that captures all this structure, we can compose kernels: a periodic kernel for daily patterns, another for yearly patterns, and a linear kernel for trend. Each component remains interpretable while the composite captures the full complexity.
This page explores the mathematical foundations and practical techniques for kernel composition—a skill that separates novice GP practitioners from experts.
By the end of this page, you will understand: (1) the closure properties of positive semi-definite kernels, (2) how kernel addition models independent effects, (3) how kernel multiplication handles interactions and conditional structure, (4) advanced composition patterns for complex phenomena, (5) the kernel grammar approach to automated structure discovery.
Before combining kernels, we must understand which operations preserve validity. A kernel is valid (positive semi-definite) if for any set of points, the resulting Gram matrix has non-negative eigenvalues. Some operations preserve this property; others destroy it.
Fundamental Closure Properties
Let $k_1$ and $k_2$ be valid (PSD) kernels, and let $c > 0$ be a positive constant. The following operations yield valid kernels:
| Operation | Formula | Valid Kernel? |
|---|---|---|
| Addition | $k(x, x') = k_1(x, x') + k_2(x, x')$ | ✓ Yes |
| Scalar multiplication | $k(x, x') = c \cdot k_1(x, x')$ | ✓ Yes |
| Multiplication (product) | $k(x, x') = k_1(x, x') \cdot k_2(x, x')$ | ✓ Yes |
| Point-wise limit | $k(x, x') = \lim_{n \to \infty} k_n(x, x')$ | ✓ Yes (if limit exists) |
| Exponentiation | $k(x, x') = \exp(k_1(x, x'))$ | ✓ Yes |
| Polynomial with positive coefficients | $k(x, x') = \sum_i a_i k_1(x, x')^i$ for $a_i \geq 0$ | ✓ Yes |
| Function composition | $k(x, x') = k_1(g(x), g(x'))$ for any function $g$ | ✓ Yes |
| Linear combination | $k(x, x') = \alpha k_1 + \beta k_2$ for $\alpha, \beta > 0$ | ✓ Yes |
Subtraction of kernels generally does NOT produce a valid kernel. Even if k₁(x,x') > k₂(x,x') for all x, x', the difference k₁ - k₂ may not be PSD. Similarly, division k₁/k₂ is generally not valid. Always stick to addition, multiplication, and positive scaling.
Why Do These Properties Hold?
Addition: If $\mathbf{K}_1$ and $\mathbf{K}_2$ are PSD matrices, then $\mathbf{K}_1 + \mathbf{K}_2$ is PSD because: $$\mathbf{v}^T(\mathbf{K}_1 + \mathbf{K}_2)\mathbf{v} = \mathbf{v}^T\mathbf{K}_1\mathbf{v} + \mathbf{v}^T\mathbf{K}_2\mathbf{v} \geq 0$$
Multiplication: The element-wise product of PSD matrices (Hadamard product) is PSD. This follows from the Schur product theorem: if $\mathbf{K}_1$ and $\mathbf{K}_2$ are PSD, then $\mathbf{K}_1 \circ \mathbf{K}_2$ (element-wise) is PSD.
Exponentiation: Follows from the power series $\exp(k) = \sum_{n=0}^{\infty} k^n/n!$, where each term $k^n$ is PSD (by repeated multiplication), and the sum converges.
These properties give us a grammar for building kernels: combine primitives (RBF, Matérn, Linear, Periodic) using valid operations to construct arbitrarily complex covariance structures.
```python
import numpy as np
from numpy.linalg import eigvalsh

def check_psd(K, tolerance=1e-8):
    """
    Check if a matrix is positive semi-definite.

    A matrix K is PSD if all eigenvalues are >= 0. In practice, we allow
    small negative eigenvalues due to numerical error.
    """
    eigenvalues = eigvalsh(K)  # Eigenvalues of symmetric matrix
    min_eig = np.min(eigenvalues)
    is_psd = min_eig >= -tolerance
    return is_psd, min_eig

def demonstrate_closure_properties():
    """
    Demonstrate that kernel operations preserve the PSD property.
    """
    # Generate random test points
    np.random.seed(42)
    n = 50
    X = np.random.randn(n, 2)

    # Define two base kernels
    def rbf_kernel(X, length_scale=1.0):
        sq_dists = np.sum((X[:, np.newaxis, :] - X[np.newaxis, :, :])**2, axis=2)
        return np.exp(-sq_dists / (2 * length_scale**2))

    def linear_kernel(X):
        return X @ X.T

    K_rbf = rbf_kernel(X, length_scale=1.0)
    K_linear = linear_kernel(X)

    print("Base kernels:")
    is_psd, min_eig = check_psd(K_rbf)
    print(f"  RBF kernel: PSD={is_psd}, min eigenvalue={min_eig:.2e}")
    is_psd, min_eig = check_psd(K_linear)
    print(f"  Linear kernel: PSD={is_psd}, min eigenvalue={min_eig:.2e}")

    # Test addition
    K_sum = K_rbf + K_linear
    is_psd, min_eig = check_psd(K_sum)
    print(f"\nAddition (RBF + Linear): PSD={is_psd}, min eigenvalue={min_eig:.2e}")

    # Test multiplication (Hadamard product)
    K_prod = K_rbf * K_linear
    is_psd, min_eig = check_psd(K_prod)
    print(f"Multiplication (RBF × Linear): PSD={is_psd}, min eigenvalue={min_eig:.2e}")

    # Test scalar multiplication
    K_scaled = 5.0 * K_rbf
    is_psd, min_eig = check_psd(K_scaled)
    print(f"Scalar multiplication (5 × RBF): PSD={is_psd}, min eigenvalue={min_eig:.2e}")

    # Test exponentiation
    K_exp = np.exp(K_rbf)  # Note: exp of a valid kernel, not exp(-dist^2)
    is_psd, min_eig = check_psd(K_exp)
    print(f"Exponentiation (exp(RBF)): PSD={is_psd}, min eigenvalue={min_eig:.2e}")

    # WARNING: Test subtraction (NOT a valid operation)
    K_diff = K_rbf - 0.5 * K_linear
    is_psd, min_eig = check_psd(K_diff)
    print(f"\n⚠️ Subtraction (RBF - 0.5×Linear): PSD={is_psd}, min eigenvalue={min_eig:.2e}")
    print("   Note: Subtraction does NOT preserve the PSD property!")

demonstrate_closure_properties()
```
Kernel addition is the most common composition operation. When we add kernels, we model the function as a sum of independent components, each with its own characteristic structure.
Mathematical Interpretation
If $k(\mathbf{x}, \mathbf{x}') = k_1(\mathbf{x}, \mathbf{x}') + k_2(\mathbf{x}, \mathbf{x}')$, then samples from the GP can be written as:
$$f(\mathbf{x}) = f_1(\mathbf{x}) + f_2(\mathbf{x})$$
where $f_1 \sim \mathcal{GP}(0, k_1)$ and $f_2 \sim \mathcal{GP}(0, k_2)$ are independent Gaussian Processes.
Intuition: Each component contributes independently to the overall function, and the posterior can be decomposed into per-component contributions, attributing variance to each source.
Common Patterns Using Addition
| Pattern | Composition | Interpretation | Example Application |
|---|---|---|---|
| Trend + Residual | Linear + Matérn | Systematic trend plus correlated noise | Economic time series, growth models |
| Multi-scale smoothness | RBF(ℓ₁) + RBF(ℓ₂) | Long-range + short-range correlations | Spatial fields with multiple scales |
| Signal + Noise | Matérn + White | Smooth signal plus i.i.d. observation noise | Standard regression setup |
| Periodic + Aperiodic | Periodic + RBF | Cyclic pattern plus smooth deviations | Seasonal data with anomalies |
| Global + Local | Linear + Local Periodic | Overall direction plus local oscillations | Stock prices, climate data |
```python
import numpy as np
import matplotlib.pyplot as plt

def rbf_kernel(X1, X2, sigma_sq=1.0, length_scale=1.0):
    """Standard RBF (Squared Exponential) kernel for 1-D inputs."""
    X1, X2 = np.atleast_1d(X1).flatten(), np.atleast_1d(X2).flatten()
    sq_dists = np.subtract.outer(X1, X2)**2
    return sigma_sq * np.exp(-sq_dists / (2 * length_scale**2))

def linear_kernel(X1, X2, sigma_sq=1.0, offset=0.0):
    """Linear (Bayesian linear regression) kernel."""
    X1 = np.atleast_1d(X1).flatten() - offset
    X2 = np.atleast_1d(X2).flatten() - offset
    return sigma_sq * np.outer(X1, X2)

def periodic_kernel(X1, X2, sigma_sq=1.0, period=1.0, length_scale=1.0):
    """Periodic kernel for cyclic patterns."""
    X1, X2 = np.atleast_1d(X1).flatten(), np.atleast_1d(X2).flatten()
    dists = np.abs(np.subtract.outer(X1, X2))
    return sigma_sq * np.exp(-2 * (np.sin(np.pi * dists / period) / length_scale)**2)

def white_noise_kernel(n, sigma_sq=0.1):
    """White noise (diagonal) kernel for observation noise."""
    return sigma_sq * np.eye(n)

def demonstrate_additive_decomposition():
    """
    Demonstrate how additive kernels decompose into interpretable components.
    """
    X = np.linspace(0, 10, 200)
    n = len(X)

    # Define component kernels
    K_trend = linear_kernel(X, X, sigma_sq=0.5, offset=5.0)
    K_smooth = rbf_kernel(X, X, sigma_sq=1.0, length_scale=2.0)
    K_periodic = periodic_kernel(X, X, sigma_sq=0.5, period=2.0, length_scale=0.5)
    K_noise = white_noise_kernel(n, sigma_sq=0.1)

    # Combined kernel
    K_total = K_trend + K_smooth + K_periodic + K_noise
    K_total += 1e-6 * np.eye(n)  # Numerical stability

    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    np.random.seed(42)

    # Sample each component independently
    components = {
        'Trend (Linear)': K_trend,
        'Smooth (RBF)': K_smooth,
        'Periodic': K_periodic,
    }

    samples = {}
    col = 0
    for name, K in components.items():
        K_stable = K + 1e-6 * np.eye(n)
        L = np.linalg.cholesky(K_stable)
        z = np.random.randn(n)
        samples[name] = L @ z  # Save for reuse

        axes[0, col].plot(X, samples[name], 'b-', linewidth=2)
        axes[0, col].set_title(f'Component: {name}')
        axes[0, col].set_xlabel('x')
        axes[0, col].set_ylabel('f(x)')
        axes[0, col].grid(True, alpha=0.3)
        col += 1

    # Combined sample (sum of components + noise)
    noise = np.sqrt(0.1) * np.random.randn(n)
    f_total = (samples['Trend (Linear)'] + samples['Smooth (RBF)']
               + samples['Periodic'] + noise)

    axes[1, 0].plot(X, f_total, 'k-', linewidth=2)
    axes[1, 0].set_title('Combined Function: Trend + Smooth + Periodic + Noise')
    axes[1, 0].set_xlabel('x')
    axes[1, 0].set_ylabel('f(x)')
    axes[1, 0].grid(True, alpha=0.3)

    # Show the stacked contributions
    axes[1, 1].fill_between(X, 0, samples['Trend (Linear)'], alpha=0.7, label='Trend')
    axes[1, 1].fill_between(X, samples['Trend (Linear)'],
                            samples['Trend (Linear)'] + samples['Smooth (RBF)'],
                            alpha=0.7, label='Smooth')
    axes[1, 1].fill_between(X, samples['Trend (Linear)'] + samples['Smooth (RBF)'],
                            samples['Trend (Linear)'] + samples['Smooth (RBF)'] + samples['Periodic'],
                            alpha=0.7, label='Periodic')
    axes[1, 1].set_title('Stacked Component Contributions')
    axes[1, 1].set_xlabel('x')
    axes[1, 1].set_ylabel('f(x)')
    axes[1, 1].legend()
    axes[1, 1].grid(True, alpha=0.3)

    # Covariance matrix of combined kernel
    im = axes[1, 2].imshow(K_total, extent=[0, 10, 10, 0], cmap='viridis')
    axes[1, 2].set_title('Combined Covariance Matrix')
    axes[1, 2].set_xlabel("x'")
    axes[1, 2].set_ylabel('x')
    plt.colorbar(im, ax=axes[1, 2])

    plt.suptitle('Additive Kernel Composition: Decomposing Complex Functions', fontsize=14)
    plt.tight_layout()
    return fig

def posterior_decomposition_example():
    """
    Show how posterior inference decomposes additive components.

    After observing data, we can attribute variance to each source.
    """
    # Generate synthetic data from known components
    np.random.seed(42)
    X_train = np.sort(np.random.uniform(0, 10, 30))

    # True components
    true_trend = 0.5 * (X_train - 5)
    true_periodic = 1.5 * np.sin(2 * np.pi * X_train / 3)
    noise = 0.5 * np.random.randn(len(X_train))
    y_train = true_trend + true_periodic + noise

    # Fit GP with additive kernel and decompose posterior
    # (A full implementation would compute the posterior for each component)
    print("Posterior decomposition allows:")
    print("1. Estimating the trend component separately from oscillations")
    print("2. Quantifying uncertainty in each component")
    print("3. Forecasting each component independently")
    print("4. Identifying which component explains most variance")
    return X_train, y_train

demonstrate_additive_decomposition()
```
A powerful property of additive GPs: after observing data, we can decompose the posterior into contributions from each component. This is not just conceptual: we can compute E[f₁|y] and Var[f₁|y] separately from f₂, attributing the observed variation to interpretable sources.
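The decomposition itself is a small linear-algebra step. The sketch below (reusing the kernel helpers defined above, and assuming a zero-mean additive model $y = f_1 + \dots + f_m + \varepsilon$ with i.i.d. noise) computes the per-component posterior at the training inputs via $\mathbb{E}[f_j \mid \mathbf{y}] = \mathbf{K}_j \mathbf{K}_{\text{total}}^{-1}\mathbf{y}$ and $\text{Cov}[f_j \mid \mathbf{y}] = \mathbf{K}_j - \mathbf{K}_j \mathbf{K}_{\text{total}}^{-1} \mathbf{K}_j$:

```python
import numpy as np

def decompose_additive_posterior(K_list, y, noise_var=0.25):
    """
    Posterior mean and per-point variance of each additive component
    at the training inputs.

    Assumes a zero-mean GP model y = f_1 + ... + f_m + eps, where component j
    has prior covariance K_list[j] and eps is i.i.d. noise with variance
    noise_var. For each component:
        E[f_j | y]   = K_j (K_total)^-1 y
        Cov[f_j | y] = K_j - K_j (K_total)^-1 K_j
    """
    n = len(y)
    K_total = sum(K_list) + noise_var * np.eye(n)
    alpha = np.linalg.solve(K_total, y)          # (K_total)^-1 y, shared by all components
    results = []
    for K_j in K_list:
        mean_j = K_j @ alpha
        cov_j = K_j - K_j @ np.linalg.solve(K_total, K_j)
        results.append((mean_j, np.diag(cov_j)))  # per-point mean and variance
    return results

# Hypothetical usage with the data and kernels from the listing above:
# X_train, y_train = posterior_decomposition_example()
# K1 = linear_kernel(X_train, X_train, sigma_sq=0.5, offset=5.0)
# K2 = periodic_kernel(X_train, X_train, sigma_sq=1.0, period=3.0, length_scale=0.5)
# (trend_mean, trend_var), (per_mean, per_var) = decompose_additive_posterior([K1, K2], y_train)
```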
Kernel multiplication creates more complex interactions between components. Unlike addition, products don't decompose into independent pieces—they model situations where components modulate each other.
Mathematical Interpretation
If $k(\mathbf{x}, \mathbf{x}') = k_1(\mathbf{x}, \mathbf{x}') \cdot k_2(\mathbf{x}, \mathbf{x}')$, this is NOT equivalent to $f = f_1 \cdot f_2$ for independent GPs. Instead, the product kernel requires both kernels to indicate correlation for the product to be high.
Intuition: Think of $k_1$ as a 'gate' for $k_2$. If $k_1(\mathbf{x}, \mathbf{x}') = 0$, then the product is zero regardless of $k_2$. This creates sophisticated dependency structures.
Key Product Kernel Patterns
| Pattern | Composition | Effect | Use Case |
|---|---|---|---|
| Locally Periodic | Periodic × RBF | Periodicity that fades with distance | Quasi-periodic phenomena, music, speech |
| Localized Features | RBF(small ℓ) × RBF(large ℓ) | Features only exist within a region | Spatial heterogeneity |
| Damped Linear | Linear × RBF | Trend that applies only locally | Local trends in non-stationary data |
| Tensor Product | k₁(x₁)×k₂(x₂) | Separable multi-dimensional kernel | Space-time, multi-output GPs |
| ARD (Automatic Relevance Determination) | ∏ᵢ kᵢ(xᵢ, x'ᵢ) | Different length-scale per dimension | Feature selection, high-dimensional data |
```python
import numpy as np
import matplotlib.pyplot as plt

def locally_periodic_kernel(X1, X2, sigma_sq=1.0, period=1.0,
                            length_scale_periodic=0.5, length_scale_decay=5.0):
    """
    Locally Periodic = Periodic × RBF

    Creates patterns that repeat but fade with distance. The RBF acts as an
    envelope that modulates the periodic structure.
    """
    X1 = np.atleast_1d(X1).flatten()
    X2 = np.atleast_1d(X2).flatten()

    # Periodic component
    dists = np.abs(np.subtract.outer(X1, X2))
    K_periodic = np.exp(-2 * (np.sin(np.pi * dists / period) / length_scale_periodic)**2)

    # RBF decay envelope
    sq_dists = np.subtract.outer(X1, X2)**2
    K_rbf = np.exp(-sq_dists / (2 * length_scale_decay**2))

    # Product kernel
    return sigma_sq * K_periodic * K_rbf

def demonstrate_product_vs_sum():
    """
    Compare additive vs multiplicative composition for Periodic and RBF.
    """
    X = np.linspace(0, 20, 300)
    n = len(X)

    # Define base kernels
    def periodic(X1, X2, period=3.0, ell=0.5):
        dists = np.abs(np.subtract.outer(X1, X2))
        return np.exp(-2 * (np.sin(np.pi * dists / period) / ell)**2)

    def rbf(X1, X2, ell=5.0):
        sq_dists = np.subtract.outer(X1, X2)**2
        return np.exp(-sq_dists / (2 * ell**2))

    K_per = periodic(X, X)
    K_rbf = rbf(X, X)

    # Additive: Periodic + RBF
    K_sum = K_per + K_rbf

    # Multiplicative: Periodic × RBF
    K_prod = K_per * K_rbf

    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    np.random.seed(42)

    # Sample from periodic kernel
    L_per = np.linalg.cholesky(K_per + 1e-6 * np.eye(n))
    f_per = L_per @ np.random.randn(n)
    axes[0, 0].plot(X, f_per)
    axes[0, 0].set_title('Periodic Kernel Samples')
    axes[0, 0].grid(True, alpha=0.3)

    # Sample from RBF kernel
    L_rbf = np.linalg.cholesky(K_rbf + 1e-6 * np.eye(n))
    f_rbf = L_rbf @ np.random.randn(n)
    axes[0, 1].plot(X, f_rbf)
    axes[0, 1].set_title('RBF Kernel Samples')
    axes[0, 1].grid(True, alpha=0.3)

    # Covariance comparison
    axes[0, 2].plot(X, K_per[n//2, :], label='Periodic', linewidth=2)
    axes[0, 2].plot(X, K_rbf[n//2, :], label='RBF', linewidth=2)
    axes[0, 2].axvline(X[n//2], color='gray', linestyle='--', alpha=0.5)
    axes[0, 2].set_title('Covariance from Midpoint')
    axes[0, 2].legend()
    axes[0, 2].grid(True, alpha=0.3)

    # Sample from SUM kernel (Periodic + RBF)
    L_sum = np.linalg.cholesky(K_sum + 1e-6 * np.eye(n))
    for i in range(5):
        f_sum = L_sum @ np.random.randn(n)
        axes[1, 0].plot(X, f_sum, alpha=0.7)
    axes[1, 0].set_title('SUM: Periodic + RBF\n(Independent components)')
    axes[1, 0].grid(True, alpha=0.3)

    # Sample from PRODUCT kernel (Periodic × RBF)
    L_prod = np.linalg.cholesky(K_prod + 1e-6 * np.eye(n))
    for i in range(5):
        f_prod = L_prod @ np.random.randn(n)
        axes[1, 1].plot(X, f_prod, alpha=0.7)
    axes[1, 1].set_title('PRODUCT: Periodic × RBF\n(Locally periodic = decaying oscillations)')
    axes[1, 1].grid(True, alpha=0.3)

    # Covariance comparison for composed kernels
    axes[1, 2].plot(X, K_sum[n//2, :], label='Sum', linewidth=2)
    axes[1, 2].plot(X, K_prod[n//2, :], label='Product', linewidth=2)
    axes[1, 2].axvline(X[n//2], color='gray', linestyle='--', alpha=0.5)
    axes[1, 2].set_title('Composed Kernels: Covariance from Midpoint')
    axes[1, 2].legend()
    axes[1, 2].grid(True, alpha=0.3)

    plt.suptitle('Addition vs Multiplication: Fundamentally Different Behaviors', fontsize=14)
    plt.tight_layout()
    return fig

def ard_kernel_example():
    """
    Automatic Relevance Determination: product of per-dimension kernels.

    This is crucial for feature selection in high-dimensional problems.
    """
    print("ARD Kernel (Automatic Relevance Determination):")
    print("=" * 50)
    print()
    print("k_ARD(x, x') = σ² · exp(-Σᵢ (xᵢ - x'ᵢ)² / (2ℓᵢ²))")
    print()
    print("Each dimension i has its own length-scale ℓᵢ")
    print()
    print("Feature Selection Interpretation:")
    print("- Large ℓᵢ → dimension i is irrelevant (flat kernel)")
    print("- Small ℓᵢ → dimension i is highly relevant (sensitive)")
    print()
    print("In practice: optimize all ℓᵢ via marginal likelihood")
    print("Irrelevant features get ℓᵢ → ∞ automatically")

    # Demonstrate in 2D with different relevances
    np.random.seed(42)
    n = 100
    X = np.random.randn(n, 2)

    # Dimension 1 relevant (small length-scale), Dimension 2 irrelevant (large)
    length_scales = np.array([0.5, 10.0])

    def ard_kernel(X1, X2, length_scales, sigma_sq=1.0):
        """ARD-RBF kernel."""
        X1_scaled = X1 / length_scales
        X2_scaled = X2 / length_scales
        sq_dists = np.sum((X1_scaled[:, np.newaxis, :] - X2_scaled[np.newaxis, :, :])**2, axis=2)
        return sigma_sq * np.exp(-sq_dists / 2)

    K = ard_kernel(X, X, length_scales)

    print(f"\nExample: length_scales = {length_scales}")
    print(f"Dimension 1 (ℓ={length_scales[0]}): RELEVANT")
    print(f"Dimension 2 (ℓ={length_scales[1]}): IRRELEVANT")
    return K

demonstrate_product_vs_sum()
```
The Locally Periodic Kernel in Detail
The locally periodic pattern deserves special attention as it's one of the most useful compositions:
$$k_{\text{LP}}(x, x') = k_{\text{Periodic}}(x, x') \cdot k_{\text{RBF}}(x, x')$$
This models patterns that repeat with a fixed period but whose shape and amplitude can drift: nearby cycles are strongly correlated, while cycles far apart become effectively independent.
Typical applications include quasi-periodic signals such as music and speech, and seasonal data whose cycles evolve over time.
The RBF component acts as a correlation envelope: patterns can repeat within this envelope, but distant points are uncorrelated regardless of their periodic relationship.
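Written out with the parameterization used in locally_periodic_kernel above (period $p$, periodic length-scale $\ell_p$, decay length-scale $\ell_d$; the symbols are simply names for those code parameters), the product is:

$$k_{\text{LP}}(x, x') = \sigma^2 \exp\!\left(-\frac{2\sin^2\!\big(\pi |x - x'| / p\big)}{\ell_p^2}\right) \exp\!\left(-\frac{(x - x')^2}{2\ell_d^2}\right)$$

As $|x - x'|$ grows, the second factor drives the covariance toward zero regardless of the periodic term, which is exactly the envelope behavior described above.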
Beyond simple addition and multiplication, sophisticated compositions address specific modeling challenges. This section covers patterns that arise frequently in real applications.
Pattern 1: Change Points
Model functions that behave differently before and after a change point $c$:
$$k_{\text{CP}}(x, x') = k_1(x, x') \cdot (1-\sigma(x))(1-\sigma(x')) + k_2(x, x') \cdot \sigma(x)\sigma(x')$$
Where $\sigma(x) = \text{sigmoid}((x - c)/s)$ is a smooth transition function.
Before $c$, behavior is governed by $k_1$; after $c$, by $k_2$; near $c$, the two regimes are smoothly interpolated.
Pattern 2: Hierarchical / Multi-Level Structure
For data with nested structure (e.g., measurements within subjects within groups):
$$k(x, x') = k_{\text{group}}(g(x), g(x')) + k_{\text{subject}}(s(x), s(x')) + k_{\text{residual}}(x, x')$$
Where $g(x)$ and $s(x)$ extract group and subject indicators.
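As a minimal sketch of this pattern (the [x, group_id, subject_id] column layout and all variance parameters below are illustrative assumptions, not any particular library's API), the group and subject terms can be implemented as indicator ("block") kernels, which are valid PSD kernels, with an RBF residual over the continuous input:

```python
import numpy as np

def hierarchical_kernel(Z1, Z2, sigma_group=1.0, sigma_subject=0.5,
                        sigma_res=0.3, length_scale=1.0):
    """
    Sketch of a hierarchical (multi-level) kernel.

    Each row of Z is assumed to be [x, group_id, subject_id]. The group and
    subject components use indicator ("block") kernels, and the residual is
    an RBF over the continuous input x.
    """
    x1, g1, s1 = Z1[:, 0], Z1[:, 1], Z1[:, 2]
    x2, g2, s2 = Z2[:, 0], Z2[:, 1], Z2[:, 2]

    K_group = sigma_group**2 * (g1[:, None] == g2[None, :])      # shared group effect
    K_subject = sigma_subject**2 * (s1[:, None] == s2[None, :])  # shared subject effect
    sq_dists = (x1[:, None] - x2[None, :])**2
    K_res = sigma_res**2 * np.exp(-sq_dists / (2 * length_scale**2))
    return K_group + K_subject + K_res

# Hypothetical usage: Z = np.column_stack([x_values, group_ids, subject_ids])
# K = hierarchical_kernel(Z, Z)
```

Because the three terms are added, the posterior again decomposes into group-level, subject-level, and residual contributions, exactly as in the additive examples earlier.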
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import expit  # Sigmoid function

def changepoint_kernel(X1, X2, k1_fn, k2_fn, changepoint=5.0, steepness=2.0):
    """
    Changepoint kernel: different behavior before/after a transition.

    k_CP(x,x') = k₁(x,x')·(1-σ(x))(1-σ(x')) + k₂(x,x')·σ(x)σ(x')

    Parameters
    ----------
    k1_fn, k2_fn : callables
        Kernel functions for before/after the changepoint
    changepoint : float
        Location of the transition
    steepness : float
        How sharp the transition is (higher = sharper)
    """
    X1 = np.atleast_1d(X1).flatten()
    X2 = np.atleast_1d(X2).flatten()

    # Transition function (sigmoid)
    sigma1 = expit(steepness * (X1 - changepoint))
    sigma2 = expit(steepness * (X2 - changepoint))

    # Create weight matrices
    w_after = np.outer(sigma1, sigma2)           # After changepoint
    w_before = np.outer(1 - sigma1, 1 - sigma2)  # Before changepoint

    # Combine kernels
    K1 = k1_fn(X1, X2)
    K2 = k2_fn(X1, X2)
    return K1 * w_before + K2 * w_after

def demonstrate_changepoint():
    """
    Show how a changepoint kernel transitions between regimes.
    """
    X = np.linspace(0, 10, 200)
    n = len(X)
    changepoint = 5.0

    # Define two different kernels for before/after
    def k_smooth(X1, X2):
        """Smooth, slowly-varying behavior."""
        sq_dists = np.subtract.outer(X1, X2)**2
        return np.exp(-sq_dists / (2 * 2.0**2))   # Large length-scale

    def k_wiggly(X1, X2):
        """Rapidly varying behavior."""
        sq_dists = np.subtract.outer(X1, X2)**2
        return np.exp(-sq_dists / (2 * 0.3**2))   # Small length-scale

    K_cp = changepoint_kernel(X, X, k_smooth, k_wiggly, changepoint=5.0, steepness=2.0)
    K_cp += 1e-6 * np.eye(n)

    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    # Samples from changepoint kernel
    np.random.seed(42)
    L = np.linalg.cholesky(K_cp)
    for i in range(5):
        sample = L @ np.random.randn(n)
        axes[0].plot(X, sample, alpha=0.7)
    axes[0].axvline(changepoint, color='red', linestyle='--', label='Changepoint')
    axes[0].set_title('Changepoint Kernel: Smooth → Wiggly')
    axes[0].set_xlabel('x')
    axes[0].set_ylabel('f(x)')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)

    # Covariance matrix
    im = axes[1].imshow(K_cp, extent=[0, 10, 10, 0], cmap='viridis')
    axes[1].axhline(changepoint, color='red', linestyle='--')
    axes[1].axvline(changepoint, color='red', linestyle='--')
    axes[1].set_title('Covariance Matrix')
    axes[1].set_xlabel("x'")
    axes[1].set_ylabel('x')
    plt.colorbar(im, ax=axes[1])

    # Transition weights
    sigma = expit(2.0 * (X - changepoint))
    axes[2].plot(X, 1 - sigma, 'b-', linewidth=2, label='Weight for k₁ (smooth)')
    axes[2].plot(X, sigma, 'r-', linewidth=2, label='Weight for k₂ (wiggly)')
    axes[2].axvline(changepoint, color='gray', linestyle='--', alpha=0.5)
    axes[2].set_title('Transition Weights')
    axes[2].set_xlabel('x')
    axes[2].set_ylabel('Weight')
    axes[2].legend()
    axes[2].grid(True, alpha=0.3)

    plt.suptitle('Changepoint Kernel: Regime-Switching Behavior', fontsize=14)
    plt.tight_layout()
    return fig

def multi_output_kernel():
    """
    Multi-output GPs using kernel composition.

    For modeling multiple related outputs (y₁, y₂, ...) jointly:
        K(x,x') = B ⊗ k(x,x')
    where B is a positive semi-definite matrix of output correlations.
    """
    print("Multi-Output (Vector-Valued) GPs via Kernel Composition")
    print("=" * 55)
    print()
    print("For D outputs, construct a DxD output correlation matrix B:")
    print("  B[i,j] = correlation between output i and output j")
    print()
    print("The full kernel is a Kronecker product:")
    print("  K_full((x,i), (x',j)) = B[i,j] × k_base(x, x')")
    print()
    print("This models:")
    print("  - Each output as a GP with kernel k_base")
    print("  - Cross-output correlations via B")
    print()
    print("Example: Modeling temperature and humidity jointly")
    print("  - k_base = Matérn captures spatial correlation")
    print("  - B captures that temperature and humidity are related")

    # Simple example: 2 outputs, correlation between them encoded in B
    B = np.array([[1.0, 0.7],
                  [0.7, 1.0]])

    # Base kernel for locations
    X = np.array([0, 1, 2, 3, 4])
    K_base = np.exp(-np.subtract.outer(X, X)**2)

    # Full kernel via Kronecker product
    K_full = np.kron(B, K_base)

    print(f"\nBase kernel shape: {K_base.shape}")
    print(f"Output correlation matrix B shape: {B.shape}")
    print(f"Full multi-output kernel shape: {K_full.shape}")
    return K_full

demonstrate_changepoint()
```
The compositional rules for kernels form a grammar—a set of rules for generating valid expressions. This observation led to the Automatic Statistician project and related work on kernel structure search, which automatically discovers interpretable kernel structures from data.
The Grammar
Define a set of base kernels: {RBF, Linear, Periodic, ...}
Define operations: {+, ×}
The kernel grammar generates all valid kernels by recursive application:
K → Base
K → K + K
K → K × K
This generates a (countably infinite) set of kernels: RBF, Linear, RBF+Linear, RBF×Periodic, (RBF+Linear)×Periodic, etc.
Interpretation
Each grammar production has an interpretable meaning: a base kernel names a primitive behavior (smoothness, trend, periodicity, noise), a sum reads as "this effect plus, independently, that effect," and a product reads as "this effect modulated or gated by that effect."
A key benefit of the kernel grammar: discovered structures can be translated to natural language! For example, 'Linear + Periodic × RBF' becomes 'a linear trend plus a periodic pattern that decays over time.' This makes GP models interpretable even when automatically discovered.
```python
import numpy as np
from itertools import product

class KernelExpression:
    """
    Abstract syntax tree for kernel expressions.

    Enables composition and natural language interpretation.
    """
    pass

class BaseKernel(KernelExpression):
    def __init__(self, name, description):
        self.name = name
        self.description = description

    def __repr__(self):
        return self.name

    def describe(self):
        return self.description

class SumKernel(KernelExpression):
    def __init__(self, left, right):
        self.left = left
        self.right = right

    def __repr__(self):
        return f"({self.left} + {self.right})"

    def describe(self):
        return f"{self.left.describe()}, plus independently {self.right.describe()}"

class ProductKernel(KernelExpression):
    def __init__(self, left, right):
        self.left = left
        self.right = right

    def __repr__(self):
        return f"({self.left} × {self.right})"

    def describe(self):
        return f"{self.right.describe()}, but only where {self.left.describe()} applies"

# Define base kernels with interpretations
RBF = BaseKernel("RBF", "smooth, locally correlated variations")
LIN = BaseKernel("Linear", "a linear trend")
PER = BaseKernel("Periodic", "repeating periodic patterns")
WN = BaseKernel("WhiteNoise", "independent noise at each point")

def generate_depth_1_kernels(bases):
    """Generate all kernels of depth 1 (single base)."""
    return bases.copy()

def generate_depth_2_kernels(bases):
    """Generate all kernels of depth 2 (base op base)."""
    results = []
    for b1, b2 in product(bases, repeat=2):
        results.append(SumKernel(b1, b2))
        results.append(ProductKernel(b1, b2))
    return results

def demonstrate_grammar():
    """Show how the grammar generates interpretable expressions."""
    print("Kernel Grammar: Compositional Structure Discovery")
    print("=" * 55)
    print()

    bases = [RBF, LIN, PER]
    print("Base Kernels:")
    for b in bases:
        print(f"  {b.name}: {b.description}")

    print()
    print("Example Composed Kernels and Their Interpretations:")
    print("-" * 55)

    # Manual examples of useful compositions
    examples = [
        SumKernel(LIN, RBF),
        ProductKernel(PER, RBF),
        SumKernel(LIN, ProductKernel(PER, RBF)),
        ProductKernel(LIN, PER),
        SumKernel(SumKernel(LIN, PER), RBF),
    ]

    for expr in examples:
        print(f"\n  Expression: {expr}")
        print(f"  Meaning: The function exhibits {expr.describe()}")

    print()
    print("-" * 55)
    print("KEY INSIGHT: The grammar enables automatic model search!")
    print("Algorithms can enumerate expressions and select the best via")
    print("marginal likelihood, then explain findings in natural language.")

def greedy_kernel_search(X, y, max_depth=3, bases=None):
    """
    Simplified greedy kernel structure search.

    Real implementations (e.g., in the Automatic Statistician line of work) use:
    - Efficient marginal likelihood computation
    - Beam search or MCTS for exploration
    - Regularization to prevent overfitting
    """
    if bases is None:
        bases = [RBF, LIN, PER]

    print("Greedy Kernel Search Algorithm:")
    print("=" * 40)
    print("1. Start with each base kernel")
    print("2. Evaluate marginal likelihood for current best")
    print("3. Try all single expansions: k+base, k×base")
    print("4. Keep best expansion if it improves")
    print("5. Repeat until max depth or no improvement")
    print()
    print("This finds interpretable structure automatically!")

    # Pseudocode for an actual implementation:
    """
    current_best = argmax(bases, key=compute_log_ml)
    for depth in range(max_depth):
        candidates = []
        for base in bases:
            candidates.append(SumKernel(current_best, base))
            candidates.append(ProductKernel(current_best, base))
        new_best = argmax(candidates, key=compute_log_ml)
        if log_ml(new_best) <= log_ml(current_best) + threshold:
            break  # No improvement
        current_best = new_best
    return current_best
    """

demonstrate_grammar()
```
Structure Search Algorithms
Several approaches exist for searching the kernel grammar:
Greedy expansion: Start with base kernels, iteratively add the operation that most improves marginal likelihood
Compositional Kernel Search (CKS): Use Bayesian optimization over the discrete space of kernel structures
Neural architecture search adaptations: Treat kernel structure as a differentiable architecture
Monte Carlo Tree Search (MCTS): Explore the tree of possible kernel expressions using UCB-style exploration
Practical Tools:
Most GP libraries support composition directly, e.g. via kern.add() and kern.prod() for manual composition, or via overloaded + and * operators on kernel objects (see the sketch below).
When to Use Automated Search:
Automated structure search pays off when the data-generating structure is unknown and you want an interpretable model as a starting point; when domain knowledge already suggests a structure, or when repeated marginal-likelihood evaluations would be too expensive, hand-designed kernels are usually the better choice.
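As one concrete illustration of manual composition, the sketch below uses scikit-learn's kernel classes (not necessarily the library the kern.add()/kern.prod() API above refers to); all hyperparameter values are placeholder choices:

```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (
    RBF, ExpSineSquared, DotProduct, WhiteKernel
)

# Trend + locally periodic pattern + observation noise:
#   Linear  +  (Periodic × RBF)  +  WhiteNoise
kernel = (
    DotProduct(sigma_0=1.0)                               # linear trend
    + ExpSineSquared(length_scale=1.0, periodicity=1.0)   # periodic pattern...
      * RBF(length_scale=10.0)                            # ...that decays with distance
    + WhiteKernel(noise_level=0.1)                        # i.i.d. observation noise
)

gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
# gpr.fit(X_train, y_train)  # hyperparameters of all components are optimized jointly
# print(gpr.kernel_)         # the fitted, fully parameterized composite kernel
```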
Designing kernels for a real problem requires balancing domain knowledge, data exploration, and principled model comparison. This section provides a practical workflow.
Signs of a well-designed kernel:
✓ Residuals appear i.i.d. (no spatial/temporal patterns)
✓ Predictions are sensible extrapolations
✓ Uncertainty grows appropriately in data-sparse regions
✓ Kernel structure has a plausible interpretation
✓ Cross-validation confirms the training-set LML (log marginal likelihood)
Warning signs:
✗ Many hyperparameters relative to the number of data points
✗ Very small length-scales (overfitting)
✗ Predictions collapse or explode outside the training range
✗ Cross-validation performance much worse than the training LML suggests
✗ Kernel structure uninterpretable
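A minimal sketch of the last two checks, using scikit-learn with placeholder data and a placeholder kernel (the threshold for "much worse" is a judgment call, not a fixed rule):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Placeholder model and data; in practice substitute your composed kernel and dataset.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(60, 1))
y = np.sin(X[:, 0]) + 0.2 * rng.standard_normal(60)

gpr = GaussianProcessRegressor(kernel=RBF(1.0) + WhiteKernel(0.1), normalize_y=True)
gpr.fit(X, y)

# Training-set fit: the (optimized) log marginal likelihood
print("Training LML:", gpr.log_marginal_likelihood_value_)

# Held-out fit: cross-validated R^2 (any proper held-out score works here)
cv_scores = cross_val_score(gpr, X, y, cv=5)
print("5-fold CV R^2:", cv_scores.mean())

# A great training LML combined with a poor CV score is the overfitting
# warning sign listed above.
```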
Common Mistakes and How to Avoid Them
| Mistake | Symptom | Solution |
|---|---|---|
| Too many components | Poor cross-validation, many small hyperparameters | Remove least important components |
| Ignoring periodicity | Periodic residuals, poor forecasting | Add Periodic×RBF component |
| Wrong periodic frequency | Resonance patterns in predictions | Check data for true period, try learning p |
| Rigid periodicity | Mismatch between model and evolving patterns | Use Locally Periodic (Periodic×RBF) |
| Linear without bound | Unrealistic extrapolation | Cap Linear with saturating function or RBF |
| Ignoring non-stationarity | Spatially-varying residual patterns | Add Gibbs kernel or input warping |
Final Advice: Kernel design is iterative. Expect to go through several cycles of 'try → evaluate → refine'. Document your reasoning so you (and others) can understand the final model later.
Kernel composition is the key to building expressive, interpretable GP models. By combining simple building blocks, we can encode sophisticated prior knowledge about function behavior.
What's Next:
We've seen how to build kernels from components. The next page explores spectral analysis—understanding kernels through their Fourier representations. This perspective provides deep insight into what stationary kernels encode and enables the powerful Spectral Mixture Kernel that can approximate any stationary kernel.
You now understand kernel composition: the mathematical foundations, additive vs multiplicative patterns, advanced compositions like changepoints, and the kernel grammar for automated discovery. This toolkit enables you to design GP models that match the true structure of your data.