One of the most powerful aspects of Gaussian Process modeling is that valid kernels can be combined to create new valid kernels. This compositional property lets us build sophisticated covariance structures from simple building blocks, encoding rich prior knowledge about the function we're modeling.
Consider modeling temperature over a year: it has a smooth daily cycle, a longer seasonal cycle, and a gradual multi-year trend. Rather than searching for a single kernel that captures all this structure, we can compose kernels: a periodic kernel for daily patterns, another for yearly patterns, and a linear kernel for trend. Each component remains interpretable while the composite captures the full complexity.
This page explores the mathematical foundations and practical techniques for kernel composition—a skill that separates novice GP practitioners from experts.
By the end of this page, you will understand: (1) the closure properties of positive semi-definite kernels, (2) how kernel addition models independent effects, (3) how kernel multiplication handles interactions and conditional structure, (4) advanced composition patterns for complex phenomena, (5) the kernel grammar approach to automated structure discovery.
Before combining kernels, we must understand which operations preserve validity. A kernel is valid (positive semi-definite) if for any set of points, the resulting Gram matrix has non-negative eigenvalues. Some operations preserve this property; others destroy it.
Fundamental Closure Properties
Let $k_1$ and $k_2$ be valid (PSD) kernels, and let $c > 0$ be a positive constant. The following operations yield valid kernels:
| Operation | Formula | Valid Kernel? |
|---|---|---|
| Addition | $k(x, x') = k_1(x, x') + k_2(x, x')$ | ✓ Yes |
| Scalar multiplication | $k(x, x') = c \cdot k_1(x, x')$ | ✓ Yes |
| Multiplication (product) | $k(x, x') = k_1(x, x') \cdot k_2(x, x')$ | ✓ Yes |
| Point-wise limit | $k(x, x') = \lim_{n \to \infty} k_n(x, x')$ | ✓ Yes (if limit exists) |
| Exponentiation | $k(x, x') = \exp(k_1(x, x'))$ | ✓ Yes |
| Polynomial with positive coefficients | $k(x, x') = \sum_i a_i k_1(x, x')^i$ for $a_i \geq 0$ | ✓ Yes |
| Function composition | $k(x, x') = k_1(g(x), g(x'))$ for any function $g$ | ✓ Yes |
| Linear combination | $k(x, x') = \alpha k_1 + \beta k_2$ for $\alpha, \beta > 0$ | ✓ Yes |
Subtraction of kernels generally does NOT produce a valid kernel. Even if k₁(x,x') > k₂(x,x') for all x, x', the difference k₁ - k₂ may not be PSD. Similarly, division k₁/k₂ is generally not valid. Always stick to addition, multiplication, and positive scaling.
Why Do These Properties Hold?
Addition: If $\mathbf{K}_1$ and $\mathbf{K}_2$ are PSD matrices, then $\mathbf{K}_1 + \mathbf{K}_2$ is PSD because: $$\mathbf{v}^T(\mathbf{K}_1 + \mathbf{K}_2)\mathbf{v} = \mathbf{v}^T\mathbf{K}_1\mathbf{v} + \mathbf{v}^T\mathbf{K}_2\mathbf{v} \geq 0$$
Multiplication: The element-wise product of PSD matrices (Hadamard product) is PSD. This follows from the Schur product theorem: if $\mathbf{K}_1$ and $\mathbf{K}_2$ are PSD, then $\mathbf{K}_1 \circ \mathbf{K}_2$ (element-wise) is PSD.
Exponentiation: Follows from the power series $\exp(k) = \sum_{n=0}^{\infty} k^n/n!$, where each term $k^n$ is PSD (by repeated multiplication), and the sum converges.
These properties give us a grammar for building kernels: combine primitives (RBF, Matérn, Linear, Periodic) using valid operations to construct arbitrarily complex covariance structures.
```python
import numpy as np
from numpy.linalg import eigvalsh

def check_psd(K, tolerance=1e-8):
    """
    Check if a matrix is positive semi-definite.

    A matrix K is PSD if all eigenvalues are >= 0. In practice, we allow
    small negative eigenvalues due to numerical error.
    """
    eigenvalues = eigvalsh(K)  # Eigenvalues of symmetric matrix
    min_eig = np.min(eigenvalues)
    is_psd = min_eig >= -tolerance
    return is_psd, min_eig

def demonstrate_closure_properties():
    """
    Demonstrate that kernel operations preserve the PSD property.
    """
    # Generate random test points
    np.random.seed(42)
    n = 50
    X = np.random.randn(n, 2)

    # Define two base kernels
    def rbf_kernel(X, length_scale=1.0):
        sq_dists = np.sum((X[:, np.newaxis, :] - X[np.newaxis, :, :])**2, axis=2)
        return np.exp(-sq_dists / (2 * length_scale**2))

    def linear_kernel(X):
        return X @ X.T

    K_rbf = rbf_kernel(X, length_scale=1.0)
    K_linear = linear_kernel(X)

    print("Base kernels:")
    is_psd, min_eig = check_psd(K_rbf)
    print(f"  RBF kernel: PSD={is_psd}, min eigenvalue={min_eig:.2e}")
    is_psd, min_eig = check_psd(K_linear)
    print(f"  Linear kernel: PSD={is_psd}, min eigenvalue={min_eig:.2e}")

    # Test addition
    K_sum = K_rbf + K_linear
    is_psd, min_eig = check_psd(K_sum)
    print(f"\nAddition (RBF + Linear): PSD={is_psd}, min eigenvalue={min_eig:.2e}")

    # Test multiplication (Hadamard product)
    K_prod = K_rbf * K_linear
    is_psd, min_eig = check_psd(K_prod)
    print(f"Multiplication (RBF × Linear): PSD={is_psd}, min eigenvalue={min_eig:.2e}")

    # Test scalar multiplication
    K_scaled = 5.0 * K_rbf
    is_psd, min_eig = check_psd(K_scaled)
    print(f"Scalar multiplication (5 × RBF): PSD={is_psd}, min eigenvalue={min_eig:.2e}")

    # Test exponentiation
    K_exp = np.exp(K_rbf)  # Note: exp of a valid kernel, not exp(-dist^2)
    is_psd, min_eig = check_psd(K_exp)
    print(f"Exponentiation (exp(RBF)): PSD={is_psd}, min eigenvalue={min_eig:.2e}")

    # WARNING: Test subtraction (NOT a valid operation)
    K_diff = K_rbf - 0.5 * K_linear
    is_psd, min_eig = check_psd(K_diff)
    print(f"\n⚠️ Subtraction (RBF - 0.5×Linear): PSD={is_psd}, min eigenvalue={min_eig:.2e}")
    print("   Note: Subtraction does NOT preserve the PSD property!")

demonstrate_closure_properties()
```
Kernel addition is the most common composition operation. When we add kernels, we model the function as a sum of independent components, each with its own characteristic structure.
Mathematical Interpretation
If $k(\mathbf{x}, \mathbf{x}') = k_1(\mathbf{x}, \mathbf{x}') + k_2(\mathbf{x}, \mathbf{x}')$, then samples from the GP can be written as:
$$f(\mathbf{x}) = f_1(\mathbf{x}) + f_2(\mathbf{x})$$
where $f_1 \sim \mathcal{GP}(0, k_1)$ and $f_2 \sim \mathcal{GP}(0, k_2)$ are independent Gaussian Processes.
Intuition: Each component contributes independently to the overall function, and the posterior can be decomposed into per-component contributions, attributing variance to each source.
Common Patterns Using Addition
| Pattern | Composition | Interpretation | Example Application |
|---|---|---|---|
| Trend + Residual | Linear + Matérn | Systematic trend plus correlated noise | Economic time series, growth models |
| Multi-scale smoothness | RBF(ℓ₁) + RBF(ℓ₂) | Long-range + short-range correlations | Spatial fields with multiple scales |
| Signal + Noise | Matérn + White | Smooth signal plus i.i.d. observation noise | Standard regression setup |
| Periodic + Aperiodic | Periodic + RBF | Cyclic pattern plus smooth deviations | Seasonal data with anomalies |
| Global + Local | Linear + Local Periodic | Overall direction plus local oscillations | Stock prices, climate data |
```python
import numpy as np
import matplotlib.pyplot as plt

def rbf_kernel(X1, X2, sigma_sq=1.0, length_scale=1.0):
    """Standard RBF (Squared Exponential) kernel for 1-D inputs."""
    X1, X2 = np.atleast_1d(X1).flatten(), np.atleast_1d(X2).flatten()
    sq_dists = np.subtract.outer(X1, X2)**2
    return sigma_sq * np.exp(-sq_dists / (2 * length_scale**2))

def linear_kernel(X1, X2, sigma_sq=1.0, offset=0.0):
    """Linear (Bayesian linear regression) kernel."""
    X1 = np.atleast_1d(X1).flatten() - offset
    X2 = np.atleast_1d(X2).flatten() - offset
    return sigma_sq * np.outer(X1, X2)

def periodic_kernel(X1, X2, sigma_sq=1.0, period=1.0, length_scale=1.0):
    """Periodic kernel for cyclic patterns."""
    X1, X2 = np.atleast_1d(X1).flatten(), np.atleast_1d(X2).flatten()
    dists = np.abs(np.subtract.outer(X1, X2))
    return sigma_sq * np.exp(-2 * (np.sin(np.pi * dists / period) / length_scale)**2)

def white_noise_kernel(n, sigma_sq=0.1):
    """White noise (diagonal) kernel for observation noise."""
    return sigma_sq * np.eye(n)

def demonstrate_additive_decomposition():
    """
    Demonstrate how additive kernels decompose into interpretable components.
    """
    X = np.linspace(0, 10, 200)
    n = len(X)

    # Define component kernels
    K_trend = linear_kernel(X, X, sigma_sq=0.5, offset=5.0)
    K_smooth = rbf_kernel(X, X, sigma_sq=1.0, length_scale=2.0)
    K_periodic = periodic_kernel(X, X, sigma_sq=0.5, period=2.0, length_scale=0.5)
    K_noise = white_noise_kernel(n, sigma_sq=0.1)

    # Combined kernel
    K_total = K_trend + K_smooth + K_periodic + K_noise
    K_total += 1e-6 * np.eye(n)  # Numerical stability

    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    np.random.seed(42)

    # Sample each component independently
    components = {
        'Trend (Linear)': K_trend,
        'Smooth (RBF)': K_smooth,
        'Periodic': K_periodic,
    }

    samples = {}
    col = 0
    for name, K in components.items():
        K_stable = K + 1e-6 * np.eye(n)
        L = np.linalg.cholesky(K_stable)
        z = np.random.randn(n)
        samples[name] = L @ z  # Save for reuse

        axes[0, col].plot(X, samples[name], 'b-', linewidth=2)
        axes[0, col].set_title(f'Component: {name}')
        axes[0, col].set_xlabel('x')
        axes[0, col].set_ylabel('f(x)')
        axes[0, col].grid(True, alpha=0.3)
        col += 1

    # Combined sample (sum of components + noise)
    noise = np.sqrt(0.1) * np.random.randn(n)
    f_total = (samples['Trend (Linear)'] + samples['Smooth (RBF)']
               + samples['Periodic'] + noise)

    axes[1, 0].plot(X, f_total, 'k-', linewidth=2)
    axes[1, 0].set_title('Combined Function: Trend + Smooth + Periodic + Noise')
    axes[1, 0].set_xlabel('x')
    axes[1, 0].set_ylabel('f(x)')
    axes[1, 0].grid(True, alpha=0.3)

    # Show the stacked contributions
    axes[1, 1].fill_between(X, 0, samples['Trend (Linear)'], alpha=0.7, label='Trend')
    axes[1, 1].fill_between(X, samples['Trend (Linear)'],
                            samples['Trend (Linear)'] + samples['Smooth (RBF)'],
                            alpha=0.7, label='Smooth')
    axes[1, 1].fill_between(X, samples['Trend (Linear)'] + samples['Smooth (RBF)'],
                            samples['Trend (Linear)'] + samples['Smooth (RBF)'] + samples['Periodic'],
                            alpha=0.7, label='Periodic')
    axes[1, 1].set_title('Stacked Component Contributions')
    axes[1, 1].set_xlabel('x')
    axes[1, 1].set_ylabel('f(x)')
    axes[1, 1].legend()
    axes[1, 1].grid(True, alpha=0.3)

    # Covariance matrix of combined kernel
    im = axes[1, 2].imshow(K_total, extent=[0, 10, 10, 0], cmap='viridis')
    axes[1, 2].set_title('Combined Covariance Matrix')
    axes[1, 2].set_xlabel("x'")
    axes[1, 2].set_ylabel('x')
    plt.colorbar(im, ax=axes[1, 2])

    plt.suptitle('Additive Kernel Composition: Decomposing Complex Functions', fontsize=14)
    plt.tight_layout()
    return fig

def posterior_decomposition_example():
    """
    Show how posterior inference decomposes additive components.

    After observing data, we can attribute variance to each source.
    """
    # Generate synthetic data from known components
    np.random.seed(42)
    X_train = np.sort(np.random.uniform(0, 10, 30))

    # True components
    true_trend = 0.5 * (X_train - 5)
    true_periodic = 1.5 * np.sin(2 * np.pi * X_train / 3)
    noise = 0.5 * np.random.randn(len(X_train))
    y_train = true_trend + true_periodic + noise

    # Fit GP with additive kernel and decompose posterior
    # (A full implementation would compute the posterior for each component)
    print("Posterior decomposition allows:")
    print("1. Estimating the trend component separately from oscillations")
    print("2. Quantifying uncertainty in each component")
    print("3. Forecasting each component independently")
    print("4. Identifying which component explains most variance")
    return X_train, y_train

demonstrate_additive_decomposition()
```
A powerful property of additive GPs: after observing data, we can decompose the posterior into contributions from each component. This is not just conceptual: we can compute E[f₁|y] and Var[f₁|y] separately from f₂, attributing the observed variation to interpretable sources.
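The decomposition itself is a small linear-algebra step. The sketch below (reusing the kernel helpers defined above, and assuming a zero-mean additive model $y = f_1 + \dots + f_m + \varepsilon$ with i.i.d. noise) computes the per-component posterior at the training inputs via $\mathbb{E}[f_j \mid \mathbf{y}] = \mathbf{K}_j \mathbf{K}_{\text{total}}^{-1}\mathbf{y}$ and $\text{Cov}[f_j \mid \mathbf{y}] = \mathbf{K}_j - \mathbf{K}_j \mathbf{K}_{\text{total}}^{-1} \mathbf{K}_j$:

```python
import numpy as np

def decompose_additive_posterior(K_list, y, noise_var=0.25):
    """
    Posterior mean and per-point variance of each additive component
    at the training inputs.

    Assumes a zero-mean GP model y = f_1 + ... + f_m + eps, where component j
    has prior covariance K_list[j] and eps is i.i.d. noise with variance
    noise_var. For each component:
        E[f_j | y]   = K_j (K_total)^-1 y
        Cov[f_j | y] = K_j - K_j (K_total)^-1 K_j
    """
    n = len(y)
    K_total = sum(K_list) + noise_var * np.eye(n)
    alpha = np.linalg.solve(K_total, y)          # (K_total)^-1 y, shared by all components
    results = []
    for K_j in K_list:
        mean_j = K_j @ alpha
        cov_j = K_j - K_j @ np.linalg.solve(K_total, K_j)
        results.append((mean_j, np.diag(cov_j)))  # per-point mean and variance
    return results

# Hypothetical usage with the data and kernels from the listing above:
# X_train, y_train = posterior_decomposition_example()
# K1 = linear_kernel(X_train, X_train, sigma_sq=0.5, offset=5.0)
# K2 = periodic_kernel(X_train, X_train, sigma_sq=1.0, period=3.0, length_scale=0.5)
# (trend_mean, trend_var), (per_mean, per_var) = decompose_additive_posterior([K1, K2], y_train)
```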
Kernel multiplication creates more complex interactions between components. Unlike addition, products don't decompose into independent pieces—they model situations where components modulate each other.
Mathematical Interpretation
If $k(\mathbf{x}, \mathbf{x}') = k_1(\mathbf{x}, \mathbf{x}') \cdot k_2(\mathbf{x}, \mathbf{x}')$, this is NOT equivalent to $f = f_1 \cdot f_2$ for independent GPs. Instead, the product kernel requires both kernels to indicate correlation for the product to be high.
Intuition: Think of $k_1$ as a 'gate' for $k_2$. If $k_1(\mathbf{x}, \mathbf{x}') = 0$, then the product is zero regardless of $k_2$. This creates sophisticated dependency structures.
Key Product Kernel Patterns
| Pattern | Composition | Effect | Use Case |
|---|---|---|---|
| Locally Periodic | Periodic × RBF | Periodicity that fades with distance | Quasi-periodic phenomena, music, speech |
| Localized Features | RBF(small ℓ) × RBF(large ℓ) | Features only exist within a region | Spatial heterogeneity |
| Damped Linear | Linear × RBF | Trend that applies only locally | Local trends in non-stationary data |
| Tensor Product | k₁(x₁)×k₂(x₂) | Separable multi-dimensional kernel | Space-time, multi-output GPs |
| ARD (Automatic Relevance Determination) | ∏ᵢ kᵢ(xᵢ, x'ᵢ) | Different length-scale per dimension | Feature selection, high-dimensional data |
```python
import numpy as np
import matplotlib.pyplot as plt

def locally_periodic_kernel(X1, X2, sigma_sq=1.0, period=1.0,
                            length_scale_periodic=0.5, length_scale_decay=5.0):
    """
    Locally Periodic = Periodic × RBF

    Creates patterns that repeat but fade with distance. The RBF acts as an
    envelope that modulates the periodic structure.
    """
    X1 = np.atleast_1d(X1).flatten()
    X2 = np.atleast_1d(X2).flatten()

    # Periodic component
    dists = np.abs(np.subtract.outer(X1, X2))
    K_periodic = np.exp(-2 * (np.sin(np.pi * dists / period) / length_scale_periodic)**2)

    # RBF decay envelope
    sq_dists = np.subtract.outer(X1, X2)**2
    K_rbf = np.exp(-sq_dists / (2 * length_scale_decay**2))

    # Product kernel
    return sigma_sq * K_periodic * K_rbf

def demonstrate_product_vs_sum():
    """
    Compare additive vs multiplicative composition for Periodic and RBF.
    """
    X = np.linspace(0, 20, 300)
    n = len(X)

    # Define base kernels
    def periodic(X1, X2, period=3.0, ell=0.5):
        dists = np.abs(np.subtract.outer(X1, X2))
        return np.exp(-2 * (np.sin(np.pi * dists / period) / ell)**2)

    def rbf(X1, X2, ell=5.0):
        sq_dists = np.subtract.outer(X1, X2)**2
        return np.exp(-sq_dists / (2 * ell**2))

    K_per = periodic(X, X)
    K_rbf = rbf(X, X)

    # Additive: Periodic + RBF
    K_sum = K_per + K_rbf

    # Multiplicative: Periodic × RBF
    K_prod = K_per * K_rbf

    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    np.random.seed(42)

    # Sample from periodic kernel
    L_per = np.linalg.cholesky(K_per + 1e-6 * np.eye(n))
    f_per = L_per @ np.random.randn(n)
    axes[0, 0].plot(X, f_per)
    axes[0, 0].set_title('Periodic Kernel Samples')
    axes[0, 0].grid(True, alpha=0.3)

    # Sample from RBF kernel
    L_rbf = np.linalg.cholesky(K_rbf + 1e-6 * np.eye(n))
    f_rbf = L_rbf @ np.random.randn(n)
    axes[0, 1].plot(X, f_rbf)
    axes[0, 1].set_title('RBF Kernel Samples')
    axes[0, 1].grid(True, alpha=0.3)

    # Covariance comparison
    axes[0, 2].plot(X, K_per[n//2, :], label='Periodic', linewidth=2)
    axes[0, 2].plot(X, K_rbf[n//2, :], label='RBF', linewidth=2)
    axes[0, 2].axvline(X[n//2], color='gray', linestyle='--', alpha=0.5)
    axes[0, 2].set_title('Covariance from Midpoint')
    axes[0, 2].legend()
    axes[0, 2].grid(True, alpha=0.3)

    # Sample from SUM kernel (Periodic + RBF)
    L_sum = np.linalg.cholesky(K_sum + 1e-6 * np.eye(n))
    for i in range(5):
        f_sum = L_sum @ np.random.randn(n)
        axes[1, 0].plot(X, f_sum, alpha=0.7)
    axes[1, 0].set_title('SUM: Periodic + RBF\n(Independent components)')
    axes[1, 0].grid(True, alpha=0.3)

    # Sample from PRODUCT kernel (Periodic × RBF)
    L_prod = np.linalg.cholesky(K_prod + 1e-6 * np.eye(n))
    for i in range(5):
        f_prod = L_prod @ np.random.randn(n)
        axes[1, 1].plot(X, f_prod, alpha=0.7)
    axes[1, 1].set_title('PRODUCT: Periodic × RBF\n(Locally periodic = decaying oscillations)')
    axes[1, 1].grid(True, alpha=0.3)

    # Covariance comparison for composed kernels
    axes[1, 2].plot(X, K_sum[n//2, :], label='Sum', linewidth=2)
    axes[1, 2].plot(X, K_prod[n//2, :], label='Product', linewidth=2)
    axes[1, 2].axvline(X[n//2], color='gray', linestyle='--', alpha=0.5)
    axes[1, 2].set_title('Composed Kernels: Covariance from Midpoint')
    axes[1, 2].legend()
    axes[1, 2].grid(True, alpha=0.3)

    plt.suptitle('Addition vs Multiplication: Fundamentally Different Behaviors', fontsize=14)
    plt.tight_layout()
    return fig

def ard_kernel_example():
    """
    Automatic Relevance Determination: product of per-dimension kernels.

    This is crucial for feature selection in high-dimensional problems.
    """
    print("ARD Kernel (Automatic Relevance Determination):")
    print("=" * 50)
    print()
    print("k_ARD(x, x') = σ² · exp(-Σᵢ (xᵢ - x'ᵢ)² / (2ℓᵢ²))")
    print()
    print("Each dimension i has its own length-scale ℓᵢ")
    print()
    print("Feature Selection Interpretation:")
    print("- Large ℓᵢ → dimension i is irrelevant (flat kernel)")
    print("- Small ℓᵢ → dimension i is highly relevant (sensitive)")
    print()
    print("In practice: optimize all ℓᵢ via marginal likelihood")
    print("Irrelevant features get ℓᵢ → ∞ automatically")

    # Demonstrate in 2D with different relevances
    np.random.seed(42)
    n = 100
    X = np.random.randn(n, 2)

    # Dimension 1 relevant (small length-scale), Dimension 2 irrelevant (large)
    length_scales = np.array([0.5, 10.0])

    def ard_kernel(X1, X2, length_scales, sigma_sq=1.0):
        """ARD-RBF kernel."""
        X1_scaled = X1 / length_scales
        X2_scaled = X2 / length_scales
        sq_dists = np.sum((X1_scaled[:, np.newaxis, :] - X2_scaled[np.newaxis, :, :])**2, axis=2)
        return sigma_sq * np.exp(-sq_dists / 2)

    K = ard_kernel(X, X, length_scales)

    print(f"\nExample: length_scales = {length_scales}")
    print(f"Dimension 1 (ℓ={length_scales[0]}): RELEVANT")
    print(f"Dimension 2 (ℓ={length_scales[1]}): IRRELEVANT")
    return K

demonstrate_product_vs_sum()
```
The Locally Periodic Kernel in Detail
The locally periodic pattern deserves special attention as it's one of the most useful compositions:
$$k_{\text{LP}}(x, x') = k_{\text{Periodic}}(x, x') \cdot k_{\text{RBF}}(x, x')$$
This models patterns that repeat with a fixed period but whose shape and amplitude can drift: nearby cycles are strongly correlated, while cycles far apart become effectively independent.
Typical applications include quasi-periodic signals such as music and speech, and seasonal data whose cycles evolve over time.
The RBF component acts as a correlation envelope: patterns can repeat within this envelope, but distant points are uncorrelated regardless of their periodic relationship.
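Written out with the parameterization used in locally_periodic_kernel above (period $p$, periodic length-scale $\ell_p$, decay length-scale $\ell_d$; the symbols are simply names for those code parameters), the product is:

$$k_{\text{LP}}(x, x') = \sigma^2 \exp\!\left(-\frac{2\sin^2\!\big(\pi |x - x'| / p\big)}{\ell_p^2}\right) \exp\!\left(-\frac{(x - x')^2}{2\ell_d^2}\right)$$

As $|x - x'|$ grows, the second factor drives the covariance toward zero regardless of the periodic term, which is exactly the envelope behavior described above.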
Beyond simple addition and multiplication, sophisticated compositions address specific modeling challenges. This section covers patterns that arise frequently in real applications.
Pattern 1: Change Points
Model functions that behave differently before and after a change point $c$:
$$k_{\text{CP}}(x, x') = k_1(x, x') \cdot (1-\sigma(x))(1-\sigma(x')) + k_2(x, x') \cdot \sigma(x)\sigma(x')$$
Where $\sigma(x) = \text{sigmoid}((x - c)/s)$ is a smooth transition function.
Before $c$, behavior is governed by $k_1$; after $c$, by $k_2$; near $c$, the two regimes are smoothly interpolated.
Pattern 2: Hierarchical / Multi-Level Structure
For data with nested structure (e.g., measurements within subjects within groups):
$$k(x, x') = k_{\text{group}}(g(x), g(x')) + k_{\text{subject}}(s(x), s(x')) + k_{\text{residual}}(x, x')$$
Where $g(x)$ and $s(x)$ extract group and subject indicators.
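As a minimal sketch of this pattern (the [x, group_id, subject_id] column layout and all variance parameters below are illustrative assumptions, not any particular library's API), the group and subject terms can be implemented as indicator ("block") kernels, which are valid PSD kernels, with an RBF residual over the continuous input:

```python
import numpy as np

def hierarchical_kernel(Z1, Z2, sigma_group=1.0, sigma_subject=0.5,
                        sigma_res=0.3, length_scale=1.0):
    """
    Sketch of a hierarchical (multi-level) kernel.

    Each row of Z is assumed to be [x, group_id, subject_id]. The group and
    subject components use indicator ("block") kernels, and the residual is
    an RBF over the continuous input x.
    """
    x1, g1, s1 = Z1[:, 0], Z1[:, 1], Z1[:, 2]
    x2, g2, s2 = Z2[:, 0], Z2[:, 1], Z2[:, 2]

    K_group = sigma_group**2 * (g1[:, None] == g2[None, :])      # shared group effect
    K_subject = sigma_subject**2 * (s1[:, None] == s2[None, :])  # shared subject effect
    sq_dists = (x1[:, None] - x2[None, :])**2
    K_res = sigma_res**2 * np.exp(-sq_dists / (2 * length_scale**2))
    return K_group + K_subject + K_res

# Hypothetical usage: Z = np.column_stack([x_values, group_ids, subject_ids])
# K = hierarchical_kernel(Z, Z)
```

Because the three terms are added, the posterior again decomposes into group-level, subject-level, and residual contributions, exactly as in the additive examples earlier.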
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import expit  # Sigmoid function

def changepoint_kernel(X1, X2, k1_fn, k2_fn, changepoint=5.0, steepness=2.0):
    """
    Changepoint kernel: different behavior before/after a transition.

    k_CP(x,x') = k₁(x,x')·(1-σ(x))(1-σ(x')) + k₂(x,x')·σ(x)σ(x')

    Parameters
    ----------
    k1_fn, k2_fn : callables
        Kernel functions for before/after the changepoint
    changepoint : float
        Location of the transition
    steepness : float
        How sharp the transition is (higher = sharper)
    """
    X1 = np.atleast_1d(X1).flatten()
    X2 = np.atleast_1d(X2).flatten()

    # Transition function (sigmoid)
    sigma1 = expit(steepness * (X1 - changepoint))
    sigma2 = expit(steepness * (X2 - changepoint))

    # Create weight matrices
    w_after = np.outer(sigma1, sigma2)           # After changepoint
    w_before = np.outer(1 - sigma1, 1 - sigma2)  # Before changepoint

    # Combine kernels
    K1 = k1_fn(X1, X2)
    K2 = k2_fn(X1, X2)
    return K1 * w_before + K2 * w_after

def demonstrate_changepoint():
    """
    Show how a changepoint kernel transitions between regimes.
    """
    X = np.linspace(0, 10, 200)
    n = len(X)
    changepoint = 5.0

    # Define two different kernels for before/after
    def k_smooth(X1, X2):
        """Smooth, slowly-varying behavior."""
        sq_dists = np.subtract.outer(X1, X2)**2
        return np.exp(-sq_dists / (2 * 2.0**2))   # Large length-scale

    def k_wiggly(X1, X2):
        """Rapidly varying behavior."""
        sq_dists = np.subtract.outer(X1, X2)**2
        return np.exp(-sq_dists / (2 * 0.3**2))   # Small length-scale

    K_cp = changepoint_kernel(X, X, k_smooth, k_wiggly, changepoint=5.0, steepness=2.0)
    K_cp += 1e-6 * np.eye(n)

    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    # Samples from changepoint kernel
    np.random.seed(42)
    L = np.linalg.cholesky(K_cp)
    for i in range(5):
        sample = L @ np.random.randn(n)
        axes[0].plot(X, sample, alpha=0.7)
    axes[0].axvline(changepoint, color='red', linestyle='--', label='Changepoint')
    axes[0].set_title('Changepoint Kernel: Smooth → Wiggly')
    axes[0].set_xlabel('x')
    axes[0].set_ylabel('f(x)')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)

    # Covariance matrix
    im = axes[1].imshow(K_cp, extent=[0, 10, 10, 0], cmap='viridis')
    axes[1].axhline(changepoint, color='red', linestyle='--')
    axes[1].axvline(changepoint, color='red', linestyle='--')
    axes[1].set_title('Covariance Matrix')
    axes[1].set_xlabel("x'")
    axes[1].set_ylabel('x')
    plt.colorbar(im, ax=axes[1])

    # Transition weights
    sigma = expit(2.0 * (X - changepoint))
    axes[2].plot(X, 1 - sigma, 'b-', linewidth=2, label='Weight for k₁ (smooth)')
    axes[2].plot(X, sigma, 'r-', linewidth=2, label='Weight for k₂ (wiggly)')
    axes[2].axvline(changepoint, color='gray', linestyle='--', alpha=0.5)
    axes[2].set_title('Transition Weights')
    axes[2].set_xlabel('x')
    axes[2].set_ylabel('Weight')
    axes[2].legend()
    axes[2].grid(True, alpha=0.3)

    plt.suptitle('Changepoint Kernel: Regime-Switching Behavior', fontsize=14)
    plt.tight_layout()
    return fig

def multi_output_kernel():
    """
    Multi-output GPs using kernel composition.

    For modeling multiple related outputs (y₁, y₂, ...) jointly:
        K(x,x') = B ⊗ k(x,x')
    where B is a positive semi-definite matrix of output correlations.
    """
    print("Multi-Output (Vector-Valued) GPs via Kernel Composition")
    print("=" * 55)
    print()
    print("For D outputs, construct a DxD output correlation matrix B:")
    print("  B[i,j] = correlation between output i and output j")
    print()
    print("The full kernel is a Kronecker product:")
    print("  K_full((x,i), (x',j)) = B[i,j] × k_base(x, x')")
    print()
    print("This models:")
    print("  - Each output as a GP with kernel k_base")
    print("  - Cross-output correlations via B")
    print()
    print("Example: Modeling temperature and humidity jointly")
    print("  - k_base = Matérn captures spatial correlation")
    print("  - B captures that temperature and humidity are related")

    # Simple example: 2 outputs, correlation between them encoded in B
    B = np.array([[1.0, 0.7],
                  [0.7, 1.0]])

    # Base kernel for locations
    X = np.array([0, 1, 2, 3, 4])
    K_base = np.exp(-np.subtract.outer(X, X)**2)

    # Full kernel via Kronecker product
    K_full = np.kron(B, K_base)

    print(f"\nBase kernel shape: {K_base.shape}")
    print(f"Output correlation matrix B shape: {B.shape}")
    print(f"Full multi-output kernel shape: {K_full.shape}")
    return K_full

demonstrate_changepoint()
```
The compositional rules for kernels form a grammar—a set of rules for generating valid expressions. This observation led to the Automatic Statistician project and related work on kernel structure search, which automatically discovers interpretable kernel structures from data.
The Grammar
Define a set of base kernels: {RBF, Linear, Periodic, ...}
Define operations: {+, ×}
The kernel grammar generates all valid kernels by recursive application:
K → Base
K → K + K
K → K × K
This generates a (countably infinite) set of kernels: RBF, Linear, RBF+Linear, RBF×Periodic, (RBF+Linear)×Periodic, etc.
Interpretation
Each grammar production has an interpretable meaning: a base kernel names a primitive behavior (smoothness, trend, periodicity, noise), a sum reads as "this effect plus, independently, that effect," and a product reads as "this effect modulated or gated by that effect."
A key benefit of the kernel grammar: discovered structures can be translated to natural language! For example, 'Linear + Periodic × RBF' becomes 'a linear trend plus a periodic pattern that decays over time.' This makes GP models interpretable even when automatically discovered.
```python
import numpy as np
from itertools import product

class KernelExpression:
    """
    Abstract syntax tree for kernel expressions.

    Enables composition and natural language interpretation.
    """
    pass

class BaseKernel(KernelExpression):
    def __init__(self, name, description):
        self.name = name
        self.description = description

    def __repr__(self):
        return self.name

    def describe(self):
        return self.description

class SumKernel(KernelExpression):
    def __init__(self, left, right):
        self.left = left
        self.right = right

    def __repr__(self):
        return f"({self.left} + {self.right})"

    def describe(self):
        return f"{self.left.describe()}, plus independently {self.right.describe()}"

class ProductKernel(KernelExpression):
    def __init__(self, left, right):
        self.left = left
        self.right = right

    def __repr__(self):
        return f"({self.left} × {self.right})"

    def describe(self):
        return f"{self.right.describe()}, but only where {self.left.describe()} applies"

# Define base kernels with interpretations
RBF = BaseKernel("RBF", "smooth, locally correlated variations")
LIN = BaseKernel("Linear", "a linear trend")
PER = BaseKernel("Periodic", "repeating periodic patterns")
WN = BaseKernel("WhiteNoise", "independent noise at each point")

def generate_depth_1_kernels(bases):
    """Generate all kernels of depth 1 (single base)."""
    return bases.copy()

def generate_depth_2_kernels(bases):
    """Generate all kernels of depth 2 (base op base)."""
    results = []
    for b1, b2 in product(bases, repeat=2):
        results.append(SumKernel(b1, b2))
        results.append(ProductKernel(b1, b2))
    return results

def demonstrate_grammar():
    """Show how the grammar generates interpretable expressions."""
    print("Kernel Grammar: Compositional Structure Discovery")
    print("=" * 55)
    print()

    bases = [RBF, LIN, PER]
    print("Base Kernels:")
    for b in bases:
        print(f"  {b.name}: {b.description}")

    print()
    print("Example Composed Kernels and Their Interpretations:")
    print("-" * 55)

    # Manual examples of useful compositions
    examples = [
        SumKernel(LIN, RBF),
        ProductKernel(PER, RBF),
        SumKernel(LIN, ProductKernel(PER, RBF)),
        ProductKernel(LIN, PER),
        SumKernel(SumKernel(LIN, PER), RBF),
    ]

    for expr in examples:
        print(f"\n  Expression: {expr}")
        print(f"  Meaning: The function exhibits {expr.describe()}")

    print()
    print("-" * 55)
    print("KEY INSIGHT: The grammar enables automatic model search!")
    print("Algorithms can enumerate expressions and select the best via")
    print("marginal likelihood, then explain findings in natural language.")

def greedy_kernel_search(X, y, max_depth=3, bases=None):
    """
    Simplified greedy kernel structure search.

    Real implementations (e.g., in the Automatic Statistician line of work) use:
    - Efficient marginal likelihood computation
    - Beam search or MCTS for exploration
    - Regularization to prevent overfitting
    """
    if bases is None:
        bases = [RBF, LIN, PER]

    print("Greedy Kernel Search Algorithm:")
    print("=" * 40)
    print("1. Start with each base kernel")
    print("2. Evaluate marginal likelihood for current best")
    print("3. Try all single expansions: k+base, k×base")
    print("4. Keep best expansion if it improves")
    print("5. Repeat until max depth or no improvement")
    print()
    print("This finds interpretable structure automatically!")

    # Pseudocode for an actual implementation:
    """
    current_best = argmax(bases, key=compute_log_ml)
    for depth in range(max_depth):
        candidates = []
        for base in bases:
            candidates.append(SumKernel(current_best, base))
            candidates.append(ProductKernel(current_best, base))
        new_best = argmax(candidates, key=compute_log_ml)
        if log_ml(new_best) <= log_ml(current_best) + threshold:
            break  # No improvement
        current_best = new_best
    return current_best
    """

demonstrate_grammar()
```
Structure Search Algorithms
Several approaches exist for searching the kernel grammar:
Greedy expansion: Start with base kernels, iteratively add the operation that most improves marginal likelihood
Compositional Kernel Search (CKS): Use Bayesian optimization over the discrete space of kernel structures
Neural architecture search adaptations: Treat kernel structure as a differentiable architecture
Monte Carlo Tree Search (MCTS): Explore the tree of possible kernel expressions using UCB-style exploration
Practical Tools:
Most GP libraries support composition directly, e.g. via kern.add() and kern.prod() for manual composition, or via overloaded + and * operators on kernel objects (see the sketch below).
When to Use Automated Search:
Automated structure search pays off when the data-generating structure is unknown and you want an interpretable model as a starting point; when domain knowledge already suggests a structure, or when repeated marginal-likelihood evaluations would be too expensive, hand-designed kernels are usually the better choice.
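As one concrete illustration of manual composition, the sketch below uses scikit-learn's kernel classes (not necessarily the library the kern.add()/kern.prod() API above refers to); all hyperparameter values are placeholder choices:

```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (
    RBF, ExpSineSquared, DotProduct, WhiteKernel
)

# Trend + locally periodic pattern + observation noise:
#   Linear  +  (Periodic × RBF)  +  WhiteNoise
kernel = (
    DotProduct(sigma_0=1.0)                               # linear trend
    + ExpSineSquared(length_scale=1.0, periodicity=1.0)   # periodic pattern...
      * RBF(length_scale=10.0)                            # ...that decays with distance
    + WhiteKernel(noise_level=0.1)                        # i.i.d. observation noise
)

gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
# gpr.fit(X_train, y_train)  # hyperparameters of all components are optimized jointly
# print(gpr.kernel_)         # the fitted, fully parameterized composite kernel
```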
Designing kernels for a real problem requires balancing domain knowledge, data exploration, and principled model comparison. This section provides a practical workflow.
Signs of a well-designed kernel:
✓ Residuals appear i.i.d. (no spatial/temporal patterns)
✓ Predictions are sensible extrapolations
✓ Uncertainty grows appropriately in data-sparse regions
✓ Kernel structure has a plausible interpretation
✓ Cross-validation confirms the training-set LML (log marginal likelihood)
Warning signs:
✗ Many hyperparameters relative to the number of data points
✗ Very small length-scales (overfitting)
✗ Predictions collapse or explode outside the training range
✗ Cross-validation performance much worse than the training LML suggests
✗ Kernel structure uninterpretable
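A minimal sketch of the last two checks, using scikit-learn with placeholder data and a placeholder kernel (the threshold for "much worse" is a judgment call, not a fixed rule):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Placeholder model and data; in practice substitute your composed kernel and dataset.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(60, 1))
y = np.sin(X[:, 0]) + 0.2 * rng.standard_normal(60)

gpr = GaussianProcessRegressor(kernel=RBF(1.0) + WhiteKernel(0.1), normalize_y=True)
gpr.fit(X, y)

# Training-set fit: the (optimized) log marginal likelihood
print("Training LML:", gpr.log_marginal_likelihood_value_)

# Held-out fit: cross-validated R^2 (any proper held-out score works here)
cv_scores = cross_val_score(gpr, X, y, cv=5)
print("5-fold CV R^2:", cv_scores.mean())

# A great training LML combined with a poor CV score is the overfitting
# warning sign listed above.
```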
Common Mistakes and How to Avoid Them
| Mistake | Symptom | Solution |
|---|---|---|
| Too many components | Poor cross-validation, many small hyperparameters | Remove least important components |
| Ignoring periodicity | Periodic residuals, poor forecasting | Add Periodic×RBF component |
| Wrong periodic frequency | Resonance patterns in predictions | Check data for true period, try learning p |
| Rigid periodicity | Mismatch between model and evolving patterns | Use Locally Periodic (Periodic×RBF) |
| Linear without bound | Unrealistic extrapolation | Cap Linear with saturating function or RBF |
| Ignoring non-stationarity | Spatially-varying residual patterns | Add Gibbs kernel or input warping |
Final Advice: Kernel design is iterative. Expect to go through several cycles of 'try → evaluate → refine'. Document your reasoning so you (and others) can understand the final model later.
Kernel composition is the key to building expressive, interpretable GP models. By combining simple building blocks, we can encode sophisticated prior knowledge about function behavior.
What's Next:
We've seen how to build kernels from components. The next page explores spectral analysis—understanding kernels through their Fourier representations. This perspective provides deep insight into what stationary kernels encode and enables the powerful Spectral Mixture Kernel that can approximate any stationary kernel.
You now understand kernel composition: the mathematical foundations, additive vs multiplicative patterns, advanced compositions like changepoints, and the kernel grammar for automated discovery. This toolkit enables you to design GP models that match the true structure of your data.