In Gaussian Process regression, the kernel function (also called the covariance function) is arguably the single most important modeling decision you will make. While the GP framework provides the probabilistic machinery for inference and uncertainty quantification, it is the kernel that encodes your prior beliefs about the function you are trying to learn.
Think of the kernel as the language you use to communicate with the learning algorithm. Through the kernel, you tell the GP which functions are plausible a priori: how smooth they are, how quickly they vary, how large their fluctuations are, and whether they repeat.
By the end of this page, you will understand the mathematical requirements for valid kernels, the key properties that kernels encode, the most important kernel families used in practice, and the principles for designing kernels that capture your domain knowledge. You will see how a well-chosen kernel can make the difference between a GP that captures the true underlying function and one that fails entirely.
A kernel function $k: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ defines the covariance between function values at any two input points. For a Gaussian Process $f \sim \mathcal{GP}(m, k)$ with mean function $m$ and kernel $k$:
$$\text{Cov}(f(\mathbf{x}), f(\mathbf{x}')) = k(\mathbf{x}, \mathbf{x}')$$
This seemingly simple definition carries profound implications. The kernel completely determines the prior distribution over functions before any data is observed. But not every function $k(\mathbf{x}, \mathbf{x}')$ qualifies as a valid kernel.
For a function to be a valid kernel, it must be positive semi-definite (PSD). This means that for any finite set of points $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$, the resulting kernel matrix $K$ with entries $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$ must be positive semi-definite: $\mathbf{v}^T K \mathbf{v} \geq 0$ for all vectors $\mathbf{v} \in \mathbb{R}^n$.
Why is positive semi-definiteness required?
The fundamental reason is that a kernel matrix $K$ serves as the covariance matrix of a multivariate Gaussian distribution. For a distribution $\mathbf{f} \sim \mathcal{N}(\mathbf{m}, K)$ to be well-defined, the variance of every linear combination of function values must be non-negative: $\text{Var}(\mathbf{v}^T \mathbf{f}) = \mathbf{v}^T K \mathbf{v} \geq 0$ for every $\mathbf{v}$. This requirement is satisfied if and only if $K$ is positive semi-definite.
```python
import numpy as np
from numpy.linalg import eigvalsh


def is_positive_semidefinite(K: np.ndarray, tol: float = 1e-10) -> bool:
    """
    Check if a matrix K is positive semi-definite.

    A matrix is PSD if and only if all its eigenvalues are non-negative.
    We use a small tolerance for numerical stability.

    Parameters:
    -----------
    K : np.ndarray
        Square matrix to check (should be symmetric)
    tol : float
        Tolerance for considering eigenvalues as non-negative

    Returns:
    --------
    bool : True if K is positive semi-definite
    """
    # First check symmetry (kernel matrices should be symmetric)
    if not np.allclose(K, K.T, atol=tol):
        print("Warning: Matrix is not symmetric")
        return False

    # Compute eigenvalues (eigvalsh is for symmetric matrices)
    eigenvalues = eigvalsh(K)

    # Check if all eigenvalues are non-negative (within tolerance)
    min_eigenvalue = np.min(eigenvalues)

    if min_eigenvalue < -tol:
        print(f"Minimum eigenvalue: {min_eigenvalue:.2e} - NOT PSD")
        return False

    print(f"Minimum eigenvalue: {min_eigenvalue:.2e} - Matrix is PSD")
    return True


def demonstrate_psd_check():
    """Demonstrate PSD checking with various matrices."""
    # Example 1: Valid kernel matrix (RBF kernel)
    X = np.array([[0], [1], [2], [3]])
    length_scale = 1.0

    # RBF kernel: k(x, x') = exp(-||x - x'||^2 / (2 * l^2))
    K_rbf = np.exp(-0.5 * np.sum((X[:, None] - X[None, :]) ** 2, axis=2) / length_scale**2)

    print("RBF Kernel Matrix:")
    print(K_rbf.round(4))
    is_positive_semidefinite(K_rbf)  # Should be True

    print("\n" + "=" * 50 + "\n")

    # Example 2: Invalid "kernel" - not PSD
    # This is NOT a valid kernel function!
    K_invalid = np.array([
        [1.0, 0.9, 0.9],
        [0.9, 1.0, -0.9],  # Negative correlation
        [0.9, -0.9, 1.0]
    ])

    print("Invalid 'Kernel' Matrix (not actually a kernel):")
    print(K_invalid)
    is_positive_semidefinite(K_invalid)  # Should be False


if __name__ == "__main__":
    demonstrate_psd_check()
```

Mercer's Theorem: The Foundation of Kernels
Mercer's theorem provides the theoretical foundation for understanding kernels. It states that any continuous, symmetric, positive semi-definite function $k(\mathbf{x}, \mathbf{x}')$ can be expressed as an inner product in some (possibly infinite-dimensional) Hilbert space $\mathcal{H}$:
$$k(\mathbf{x}, \mathbf{x}') = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle_{\mathcal{H}}$$
where $\phi: \mathcal{X} \rightarrow \mathcal{H}$ is a feature mapping. This is profound: every valid kernel implicitly defines a feature space, even if that feature space is infinite-dimensional.
For Gaussian Processes, this means that GP regression with kernel $k$ is equivalent to Bayesian linear regression on the (possibly infinite-dimensional) features $\phi(\mathbf{x})$: the GP works in that feature space implicitly, without ever constructing $\phi$ explicitly.
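To make the feature-space view concrete, here is a small numerical check using the degree-2 polynomial kernel $k(x, x') = (1 + xx')^2$ as an illustrative example (the helper names below are written just for this demonstration): its explicit feature map is $\phi(x) = [1, \sqrt{2}\,x, x^2]$, and the inner product of the features reproduces the kernel exactly. For kernels like the RBF, the corresponding feature space is infinite-dimensional, so no such finite check is possible; the GP simply uses the kernel directly.

```python
import numpy as np


def poly2_kernel(x: float, y: float) -> float:
    """Degree-2 polynomial kernel k(x, y) = (1 + x*y)**2."""
    return (1.0 + x * y) ** 2


def poly2_features(x: float) -> np.ndarray:
    """Explicit (finite-dimensional) feature map for the same kernel."""
    return np.array([1.0, np.sqrt(2.0) * x, x ** 2])


# The kernel value equals the inner product of the explicit features,
# illustrating Mercer's theorem for a kernel with a finite feature map.
for x, y in [(0.5, 2.0), (-1.0, 3.0), (1.5, 1.5)]:
    k_val = poly2_kernel(x, y)
    phi_val = poly2_features(x) @ poly2_features(y)
    assert np.isclose(k_val, phi_val)
    print(f"k({x}, {y}) = {k_val:.4f},  <phi(x), phi(y)> = {phi_val:.4f}")
```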
Different kernels encode different assumptions about the functions they generate. Understanding these properties is essential for informed kernel selection.
| Property | Mathematical Characterization | Effect on Generated Functions |
|---|---|---|
| Stationarity | $k(\mathbf{x}, \mathbf{x}') = k(\mathbf{x} - \mathbf{x}')$ | Translation invariance; same behavior everywhere |
| Isotropy | $k(\mathbf{x}, \mathbf{x}') = k(\|\mathbf{x} - \mathbf{x}'\|)$ | Rotation invariance; no preferred direction |
| Smoothness | Differentiability of $k$ at origin | Differentiability of sample paths |
| Length Scale $\ell$ | Rate of decay of $k$ with $\|\mathbf{x} - \mathbf{x}'\|$ | Characteristic variation scale |
| Marginal Variance $\sigma^2$ | $k(\mathbf{x}, \mathbf{x}) = \sigma^2$ | Typical amplitude of deviations |
| Periodicity $p$ | $k(\mathbf{x} + p, \mathbf{x}' + p) = k(\mathbf{x}, \mathbf{x}')$ | Repeating patterns with period $p$ |
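As a quick numerical sanity check of the stationarity and periodicity rows in this table, the sketch below defines simple one-dimensional kernel functions inline (illustrative helpers written just for this check, not part of any implementation on this page) and verifies both invariances.

```python
import numpy as np

ell, sigma2, p = 1.5, 1.0, 2.0  # length scale, signal variance, period


def k_rbf(x, y):
    """Stationary RBF kernel in 1D."""
    return sigma2 * np.exp(-(x - y) ** 2 / (2 * ell ** 2))


def k_per(x, y):
    """Standard periodic kernel in 1D with period p."""
    return sigma2 * np.exp(-2 * np.sin(np.pi * np.abs(x - y) / p) ** 2 / ell ** 2)


x, y, shift = 0.7, 2.3, 5.0

# Stationarity: shifting both inputs by the same amount leaves k unchanged.
assert np.isclose(k_rbf(x, y), k_rbf(x + shift, y + shift))

# Periodicity: shifting both inputs by one period leaves k unchanged
# (shifting either input by p alone also leaves it unchanged).
assert np.isclose(k_per(x, y), k_per(x + p, y + p))
assert np.isclose(k_per(x, y), k_per(x + p, y))

print("Stationarity and periodicity checks passed.")
```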
The Smoothness-Flexibility Tradeoff
There is a fundamental tradeoff between smoothness and flexibility. Smoother kernels (like RBF) produce elegant, continuous functions but may miss sharp transitions in the data. Rougher kernels (like Matérn-1/2) can capture discontinuities but may overfit to noise.
The right choice depends on your domain knowledge: prefer smoother kernels (RBF, Matérn-5/2) when you believe the underlying process is genuinely smooth, and rougher kernels (Matérn-3/2, Matérn-1/2) when the data may contain kinks, sharp transitions, or fine-grained texture.
The Squared Exponential kernel, also known as the Radial Basis Function (RBF) kernel or the Gaussian kernel, is the most widely used kernel in Gaussian Process regression. Its popularity stems from its mathematical tractability and the smooth functions it generates.
$$k_{SE}(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\ell^2}\right)$$
where $\sigma_f^2$ is the signal variance, controlling the amplitude of function variations, and $\ell$ is the characteristic length scale, controlling how quickly the correlation decays with distance.
The length scale $\ell$ is measured in the same units as your input features. If $\ell = 10$ and your inputs are in meters, then points more than about 30 meters apart (three length scales) will have nearly independent function values. Points within one length scale are strongly correlated.
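Putting numbers on this rule of thumb, the snippet below simply evaluates the RBF correlation $\exp(-r^2/(2\ell^2))$ at separations of one, two, and three length scales; the full RBF kernel implementation follows afterwards.

```python
import numpy as np

ell = 10.0  # length scale in the same units as the inputs (e.g., meters)

for n in [1, 2, 3]:
    r = n * ell
    corr = np.exp(-r**2 / (2 * ell**2))
    print(f"separation = {n} length scale(s): correlation = {corr:.3f}")

# separation = 1 length scale(s): correlation ~ 0.607
# separation = 2 length scale(s): correlation ~ 0.135
# separation = 3 length scale(s): correlation ~ 0.011
```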
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist


class RBFKernel:
    """
    Squared Exponential (RBF) Kernel Implementation.

    The RBF kernel produces infinitely differentiable sample functions.
    It is the most commonly used kernel in Gaussian Process regression
    due to its mathematical simplicity and smooth function generation.

    Parameters:
    -----------
    length_scale : float
        The characteristic length scale. Larger values produce smoother,
        more slowly varying functions.
    signal_variance : float
        The signal variance σ². Controls the overall amplitude
        of the function variations.
    """

    def __init__(self, length_scale: float = 1.0, signal_variance: float = 1.0):
        self.length_scale = length_scale
        self.signal_variance = signal_variance

    def __call__(self, X1: np.ndarray, X2: np.ndarray) -> np.ndarray:
        """
        Compute the kernel matrix between X1 and X2.

        Parameters:
        -----------
        X1 : np.ndarray of shape (n1, d)
            First set of input points
        X2 : np.ndarray of shape (n2, d)
            Second set of input points

        Returns:
        --------
        K : np.ndarray of shape (n1, n2)
            Kernel matrix where K[i,j] = k(X1[i], X2[j])
        """
        # Compute squared Euclidean distances
        sq_dists = cdist(X1, X2, metric='sqeuclidean')

        # Apply RBF formula
        K = self.signal_variance * np.exp(-sq_dists / (2 * self.length_scale ** 2))

        return K

    def diagonal(self, X: np.ndarray) -> np.ndarray:
        """
        Compute only the diagonal elements k(x_i, x_i).

        For the RBF kernel, all diagonal elements equal signal_variance.
        This is useful for computing variances without full matrix computation.
        """
        return np.full(X.shape[0], self.signal_variance)

    def gradients(self, X1: np.ndarray, X2: np.ndarray) -> dict:
        """
        Compute gradients of the kernel with respect to hyperparameters.

        These gradients are essential for hyperparameter optimization
        via gradient descent on the log marginal likelihood.

        Returns:
        --------
        dict : Dictionary containing:
            - 'length_scale': ∂K/∂ℓ
            - 'signal_variance': ∂K/∂σ²
        """
        K = self(X1, X2)
        sq_dists = cdist(X1, X2, metric='sqeuclidean')

        # Gradient w.r.t. length scale
        # ∂K/∂ℓ = K * (||x - x'||² / ℓ³)
        dK_dl = K * sq_dists / (self.length_scale ** 3)

        # Gradient w.r.t. signal variance
        # ∂K/∂σ² = K / σ²
        dK_dsv = K / self.signal_variance

        return {
            'length_scale': dK_dl,
            'signal_variance': dK_dsv
        }


def visualize_rbf_effect():
    """Visualize the effect of different length scales on GP samples."""
    np.random.seed(42)

    # Test points
    X_test = np.linspace(0, 10, 200).reshape(-1, 1)

    # Different length scales
    length_scales = [0.5, 1.0, 2.0, 5.0]

    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    axes = axes.flatten()

    for ax, ls in zip(axes, length_scales):
        kernel = RBFKernel(length_scale=ls, signal_variance=1.0)
        K = kernel(X_test, X_test)

        # Add small jitter for numerical stability
        K += 1e-8 * np.eye(len(X_test))

        # Draw 3 samples from the GP prior
        L = np.linalg.cholesky(K)
        samples = L @ np.random.randn(len(X_test), 3)

        for i in range(3):
            ax.plot(X_test, samples[:, i], alpha=0.8, linewidth=1.5)

        ax.set_title(f'Length Scale ℓ = {ls}', fontsize=12)
        ax.set_xlabel('x')
        ax.set_ylabel('f(x)')
        ax.grid(True, alpha=0.3)
        ax.set_ylim(-3, 3)

    plt.suptitle('RBF Kernel: Effect of Length Scale on GP Samples', fontsize=14)
    plt.tight_layout()
    plt.savefig('rbf_length_scale_effect.png', dpi=150, bbox_inches='tight')
    plt.show()


if __name__ == "__main__":
    visualize_rbf_effect()
```

Mathematical Properties of the RBF Kernel
Infinite differentiability: Sample functions from an RBF-kernel GP are infinitely differentiable (smooth). This can be both a strength (no jagged edges) and a weakness (may not capture discontinuities).
Universal approximator: The RBF kernel is universal, meaning it can approximate any continuous function on a compact set to arbitrary precision (given enough data).
Spectral density: The Fourier transform of the RBF kernel is also Gaussian, with spectral density: $$S(\omega) \propto \exp\left(-\frac{\ell^2 |\omega|^2}{2}\right)$$ This means functions drawn from an RBF kernel have most of their power at low frequencies.
Eigenfunction decomposition: On a bounded domain with periodic boundary conditions, the eigenfunctions of the RBF kernel (as for any stationary kernel) are sine and cosine functions, and the corresponding eigenvalues decay extremely rapidly.
The infinite smoothness of the RBF kernel can be a liability. Real-world data often contains jumps, kinks, or rough textures that the RBF kernel cannot capture well. If your predictions look 'too smooth' compared to reality, consider the Matérn family instead.
The Matérn kernel family is arguably the most important kernel family for practical Gaussian Process applications. Unlike the RBF kernel, Matérn kernels allow you to control the smoothness of the generated functions through a single parameter $\nu$.
$$k_{Matérn}(r) = \sigma_f^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}r}{\ell}\right)^\nu K_\nu\left(\frac{\sqrt{2\nu}r}{\ell}\right)$$
where $r = \|\mathbf{x} - \mathbf{x}'\|$ is the Euclidean distance between inputs, $\nu > 0$ is the smoothness parameter, $\ell$ is the length scale, $\Gamma$ is the gamma function, and $K_\nu$ is the modified Bessel function of the second kind.
The Smoothness Parameter $\nu$
The parameter $\nu$ determines how many times the function is differentiable: samples from a Matérn-$\nu$ GP are $k$ times (mean-square) differentiable if and only if $\nu > k$. In particular, $\nu = 1/2$ gives continuous but non-differentiable samples, $\nu = 3/2$ gives once-differentiable samples, $\nu = 5/2$ gives twice-differentiable samples, and as $\nu \to \infty$ the Matérn kernel recovers the infinitely smooth RBF kernel.
In practice, three special cases are used almost exclusively because they have simple closed forms:
Matérn-1/2 (Exponential Kernel): $$k_{1/2}(r) = \sigma_f^2 \exp\left(-\frac{r}{\ell}\right)$$ Produces rough, non-differentiable functions (continuous but nowhere differentiable, like Brownian motion).
Matérn-3/2: $$k_{3/2}(r) = \sigma_f^2 \left(1 + \frac{\sqrt{3}r}{\ell}\right) \exp\left(-\frac{\sqrt{3}r}{\ell}\right)$$ Produces functions that are once differentiable but not twice. This is often recommended as a default.
Matérn-5/2: $$k_{5/2}(r) = \sigma_f^2 \left(1 + \frac{\sqrt{5}r}{\ell} + \frac{5r^2}{3\ell^2}\right) \exp\left(-\frac{\sqrt{5}r}{\ell}\right)$$ Produces functions that are twice differentiable. Smooth but not infinitely so.
```python
import numpy as np
from scipy.spatial.distance import cdist


class MaternKernel:
    """
    Matérn Kernel Family Implementation.

    The Matérn kernels provide a continuous parameterization of smoothness.
    In practice, we use ν ∈ {1/2, 3/2, 5/2} for computational efficiency.

    Parameters:
    -----------
    nu : float
        Smoothness parameter. Must be one of [0.5, 1.5, 2.5]
    length_scale : float
        Characteristic length scale
    signal_variance : float
        Signal variance σ²
    """

    def __init__(self, nu: float = 2.5, length_scale: float = 1.0,
                 signal_variance: float = 1.0):
        if nu not in [0.5, 1.5, 2.5]:
            raise ValueError("nu must be 0.5, 1.5, or 2.5 for efficient computation")
        self.nu = nu
        self.length_scale = length_scale
        self.signal_variance = signal_variance

    def __call__(self, X1: np.ndarray, X2: np.ndarray) -> np.ndarray:
        """Compute the Matérn kernel matrix."""
        # Compute Euclidean distances
        dists = cdist(X1, X2, metric='euclidean')

        if self.nu == 0.5:
            # Matérn-1/2 (Exponential kernel)
            K = self._matern_1_2(dists)
        elif self.nu == 1.5:
            # Matérn-3/2
            K = self._matern_3_2(dists)
        elif self.nu == 2.5:
            # Matérn-5/2
            K = self._matern_5_2(dists)

        return self.signal_variance * K

    def _matern_1_2(self, dists: np.ndarray) -> np.ndarray:
        """
        Matérn-1/2: k(r) = exp(-r/ℓ)

        This is equivalent to the Ornstein-Uhlenbeck process.
        Produces continuous but not differentiable functions.
        """
        return np.exp(-dists / self.length_scale)

    def _matern_3_2(self, dists: np.ndarray) -> np.ndarray:
        """
        Matérn-3/2: k(r) = (1 + √3r/ℓ) * exp(-√3r/ℓ)

        Produces once-differentiable functions.
        A good default choice for many applications.
        """
        sqrt3_r_over_l = np.sqrt(3) * dists / self.length_scale
        return (1 + sqrt3_r_over_l) * np.exp(-sqrt3_r_over_l)

    def _matern_5_2(self, dists: np.ndarray) -> np.ndarray:
        """
        Matérn-5/2: k(r) = (1 + √5r/ℓ + 5r²/(3ℓ²)) * exp(-√5r/ℓ)

        Produces twice-differentiable functions.
        Good for modeling physical systems.
        """
        sqrt5_r_over_l = np.sqrt(5) * dists / self.length_scale
        return (1 + sqrt5_r_over_l + 5 * dists**2 / (3 * self.length_scale**2)) * \
            np.exp(-sqrt5_r_over_l)


def compare_matern_smoothness():
    """Visualize the effect of smoothness parameter on function samples."""
    import matplotlib.pyplot as plt

    np.random.seed(42)
    X_test = np.linspace(0, 10, 200).reshape(-1, 1)

    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    for ax, nu in zip(axes, [0.5, 1.5, 2.5]):
        kernel = MaternKernel(nu=nu, length_scale=1.0, signal_variance=1.0)
        K = kernel(X_test, X_test)
        K += 1e-6 * np.eye(len(X_test))  # Numerical stability

        L = np.linalg.cholesky(K)
        samples = L @ np.random.randn(len(X_test), 3)

        for i in range(3):
            ax.plot(X_test, samples[:, i], alpha=0.8, linewidth=1.5)

        ax.set_title(f'Matérn-ν={nu}', fontsize=12)
        ax.set_xlabel('x')
        ax.set_ylabel('f(x)')
        ax.grid(True, alpha=0.3)

    plt.suptitle('Matérn Kernel: Effect of Smoothness Parameter ν', fontsize=14)
    plt.tight_layout()
    plt.savefig('matern_smoothness_comparison.png', dpi=150, bbox_inches='tight')
    plt.show()


if __name__ == "__main__":
    compare_matern_smoothness()
```

Many practitioners recommend Matérn-3/2 as the default kernel instead of RBF. It provides a good balance between smoothness and flexibility, can capture moderate roughness in data, and avoids the pathologically smooth behavior of the RBF kernel. Carl Rasmussen (author of the canonical GP textbook) often advocates for this choice.
Many real-world phenomena exhibit periodic patterns: daily temperature cycles, weekly sales patterns, annual seasonality, heartbeat rhythms. To capture these patterns, we need kernels that encode periodicity.
The Standard Periodic Kernel
$$k_{Per}(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\left(-\frac{2\sin^2(\pi\|\mathbf{x} - \mathbf{x}'\|/p)}{\ell^2}\right)$$
where $p$ is the period, $\ell$ is the length scale controlling smoothness within each period, and $\sigma_f^2$ is the signal variance.
This kernel produces functions that are exactly periodic with period $p$. The function value at $x$ equals the function value at $x + p$ (up to random variation).
```python
import numpy as np
from scipy.spatial.distance import cdist


class PeriodicKernel:
    """
    Periodic Kernel for modeling cyclical phenomena.

    This kernel produces functions that repeat with a specified period.
    Useful for: seasonal effects, daily/weekly patterns, physical oscillations.

    Parameters:
    -----------
    period : float
        The period of the function (in input units)
    length_scale : float
        Controls smoothness within each period
    signal_variance : float
        Controls amplitude of variations
    """

    def __init__(self, period: float = 1.0, length_scale: float = 1.0,
                 signal_variance: float = 1.0):
        self.period = period
        self.length_scale = length_scale
        self.signal_variance = signal_variance

    def __call__(self, X1: np.ndarray, X2: np.ndarray) -> np.ndarray:
        """Compute the periodic kernel matrix."""
        # Compute pairwise differences
        if X1.ndim == 1:
            X1 = X1.reshape(-1, 1)
        if X2.ndim == 1:
            X2 = X2.reshape(-1, 1)

        # For 1D inputs, compute |x - x'|
        diffs = cdist(X1, X2, metric='euclidean')

        # Periodic kernel formula
        arg = np.pi * diffs / self.period
        K = self.signal_variance * np.exp(
            -2 * np.sin(arg)**2 / self.length_scale**2
        )

        return K


class LocallyPeriodicKernel:
    """
    Locally Periodic Kernel: periodicity that decays with distance.

    This is a product of the Periodic kernel and the RBF kernel:
        k_lp = k_per × k_rbf

    Produces functions that are approximately periodic but allow the
    pattern to vary slowly over time (e.g., seasonal patterns in climate
    that drift over decades).
    """

    def __init__(self, period: float = 1.0, periodic_ls: float = 1.0,
                 decay_ls: float = 10.0, signal_variance: float = 1.0):
        self.period = period
        self.periodic_ls = periodic_ls
        self.decay_ls = decay_ls
        self.signal_variance = signal_variance

    def __call__(self, X1: np.ndarray, X2: np.ndarray) -> np.ndarray:
        """Compute the locally periodic kernel matrix."""
        if X1.ndim == 1:
            X1 = X1.reshape(-1, 1)
        if X2.ndim == 1:
            X2 = X2.reshape(-1, 1)

        diffs = cdist(X1, X2, metric='euclidean')
        sq_diffs = diffs ** 2

        # Periodic component
        arg = np.pi * diffs / self.period
        K_per = np.exp(-2 * np.sin(arg)**2 / self.periodic_ls**2)

        # RBF decay component
        K_rbf = np.exp(-sq_diffs / (2 * self.decay_ls**2))

        return self.signal_variance * K_per * K_rbf


def demonstrate_periodic_kernels():
    """Visualize samples from periodic kernels."""
    import matplotlib.pyplot as plt

    np.random.seed(42)
    X_test = np.linspace(0, 10, 300).reshape(-1, 1)

    fig, axes = plt.subplots(2, 2, figsize=(14, 10))

    # Pure periodic kernel
    kernel1 = PeriodicKernel(period=2.0, length_scale=0.5)
    K1 = kernel1(X_test, X_test) + 1e-6 * np.eye(len(X_test))
    L1 = np.linalg.cholesky(K1)
    samples1 = L1 @ np.random.randn(len(X_test), 2)

    axes[0, 0].plot(X_test, samples1, linewidth=1.5)
    axes[0, 0].set_title('Pure Periodic Kernel (period=2)', fontsize=12)
    axes[0, 0].axvline(x=2, color='gray', linestyle='--', alpha=0.5)
    axes[0, 0].axvline(x=4, color='gray', linestyle='--', alpha=0.5)
    axes[0, 0].axvline(x=6, color='gray', linestyle='--', alpha=0.5)
    axes[0, 0].axvline(x=8, color='gray', linestyle='--', alpha=0.5)
    axes[0, 0].set_xlabel('x')
    axes[0, 0].set_ylabel('f(x)')
    axes[0, 0].grid(True, alpha=0.3)

    # Locally periodic kernel
    kernel2 = LocallyPeriodicKernel(period=2.0, periodic_ls=0.5, decay_ls=4.0)
    K2 = kernel2(X_test, X_test) + 1e-6 * np.eye(len(X_test))
    L2 = np.linalg.cholesky(K2)
    samples2 = L2 @ np.random.randn(len(X_test), 2)

    axes[0, 1].plot(X_test, samples2, linewidth=1.5)
    axes[0, 1].set_title('Locally Periodic Kernel (pattern decays)', fontsize=12)
    axes[0, 1].set_xlabel('x')
    axes[0, 1].set_ylabel('f(x)')
    axes[0, 1].grid(True, alpha=0.3)

    # Effect of period parameter
    for period in [1.0, 3.0]:
        kernel = PeriodicKernel(period=period, length_scale=0.5)
        K = kernel(X_test, X_test) + 1e-6 * np.eye(len(X_test))
        L = np.linalg.cholesky(K)
        sample = (L @ np.random.randn(len(X_test), 1)).flatten()
        axes[1, 0].plot(X_test, sample, linewidth=1.5, label=f'period={period}')

    axes[1, 0].set_title('Effect of Period Parameter', fontsize=12)
    axes[1, 0].set_xlabel('x')
    axes[1, 0].set_ylabel('f(x)')
    axes[1, 0].legend()
    axes[1, 0].grid(True, alpha=0.3)

    # Effect of length scale in periodic kernel
    for ls in [0.3, 0.7, 1.5]:
        kernel = PeriodicKernel(period=2.0, length_scale=ls)
        K = kernel(X_test, X_test) + 1e-6 * np.eye(len(X_test))
        L = np.linalg.cholesky(K)
        sample = (L @ np.random.randn(len(X_test), 1)).flatten()
        axes[1, 1].plot(X_test, sample, linewidth=1.5, label=f'length_scale={ls}')

    axes[1, 1].set_title('Effect of Length Scale (within period)', fontsize=12)
    axes[1, 1].set_xlabel('x')
    axes[1, 1].set_ylabel('f(x)')
    axes[1, 1].legend()
    axes[1, 1].grid(True, alpha=0.3)

    plt.tight_layout()
    plt.savefig('periodic_kernels.png', dpi=150, bbox_inches='tight')
    plt.show()


if __name__ == "__main__":
    demonstrate_periodic_kernels()
```

Non-Stationary Kernels
All the kernels we've discussed so far are stationary: they depend only on the difference between inputs, not their absolute positions. But many real-world problems exhibit non-stationary behavior, where the function is smooth in some regions and rough in others, or where its amplitude and length scale change across the input space.
Common approaches to non-stationarity include warping the inputs through a fixed nonlinear transformation before applying a stationary kernel, letting the length scale itself vary with location, switching between kernels at change points, and using inherently non-stationary kernels such as the linear and polynomial kernels discussed next; the sketch below illustrates the input-warping idea.
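As a hedged illustration of the warping approach, the sketch below builds a non-stationary kernel by applying the RBF kernel to warped inputs. The warp $w(x) = \log(1 + x)$ and the helper name `warped_rbf` are hypothetical choices made purely for demonstration; because the result is just an RBF kernel evaluated at transformed points, it remains a valid PSD kernel, yet the same input separation produces different correlations in different parts of the input space.

```python
import numpy as np
from scipy.spatial.distance import cdist


def warped_rbf(X1: np.ndarray, X2: np.ndarray,
               length_scale: float = 0.5, signal_variance: float = 1.0) -> np.ndarray:
    """
    Non-stationary kernel built by warping inputs before a stationary RBF:
        k(x, x') = k_rbf(w(x), w(x'))
    The warp w(x) = log(1 + x) is an illustrative assumption (x >= 0).
    """
    W1 = np.log1p(X1)
    W2 = np.log1p(X2)
    sq_dists = cdist(W1, W2, metric='sqeuclidean')
    return signal_variance * np.exp(-sq_dists / (2 * length_scale ** 2))


# Two pairs of points, each separated by 1 unit, at different locations:
X_near_zero = np.array([[0.0], [1.0]])   # warp is steep here -> faster decay
X_far_out = np.array([[20.0], [21.0]])   # warp is nearly flat here -> slower decay

print(warped_rbf(X_near_zero, X_near_zero)[0, 1])  # lower correlation
print(warped_rbf(X_far_out, X_far_out)[0, 1])      # higher correlation
```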
While stationary kernels like RBF and Matérn are widely used, polynomial kernels offer a different perspective. They are non-stationary and can capture global trends and polynomial relationships in the data.
The Linear Kernel
$$k_{Lin}(\mathbf{x}, \mathbf{x}') = \sigma_b^2 + \sigma_v^2 (\mathbf{x} - c)(\mathbf{x}' - c)$$
where $\sigma_b^2$ is the bias variance (the uncertainty in the function value at $\mathbf{x} = c$), $\sigma_v^2$ is the slope variance, and $c$ is a reference offset (often 0).
The linear kernel produces linear functions (straight lines in 1D, hyperplanes in higher dimensions). It's equivalent to Bayesian linear regression.
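One way to see the equivalence: take a Bayesian linear model $f(x) = b + w(x - c)$ with independent priors $b \sim \mathcal{N}(0, \sigma_b^2)$ and $w \sim \mathcal{N}(0, \sigma_v^2)$. Its covariance function is exactly the linear kernel:

$$\text{Cov}(f(x), f(x')) = \mathbb{E}[b^2] + \mathbb{E}[w^2]\,(x - c)(x' - c) = \sigma_b^2 + \sigma_v^2 (x - c)(x' - c) = k_{Lin}(x, x')$$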
The Polynomial Kernel
$$k_{Poly}(\mathbf{x}, \mathbf{x}') = (\sigma^2 + \mathbf{x}^T\mathbf{x}')^d$$
where $d$ is the polynomial degree. This produces functions that are polynomials of degree $d$.
Many classical machine learning models can be viewed as GPs with specific kernels:

- Linear regression → GP with linear kernel
- Polynomial regression → GP with polynomial kernel
- Neural networks (infinite width) → GP with specific kernels (Neal, 1996)
This connection provides a unified framework for understanding different models.
```python
import numpy as np


class LinearKernel:
    """
    Linear Kernel: produces linear (straight-line) functions.

    This kernel is equivalent to Bayesian linear regression. Combining
    it with other kernels (e.g., RBF) produces functions with linear
    trends plus local variations.

    Parameters:
    -----------
    variance : float
        The signal variance controlling the slope magnitude
    offset : float
        Constant offset (c parameter, typically 0)
    """

    def __init__(self, variance: float = 1.0, offset: float = 0.0):
        self.variance = variance
        self.offset = offset

    def __call__(self, X1: np.ndarray, X2: np.ndarray) -> np.ndarray:
        """Compute the linear kernel matrix."""
        if X1.ndim == 1:
            X1 = X1.reshape(-1, 1)
        if X2.ndim == 1:
            X2 = X2.reshape(-1, 1)

        # Center the data
        X1_centered = X1 - self.offset
        X2_centered = X2 - self.offset

        # Linear kernel is just the dot product
        K = self.variance * (X1_centered @ X2_centered.T)

        return K


class PolynomialKernel:
    """
    Polynomial Kernel: produces polynomial functions of degree d.

    Parameters:
    -----------
    degree : int
        The polynomial degree
    scale : float
        Scale parameter σ²
    offset : float
        Offset parameter (often 1.0)
    """

    def __init__(self, degree: int = 2, scale: float = 1.0, offset: float = 1.0):
        self.degree = degree
        self.scale = scale
        self.offset = offset

    def __call__(self, X1: np.ndarray, X2: np.ndarray) -> np.ndarray:
        """Compute the polynomial kernel matrix."""
        if X1.ndim == 1:
            X1 = X1.reshape(-1, 1)
        if X2.ndim == 1:
            X2 = X2.reshape(-1, 1)

        # (σ² + x·x')^d
        dot_product = X1 @ X2.T
        K = (self.scale + dot_product) ** self.degree

        return K


def compare_polynomial_degrees():
    """Visualize samples from polynomial kernels of different degrees."""
    import matplotlib.pyplot as plt

    np.random.seed(42)
    X_test = np.linspace(-2, 2, 200).reshape(-1, 1)

    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    titles = ['Linear (d=1)', 'Quadratic (d=2)', 'Cubic (d=3)', 'Quartic (d=4)']

    for ax, degree, title in zip(axes.flatten(), [1, 2, 3, 4], titles):
        kernel = PolynomialKernel(degree=degree, scale=1.0)
        K = kernel(X_test, X_test)

        # Add jitter for stability
        K += 1e-4 * np.eye(len(X_test))

        # Cholesky may fail for high-degree polynomials due to numerical issues
        try:
            L = np.linalg.cholesky(K)
            samples = L @ np.random.randn(len(X_test), 3)
        except np.linalg.LinAlgError:
            # Use eigendecomposition instead
            eigvals, eigvecs = np.linalg.eigh(K)
            eigvals = np.maximum(eigvals, 1e-10)
            samples = eigvecs @ np.diag(np.sqrt(eigvals)) @ np.random.randn(len(X_test), 3)

        for i in range(3):
            ax.plot(X_test, samples[:, i], alpha=0.8, linewidth=1.5)

        ax.set_title(f'Polynomial Kernel ({title})', fontsize=12)
        ax.set_xlabel('x')
        ax.set_ylabel('f(x)')
        ax.grid(True, alpha=0.3)

    plt.suptitle('Polynomial Kernel: Effect of Degree on Function Shape', fontsize=14)
    plt.tight_layout()
    plt.savefig('polynomial_kernel_degrees.png', dpi=150, bbox_inches='tight')
    plt.show()


if __name__ == "__main__":
    compare_polynomial_degrees()
```

The Rational Quadratic (RQ) kernel can be viewed as a scale mixture of RBF kernels with different length scales. This makes it useful when you're uncertain about the correct length scale, or when the function exhibits variation at multiple scales simultaneously.
$$k_{RQ}(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \left(1 + \frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\alpha\ell^2}\right)^{-\alpha}$$
where $\alpha > 0$ controls the mixture of length scales, $\ell$ is the (central) length scale, and $\sigma_f^2$ is the signal variance.
The parameter $\alpha$ controls the relative weighting of different length scales:

- $\alpha \to \infty$: the RQ kernel converges to the RBF kernel
- Small $\alpha$: the kernel gives more weight to larger-scale correlations
- Large $\alpha$: the kernel behaves more like the standard RBF
This makes the RQ kernel more flexible than the pure RBF kernel while maintaining smoothness.
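The limiting behavior as $\alpha \to \infty$ is easy to verify numerically. The small sketch below uses standalone helper functions (`k_rbf`, `k_rq`, written just for this check, not part of any class above) to measure the maximum gap between the two kernels over a grid of distances.

```python
import numpy as np


def k_rbf(r, ell=1.0, sigma2=1.0):
    """RBF kernel as a function of distance r."""
    return sigma2 * np.exp(-r**2 / (2 * ell**2))


def k_rq(r, alpha, ell=1.0, sigma2=1.0):
    """Rational Quadratic kernel as a function of distance r."""
    return sigma2 * (1 + r**2 / (2 * alpha * ell**2)) ** (-alpha)


r = np.linspace(0, 5, 200)
for alpha in [0.5, 2.0, 10.0, 1000.0]:
    max_gap = np.max(np.abs(k_rq(r, alpha) - k_rbf(r)))
    print(f"alpha = {alpha:7.1f}: max |k_RQ - k_RBF| = {max_gap:.4f}")
# The gap shrinks toward zero as alpha grows, matching the alpha -> infinity limit.
```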
Mathematical Insight: The Scale Mixture Interpretation
The RQ kernel can be derived as an infinite mixture of RBF kernels:
$$k_{RQ}(r) = \int_0^\infty k_{RBF}(r; \ell') \, p(\ell' \mid \alpha, \ell) \, d\ell'$$
where the length scales $\ell'$ are drawn from an inverse Gamma distribution. This means the RQ kernel automatically considers a range of possible length scales, weighted according to the inverse Gamma prior.
This property is particularly useful when your data exhibits variation at several length scales simultaneously (for example, short-term fluctuations superimposed on long-term trends), or when you are genuinely uncertain which single length scale is appropriate.
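A small Monte Carlo sketch of the mixture interpretation follows. It re-parameterizes the mixture in terms of the precision $\tau = 1/\ell'^2$ (an equivalent, numerically convenient assumption: the inverse Gamma prior on $\ell'^2$ becomes a Gamma prior on $\tau$ with shape $\alpha$ and rate $\alpha\ell^2$), averages RBF kernels over sampled precisions, and compares the result with the closed-form RQ kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, ell, sigma2 = 2.0, 1.0, 1.0

# Sample precisions tau = 1/ell'^2 from Gamma(shape=alpha, rate=alpha*ell^2).
# (numpy parameterizes Gamma by shape and scale = 1/rate.)
tau = rng.gamma(shape=alpha, scale=1.0 / (alpha * ell**2), size=200_000)

r = np.linspace(0, 4, 9)

# Monte Carlo average of RBF kernels exp(-tau * r^2 / 2) over sampled precisions
k_mix = sigma2 * np.mean(np.exp(-np.outer(r**2, tau) / 2), axis=1)

# Closed-form Rational Quadratic kernel for comparison
k_rq = sigma2 * (1 + r**2 / (2 * alpha * ell**2)) ** (-alpha)

for ri, km, kq in zip(r, k_mix, k_rq):
    print(f"r = {ri:.1f}: mixture average = {km:.4f}, RQ closed form = {kq:.4f}")
```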
Designing the right kernel is both an art and a science. It requires balancing mathematical constraints with domain knowledge and empirical validation.
You now understand the foundational principles of kernel design for Gaussian Processes. You can recognize the mathematical requirements for valid kernels, understand what properties different kernels encode, and make informed choices based on domain knowledge. Next, we'll explore how to combine kernels to create more expressive models that capture complex patterns in data.