One of the most powerful features of kernel-based methods is that kernels can be combined to create new, more expressive kernels. Just as we build complex programs from simple functions, we can build complex statistical models from simple kernel building blocks.
The key insight is that certain operations on valid kernels produce new valid kernels. This allows us to compose kernels that capture several kinds of structure at once, such as long-term trends, periodic patterns, multi-scale variation, and independent noise.
By the end of this page, you will understand the mathematical rules for combining kernels, master the art of kernel addition and multiplication, and be able to design composite kernels that encode complex domain knowledge into your Gaussian Process models.
Valid kernels form a closed algebra under certain operations. This means combining valid kernels in specific ways always produces another valid kernel.
Fundamental Kernel Combination Rules:
Let $k_1$ and $k_2$ be valid (positive semi-definite) kernels. Then:
Sum: $k(x, x') = k_1(x, x') + k_2(x, x')$ is a valid kernel
Product: $k(x, x') = k_1(x, x') \cdot k_2(x, x')$ is a valid kernel
Scalar multiplication: $k(x, x') = c \cdot k_1(x, x')$ is valid for any $c > 0$
Exponentiation: $k(x, x') = \exp(k_1(x, x'))$ is valid if $k_1$ is valid
Subtracting kernels ($k_1 - k_2$) or dividing kernels ($k_1 / k_2$) does NOT generally produce a valid kernel. The resulting matrix may not be positive semi-definite. Always use addition and multiplication for kernel composition.
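To make these rules concrete, here is a small numerical check (a sketch using NumPy; the specific kernels and evaluation points are illustrative choices, not from the text above) showing that the sum and element-wise product of two kernel matrices stay positive semi-definite, while their difference can fail:

```python
import numpy as np

def rbf(X, length_scale=1.0):
    """RBF kernel matrix for 1-D inputs X."""
    sq_dist = (X[:, None] - X[None, :]) ** 2
    return np.exp(-sq_dist / (2 * length_scale**2))

def linear(X):
    """Linear kernel matrix for 1-D inputs X."""
    return np.outer(X, X)

X = np.linspace(-2, 2, 20)
K1, K2 = rbf(X), linear(X)

def min_eig(K):
    return np.linalg.eigvalsh(K).min()

print(min_eig(K1 + K2))   # >= 0 (up to floating-point error): sum is PSD
print(min_eig(K1 * K2))   # >= 0: element-wise product is PSD (Schur product theorem)
print(min_eig(K1 - K2))   # can be negative: subtraction is not a valid kernel operation
```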
Proof Sketch for Sum of Kernels:
If $K_1$ and $K_2$ are positive semi-definite matrices, then for any vector $\mathbf{v}$:
$$\mathbf{v}^T(K_1 + K_2)\mathbf{v} = \mathbf{v}^TK_1\mathbf{v} + \mathbf{v}^TK_2\mathbf{v} \geq 0 + 0 = 0$$
Since both terms are non-negative (by PSD property), their sum is also non-negative.
Proof Sketch for Product of Kernels (Schur Product Theorem):
The element-wise (Hadamard) product of two PSD matrices is also PSD; this is the Schur product theorem. One way to see it: write $K_1$ and $K_2$ as covariance matrices of independent zero-mean random vectors $\mathbf{u}$ and $\mathbf{v}$. Then $(K_1)_{ij}(K_2)_{ij} = \mathbb{E}[u_i u_j]\,\mathbb{E}[v_i v_j] = \text{Cov}(u_i v_i, u_j v_j)$, so the Hadamard product is itself a covariance matrix and therefore PSD.
When we add two kernels, we model the target function as a sum of independent components:
$$f(x) = f_1(x) + f_2(x)$$
where $f_1 \sim \mathcal{GP}(0, k_1)$ and $f_2 \sim \mathcal{GP}(0, k_2)$ are independent.
The covariance of the sum is: $$\text{Cov}(f(x), f(x')) = \text{Cov}(f_1(x) + f_2(x), f_1(x') + f_2(x'))$$ $$= k_1(x, x') + k_2(x, x')$$
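As a quick sanity check of this identity, the sketch below (with arbitrarily chosen RBF and linear kernel matrices) draws many independent samples $f_1 \sim \mathcal{GP}(0, k_1)$ and $f_2 \sim \mathcal{GP}(0, k_2)$ and confirms that the empirical covariance of $f_1 + f_2$ approaches $k_1 + k_2$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(-2, 2, 15)

K1 = np.exp(-(X[:, None] - X[None, :])**2 / 2)   # RBF kernel matrix
K2 = 0.3 * np.outer(X, X)                        # scaled linear kernel matrix
jitter = 1e-9 * np.eye(len(X))

# Independent draws f1 ~ GP(0, k1) and f2 ~ GP(0, k2) at the points X
n_samples = 200_000
f1 = rng.multivariate_normal(np.zeros(len(X)), K1 + jitter, size=n_samples)
f2 = rng.multivariate_normal(np.zeros(len(X)), K2 + jitter, size=n_samples)
f = f1 + f2

emp_cov = np.cov(f, rowvar=False)
print(np.abs(emp_cov - (K1 + K2)).max())  # small; shrinks as n_samples grows
```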
| Combination | Interpretation | Use Case |
|---|---|---|
| $k_{RBF} + k_{Lin}$ | Smooth variations + linear trend | Data with underlying linear growth |
| $k_{Per} + k_{RBF}$ | Periodic component + aperiodic noise | Seasonal data with irregular variations |
| $k_{\ell_1} + k_{\ell_2}$ | Multi-scale: short + long range | Functions with detail at multiple scales |
| $k + \sigma_n^2 I$ | Signal + independent noise | Standard GP regression noise model |
```python
import numpy as np
from scipy.spatial.distance import cdist

class SumKernel:
    """Sum of two kernels: k(x,x') = k1(x,x') + k2(x,x')"""
    def __init__(self, kernel1, kernel2):
        self.kernel1 = kernel1
        self.kernel2 = kernel2

    def __call__(self, X1, X2):
        return self.kernel1(X1, X2) + self.kernel2(X1, X2)

class RBFKernel:
    def __init__(self, length_scale=1.0, variance=1.0):
        self.length_scale = length_scale
        self.variance = variance

    def __call__(self, X1, X2):
        sq_dist = cdist(X1.reshape(-1, 1), X2.reshape(-1, 1), 'sqeuclidean')
        return self.variance * np.exp(-sq_dist / (2 * self.length_scale**2))

class LinearKernel:
    def __init__(self, variance=1.0):
        self.variance = variance

    def __call__(self, X1, X2):
        return self.variance * np.outer(X1.flatten(), X2.flatten())

# Example: RBF + Linear for trend + variation
k_rbf = RBFKernel(length_scale=1.0, variance=0.5)
k_lin = LinearKernel(variance=0.3)
k_sum = SumKernel(k_rbf, k_lin)

X = np.linspace(-3, 3, 100)
K = k_sum(X, X) + 1e-6 * np.eye(100)
L = np.linalg.cholesky(K)
samples = L @ np.random.randn(100, 3)
# samples now have linear trend + smooth local variations
```

When we multiply kernels, the interpretation is more subtle. The product kernel modulates one function by another, creating interactions between the properties of both kernels.
Key insight: Multiplying by a local kernel (like RBF) makes a global kernel (like Linear or Periodic) become more local in nature.
$$k_{prod}(x, x') = k_1(x, x') \cdot k_2(x, x')$$
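To see this localization numerically, here is a small sketch (the specific points and length scales are illustrative assumptions) comparing the linear kernel with its RBF-modulated product at a nearby pair and a distant pair of inputs:

```python
import numpy as np

def k_lin(x, xp):
    return x * xp

def k_rbf(x, xp, length_scale=1.0):
    return np.exp(-(x - xp)**2 / (2 * length_scale**2))

def k_prod(x, xp):
    return k_lin(x, xp) * k_rbf(x, xp)

# Nearby pair: both kernels report strong covariance
print(k_lin(2.0, 2.1), k_prod(2.0, 2.1))   # ~4.2 vs ~4.18

# Distant pair: the RBF factor suppresses the linear kernel's covariance
print(k_lin(2.0, 8.0), k_prod(2.0, 8.0))   # 16.0 vs ~16 * exp(-18), essentially 0
```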
The Locally Periodic Kernel
The most important example of kernel multiplication is the locally periodic kernel:
$$k_{LP}(x, x') = k_{Per}(x, x') \cdot k_{RBF}(x, x')$$
This produces functions that are locally periodic: they repeat with a regular period over short distances, while the shape and amplitude of the repeating pattern can drift over longer distances set by the RBF length scale.
This is perfect for modeling seasonal patterns that evolve over time, like climate data where seasonal cycles change across decades.
```python
import numpy as np
from scipy.spatial.distance import cdist

class ProductKernel:
    """Product of two kernels: k(x,x') = k1(x,x') * k2(x,x')"""
    def __init__(self, kernel1, kernel2):
        self.kernel1 = kernel1
        self.kernel2 = kernel2

    def __call__(self, X1, X2):
        return self.kernel1(X1, X2) * self.kernel2(X1, X2)

class PeriodicKernel:
    def __init__(self, period=1.0, length_scale=1.0, variance=1.0):
        self.period = period
        self.length_scale = length_scale
        self.variance = variance

    def __call__(self, X1, X2):
        diffs = cdist(X1.reshape(-1, 1), X2.reshape(-1, 1), 'euclidean')
        arg = np.pi * diffs / self.period
        return self.variance * np.exp(-2 * np.sin(arg)**2 / self.length_scale**2)

# Locally Periodic = Periodic × RBF
k_per = PeriodicKernel(period=2.0, length_scale=0.5)
k_rbf = RBFKernel(length_scale=5.0, variance=1.0)
k_locally_periodic = ProductKernel(k_per, k_rbf)

# This produces periodic functions that decay with distance
```

A useful mental model: multiplying by an RBF kernel 'localizes' any other kernel. The RBF acts as a soft window that limits correlations to nearby points. This is why Periodic × RBF gives locally periodic behavior.
By combining addition and multiplication, we can build arbitrarily expressive kernels. A classic example from the Gaussian Process literature is modeling CO₂ concentration data, which exhibits a smooth long-term rising trend, a yearly seasonal cycle that drifts slowly over the years, medium-term irregularities, and observation noise.
This can be modeled as:
$$k = k_{trend} + k_{seasonal} + k_{medium} + k_{noise}$$
where:
- $k_{trend}$ is an RBF kernel with a long length scale, capturing the smooth long-term trend
- $k_{seasonal}$ is a periodic kernel multiplied by an RBF kernel, capturing a yearly cycle that can slowly drift
- $k_{medium}$ is a rational quadratic kernel, capturing medium-term irregularities at multiple scales
- $k_{noise}$ is independent observation noise, added to the diagonal as $\sigma_n^2 I$ during inference
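The composite implementation below relies on a RationalQuadraticKernel class that is not defined in the earlier snippets. A minimal sketch, assuming the standard rational quadratic form $k(x, x') = \sigma^2\left(1 + \frac{(x - x')^2}{2\alpha\ell^2}\right)^{-\alpha}$, might look like this:

```python
import numpy as np
from scipy.spatial.distance import cdist

class RationalQuadraticKernel:
    """Rational quadratic kernel: a scale mixture of RBF kernels."""
    def __init__(self, length_scale=1.0, alpha=1.0, variance=1.0):
        self.length_scale = length_scale
        self.alpha = alpha          # mixture parameter: large alpha approaches RBF
        self.variance = variance

    def __call__(self, X1, X2):
        sq_dist = cdist(X1.reshape(-1, 1), X2.reshape(-1, 1), 'sqeuclidean')
        base = 1 + sq_dist / (2 * self.alpha * self.length_scale**2)
        return self.variance * base ** (-self.alpha)
```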
```python
class CompositeKernel:
    """
    Composite kernel for complex time series modeling.
    Structure: k = k_trend + k_seasonal + k_medium + k_noise
    """
    def __init__(self):
        # Long-term smooth trend (long length scale RBF)
        self.k_trend = RBFKernel(length_scale=50.0, variance=10.0)

        # Seasonal: Periodic × RBF (yearly cycle that can drift)
        self.k_periodic = PeriodicKernel(period=1.0, length_scale=0.5)
        self.k_decay = RBFKernel(length_scale=50.0, variance=1.0)

        # Medium-term irregularities (RQ for multi-scale)
        self.k_medium = RationalQuadraticKernel(
            length_scale=1.0, alpha=1.0, variance=0.5
        )

        # Noise variance
        self.noise_variance = 0.1

    def __call__(self, X1, X2):
        # Trend component
        K_trend = self.k_trend(X1, X2)

        # Locally periodic seasonal component
        K_seasonal = self.k_periodic(X1, X2) * self.k_decay(X1, X2)

        # Medium-term component
        K_medium = self.k_medium(X1, X2)

        # Total kernel (noise added separately during inference)
        return K_trend + K_seasonal + K_medium

    def with_noise(self, X1, X2):
        K = self(X1, X2)
        if X1.shape == X2.shape and np.allclose(X1, X2):
            K += self.noise_variance * np.eye(len(X1))
        return K
```

For multi-dimensional inputs, we can assign different length scales to different dimensions. This is called Automatic Relevance Determination (ARD) and will be covered in depth later.
The ARD-RBF kernel:
$$k_{ARD}(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\left(-\frac{1}{2}\sum_{d=1}^{D}\frac{(x_d - x'_d)^2}{\ell_d^2}\right)$$
Each dimension $d$ has its own length scale $\ell_d$. If $\ell_d$ is large, the function varies slowly in dimension $d$ (low relevance). If $\ell_d$ is small, the function is sensitive to changes in that dimension (high relevance).
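As a concrete illustration, here is a minimal ARD-RBF implementation for $D$-dimensional inputs (a sketch only; the class name and interface mirror the 1-D kernels above and are not from any particular library):

```python
import numpy as np

class ARDRBFKernel:
    """RBF kernel with one length scale per input dimension."""
    def __init__(self, length_scales, variance=1.0):
        self.length_scales = np.asarray(length_scales, dtype=float)  # shape (D,)
        self.variance = variance

    def __call__(self, X1, X2):
        # X1: (N, D), X2: (M, D)
        Z1 = X1 / self.length_scales   # scale each dimension by its own ell_d
        Z2 = X2 / self.length_scales
        sq_dist = ((Z1[:, None, :] - Z2[None, :, :]) ** 2).sum(axis=-1)
        return self.variance * np.exp(-0.5 * sq_dist)

# Dimension 0 is highly relevant (small length scale); dimension 1 barely matters
k_ard = ARDRBFKernel(length_scales=[0.5, 50.0])
X = np.random.randn(10, 2)
K = k_ard(X, X)   # (10, 10) covariance matrix
```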
We can also use different kernels for different dimensions and combine them via products or sums. For example, $k(\mathbf{x}, \mathbf{x}') = k_{RBF}(x_1, x_1') \cdot k_{Per}(x_2, x_2')$ models smooth variation in dimension 1 and periodic behavior in dimension 2, as sketched below.
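A minimal sketch of this per-dimension construction, reusing the RBFKernel, PeriodicKernel, and ProductKernel classes defined earlier (the DimensionKernel wrapper is a hypothetical helper, not part of any library):

```python
import numpy as np

class DimensionKernel:
    """Apply a 1-D kernel to a single input dimension of multi-dimensional data."""
    def __init__(self, kernel, dim):
        self.kernel = kernel
        self.dim = dim

    def __call__(self, X1, X2):
        # Slice out one column and hand it to the underlying 1-D kernel
        return self.kernel(X1[:, self.dim], X2[:, self.dim])

# Smooth in dimension 0, periodic in dimension 1
k_dim0 = DimensionKernel(RBFKernel(length_scale=1.0), dim=0)
k_dim1 = DimensionKernel(PeriodicKernel(period=1.0, length_scale=0.7), dim=1)
k_2d = ProductKernel(k_dim0, k_dim1)

X = np.random.randn(20, 2)   # 20 points in 2-D
K = k_2d(X, X)               # (20, 20) kernel matrix
```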
You now understand how to combine kernels to create expressive models. Next, we'll explore how to optimize the hyperparameters of these kernels using the marginal likelihood.