One of the most powerful features of kernel-based methods is that kernels can be combined to create new, more expressive kernels. Just as we build complex programs from simple functions, we can build complex statistical models from simple kernel building blocks.
The key insight is that certain operations on valid kernels produce new valid kernels. This allows us to compose kernels that capture several kinds of structure at once, such as long-term trends, periodic patterns, multi-scale variation, and independent noise.
By the end of this page, you will understand the mathematical rules for combining kernels, master the art of kernel addition and multiplication, and be able to design composite kernels that encode complex domain knowledge into your Gaussian Process models.
Valid kernels form a closed algebra under certain operations. This means combining valid kernels in specific ways always produces another valid kernel.
Fundamental Kernel Combination Rules:
Let $k_1$ and $k_2$ be valid (positive semi-definite) kernels. Then:
Sum: $k(x, x') = k_1(x, x') + k_2(x, x')$ is a valid kernel
Product: $k(x, x') = k_1(x, x') \cdot k_2(x, x')$ is a valid kernel
Scalar multiplication: $k(x, x') = c \cdot k_1(x, x')$ is valid for any $c > 0$
Exponentiation: $k(x, x') = \exp(k_1(x, x'))$ is valid if $k_1$ is valid
Subtracting kernels ($k_1 - k_2$) or dividing kernels ($k_1 / k_2$) does NOT generally produce a valid kernel. The resulting matrix may not be positive semi-definite. Always use addition and multiplication for kernel composition.
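To make these rules concrete, here is a small numerical check (a sketch using NumPy; the specific kernels and evaluation points are illustrative choices, not from the text above) showing that the sum and element-wise product of two kernel matrices stay positive semi-definite, while their difference can fail:

```python
import numpy as np

def rbf(X, length_scale=1.0):
    """RBF kernel matrix for 1-D inputs X."""
    sq_dist = (X[:, None] - X[None, :]) ** 2
    return np.exp(-sq_dist / (2 * length_scale**2))

def linear(X):
    """Linear kernel matrix for 1-D inputs X."""
    return np.outer(X, X)

X = np.linspace(-2, 2, 20)
K1, K2 = rbf(X), linear(X)

def min_eig(K):
    return np.linalg.eigvalsh(K).min()

print(min_eig(K1 + K2))   # >= 0 (up to floating-point error): sum is PSD
print(min_eig(K1 * K2))   # >= 0: element-wise product is PSD (Schur product theorem)
print(min_eig(K1 - K2))   # can be negative: subtraction is not a valid kernel operation
```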
Proof Sketch for Sum of Kernels:
If $K_1$ and $K_2$ are positive semi-definite matrices, then for any vector $\mathbf{v}$:
$$\mathbf{v}^T(K_1 + K_2)\mathbf{v} = \mathbf{v}^TK_1\mathbf{v} + \mathbf{v}^TK_2\mathbf{v} \geq 0 + 0 = 0$$
Since both terms are non-negative (by PSD property), their sum is also non-negative.
Proof Sketch for Product of Kernels (Schur Product Theorem):
The element-wise (Hadamard) product of two PSD matrices is also PSD; this is the Schur product theorem. One way to see it: write $K_1$ and $K_2$ as covariance matrices of independent zero-mean random vectors $\mathbf{u}$ and $\mathbf{v}$. Then $(K_1)_{ij}(K_2)_{ij} = \mathbb{E}[u_i u_j]\,\mathbb{E}[v_i v_j] = \text{Cov}(u_i v_i, u_j v_j)$, so the Hadamard product is itself a covariance matrix and therefore PSD.
When we add two kernels, we model the target function as a sum of independent components:
$$f(x) = f_1(x) + f_2(x)$$
where $f_1 \sim \mathcal{GP}(0, k_1)$ and $f_2 \sim \mathcal{GP}(0, k_2)$ are independent.
The covariance of the sum is: $$\text{Cov}(f(x), f(x')) = \text{Cov}(f_1(x) + f_2(x), f_1(x') + f_2(x'))$$ $$= k_1(x, x') + k_2(x, x')$$
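As a quick sanity check of this identity, the sketch below (with arbitrarily chosen RBF and linear kernel matrices) draws many independent samples $f_1 \sim \mathcal{GP}(0, k_1)$ and $f_2 \sim \mathcal{GP}(0, k_2)$ and confirms that the empirical covariance of $f_1 + f_2$ approaches $k_1 + k_2$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(-2, 2, 15)

K1 = np.exp(-(X[:, None] - X[None, :])**2 / 2)   # RBF kernel matrix
K2 = 0.3 * np.outer(X, X)                        # scaled linear kernel matrix
jitter = 1e-9 * np.eye(len(X))

# Independent draws f1 ~ GP(0, k1) and f2 ~ GP(0, k2) at the points X
n_samples = 200_000
f1 = rng.multivariate_normal(np.zeros(len(X)), K1 + jitter, size=n_samples)
f2 = rng.multivariate_normal(np.zeros(len(X)), K2 + jitter, size=n_samples)
f = f1 + f2

emp_cov = np.cov(f, rowvar=False)
print(np.abs(emp_cov - (K1 + K2)).max())  # small; shrinks as n_samples grows
```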
| Combination | Interpretation | Use Case |
|---|---|---|
| $k_{RBF} + k_{Lin}$ | Smooth variations + linear trend | Data with underlying linear growth |
| $k_{Per} + k_{RBF}$ | Periodic component + aperiodic noise | Seasonal data with irregular variations |
| $k_{\ell_1} + k_{\ell_2}$ | Multi-scale: short + long range | Functions with detail at multiple scales |
| $k + \sigma_n^2 I$ | Signal + independent noise | Standard GP regression noise model |
```python
import numpy as np
from scipy.spatial.distance import cdist

class SumKernel:
    """Sum of two kernels: k(x,x') = k1(x,x') + k2(x,x')"""
    def __init__(self, kernel1, kernel2):
        self.kernel1 = kernel1
        self.kernel2 = kernel2

    def __call__(self, X1, X2):
        return self.kernel1(X1, X2) + self.kernel2(X1, X2)

class RBFKernel:
    def __init__(self, length_scale=1.0, variance=1.0):
        self.length_scale = length_scale
        self.variance = variance

    def __call__(self, X1, X2):
        sq_dist = cdist(X1.reshape(-1, 1), X2.reshape(-1, 1), 'sqeuclidean')
        return self.variance * np.exp(-sq_dist / (2 * self.length_scale**2))

class LinearKernel:
    def __init__(self, variance=1.0):
        self.variance = variance

    def __call__(self, X1, X2):
        return self.variance * np.outer(X1.flatten(), X2.flatten())

# Example: RBF + Linear for trend + variation
k_rbf = RBFKernel(length_scale=1.0, variance=0.5)
k_lin = LinearKernel(variance=0.3)
k_sum = SumKernel(k_rbf, k_lin)

X = np.linspace(-3, 3, 100)
K = k_sum(X, X) + 1e-6 * np.eye(100)
L = np.linalg.cholesky(K)
samples = L @ np.random.randn(100, 3)
# samples now have linear trend + smooth local variations
```

When we multiply kernels, the interpretation is more subtle. The product kernel modulates one function by another, creating interactions between the properties of both kernels.
Key insight: Multiplying by a local kernel (like RBF) makes a global kernel (like Linear or Periodic) become more local in nature.
$$k_{prod}(x, x') = k_1(x, x') \cdot k_2(x, x')$$
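To see this localization numerically, here is a small sketch (the specific points and length scales are illustrative assumptions) comparing the linear kernel with its RBF-modulated product at a nearby pair and a distant pair of inputs:

```python
import numpy as np

def k_lin(x, xp):
    return x * xp

def k_rbf(x, xp, length_scale=1.0):
    return np.exp(-(x - xp)**2 / (2 * length_scale**2))

def k_prod(x, xp):
    return k_lin(x, xp) * k_rbf(x, xp)

# Nearby pair: both kernels report strong covariance
print(k_lin(2.0, 2.1), k_prod(2.0, 2.1))   # ~4.2 vs ~4.18

# Distant pair: the RBF factor suppresses the linear kernel's covariance
print(k_lin(2.0, 8.0), k_prod(2.0, 8.0))   # 16.0 vs ~16 * exp(-18), essentially 0
```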
The Locally Periodic Kernel
The most important example of kernel multiplication is the locally periodic kernel:
$$k_{LP}(x, x') = k_{Per}(x, x') \cdot k_{RBF}(x, x')$$
This produces functions that are locally periodic: they repeat with a regular period over short distances, while the shape and amplitude of the repeating pattern can drift over longer distances set by the RBF length scale.
This is perfect for modeling seasonal patterns that evolve over time, like climate data where seasonal cycles change across decades.
```python
import numpy as np
from scipy.spatial.distance import cdist

class ProductKernel:
    """Product of two kernels: k(x,x') = k1(x,x') * k2(x,x')"""
    def __init__(self, kernel1, kernel2):
        self.kernel1 = kernel1
        self.kernel2 = kernel2

    def __call__(self, X1, X2):
        return self.kernel1(X1, X2) * self.kernel2(X1, X2)

class PeriodicKernel:
    def __init__(self, period=1.0, length_scale=1.0, variance=1.0):
        self.period = period
        self.length_scale = length_scale
        self.variance = variance

    def __call__(self, X1, X2):
        diffs = cdist(X1.reshape(-1, 1), X2.reshape(-1, 1), 'euclidean')
        arg = np.pi * diffs / self.period
        return self.variance * np.exp(-2 * np.sin(arg)**2 / self.length_scale**2)

# Locally Periodic = Periodic × RBF
k_per = PeriodicKernel(period=2.0, length_scale=0.5)
k_rbf = RBFKernel(length_scale=5.0, variance=1.0)
k_locally_periodic = ProductKernel(k_per, k_rbf)

# This produces periodic functions that decay with distance
```

A useful mental model: multiplying by an RBF kernel 'localizes' any other kernel. The RBF acts as a soft window that limits correlations to nearby points. This is why Periodic × RBF gives locally periodic behavior.
By combining addition and multiplication, we can build arbitrarily expressive kernels. A classic example from the Gaussian Process literature is modeling CO₂ concentration data, which exhibits a smooth long-term rising trend, a yearly seasonal cycle that drifts slowly over the years, medium-term irregularities, and observation noise.
This can be modeled as:
$$k = k_{trend} + k_{seasonal} + k_{medium} + k_{noise}$$
where:
- $k_{trend}$ is an RBF kernel with a long length scale, capturing the smooth long-term trend
- $k_{seasonal}$ is a periodic kernel multiplied by an RBF kernel, capturing a yearly cycle that can slowly drift
- $k_{medium}$ is a rational quadratic kernel, capturing medium-term irregularities at multiple scales
- $k_{noise}$ is independent observation noise, added to the diagonal as $\sigma_n^2 I$ during inference
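The composite implementation below relies on a RationalQuadraticKernel class that is not defined in the earlier snippets. A minimal sketch, assuming the standard rational quadratic form $k(x, x') = \sigma^2\left(1 + \frac{(x - x')^2}{2\alpha\ell^2}\right)^{-\alpha}$, might look like this:

```python
import numpy as np
from scipy.spatial.distance import cdist

class RationalQuadraticKernel:
    """Rational quadratic kernel: a scale mixture of RBF kernels."""
    def __init__(self, length_scale=1.0, alpha=1.0, variance=1.0):
        self.length_scale = length_scale
        self.alpha = alpha          # mixture parameter: large alpha approaches RBF
        self.variance = variance

    def __call__(self, X1, X2):
        sq_dist = cdist(X1.reshape(-1, 1), X2.reshape(-1, 1), 'sqeuclidean')
        base = 1 + sq_dist / (2 * self.alpha * self.length_scale**2)
        return self.variance * base ** (-self.alpha)
```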
```python
class CompositeKernel:
    """
    Composite kernel for complex time series modeling.
    Structure: k = k_trend + k_seasonal + k_medium + k_noise
    """
    def __init__(self):
        # Long-term smooth trend (long length scale RBF)
        self.k_trend = RBFKernel(length_scale=50.0, variance=10.0)

        # Seasonal: Periodic × RBF (yearly cycle that can drift)
        self.k_periodic = PeriodicKernel(period=1.0, length_scale=0.5)
        self.k_decay = RBFKernel(length_scale=50.0, variance=1.0)

        # Medium-term irregularities (RQ for multi-scale)
        self.k_medium = RationalQuadraticKernel(
            length_scale=1.0, alpha=1.0, variance=0.5
        )

        # Noise variance
        self.noise_variance = 0.1

    def __call__(self, X1, X2):
        # Trend component
        K_trend = self.k_trend(X1, X2)

        # Locally periodic seasonal component
        K_seasonal = self.k_periodic(X1, X2) * self.k_decay(X1, X2)

        # Medium-term component
        K_medium = self.k_medium(X1, X2)

        # Total kernel (noise added separately during inference)
        return K_trend + K_seasonal + K_medium

    def with_noise(self, X1, X2):
        K = self(X1, X2)
        if X1.shape == X2.shape and np.allclose(X1, X2):
            K += self.noise_variance * np.eye(len(X1))
        return K
```

For multi-dimensional inputs, we can assign different length scales to different dimensions. This is called Automatic Relevance Determination (ARD) and will be covered in depth later.
The ARD-RBF kernel:
$$k_{ARD}(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\left(-\frac{1}{2}\sum_{d=1}^{D}\frac{(x_d - x'_d)^2}{\ell_d^2}\right)$$
Each dimension $d$ has its own length scale $\ell_d$. If $\ell_d$ is large, the function varies slowly in dimension $d$ (low relevance). If $\ell_d$ is small, the function is sensitive to changes in that dimension (high relevance).
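As a concrete illustration, here is a minimal ARD-RBF implementation for $D$-dimensional inputs (a sketch only; the class name and interface mirror the 1-D kernels above and are not from any particular library):

```python
import numpy as np

class ARDRBFKernel:
    """RBF kernel with one length scale per input dimension."""
    def __init__(self, length_scales, variance=1.0):
        self.length_scales = np.asarray(length_scales, dtype=float)  # shape (D,)
        self.variance = variance

    def __call__(self, X1, X2):
        # X1: (N, D), X2: (M, D)
        Z1 = X1 / self.length_scales   # scale each dimension by its own ell_d
        Z2 = X2 / self.length_scales
        sq_dist = ((Z1[:, None, :] - Z2[None, :, :]) ** 2).sum(axis=-1)
        return self.variance * np.exp(-0.5 * sq_dist)

# Dimension 0 is highly relevant (small length scale); dimension 1 barely matters
k_ard = ARDRBFKernel(length_scales=[0.5, 50.0])
X = np.random.randn(10, 2)
K = k_ard(X, X)   # (10, 10) covariance matrix
```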
We can also use different kernels for different dimensions and combine them via products or sums. For example, $k(\mathbf{x}, \mathbf{x}') = k_{RBF}(x_1, x_1') \cdot k_{Per}(x_2, x_2')$ models smooth variation in dimension 1 and periodic behavior in dimension 2, as sketched below.
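A minimal sketch of this per-dimension construction, reusing the RBFKernel, PeriodicKernel, and ProductKernel classes defined earlier (the DimensionKernel wrapper is a hypothetical helper, not part of any library):

```python
import numpy as np

class DimensionKernel:
    """Apply a 1-D kernel to a single input dimension of multi-dimensional data."""
    def __init__(self, kernel, dim):
        self.kernel = kernel
        self.dim = dim

    def __call__(self, X1, X2):
        # Slice out one column and hand it to the underlying 1-D kernel
        return self.kernel(X1[:, self.dim], X2[:, self.dim])

# Smooth in dimension 0, periodic in dimension 1
k_dim0 = DimensionKernel(RBFKernel(length_scale=1.0), dim=0)
k_dim1 = DimensionKernel(PeriodicKernel(period=1.0, length_scale=0.7), dim=1)
k_2d = ProductKernel(k_dim0, k_dim1)

X = np.random.randn(20, 2)   # 20 points in 2-D
K = k_2d(X, X)               # (20, 20) kernel matrix
```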
You now understand how to combine kernels to create expressive models. Next, we'll explore how to optimize the hyperparameters of these kernels using the marginal likelihood.