Every stationary kernel has a dual representation in the frequency domain. This spectral perspective reveals what kinds of oscillations and patterns a kernel encodes, provides tools for designing kernels with specific frequency content, and underlies powerful methods like the Spectral Mixture Kernel.
The connection is established by Bochner's theorem: a continuous, stationary kernel is positive definite if and only if it is the Fourier transform of a non-negative measure (the spectral density). This isn't just mathematical elegance—it's a practical tool for understanding and designing kernels.
By the end of this page, you will understand: (1) Bochner's theorem and the kernel-spectrum duality, (2) spectral densities of common kernels (RBF, Matérn), (3) the Spectral Mixture Kernel for flexible modeling, (4) how to interpret and design kernels spectrally, (5) connections to random Fourier features for scalable GPs.
Bochner's Theorem provides a complete characterization of valid stationary kernels through their spectral representation.
Theorem (Bochner, 1932)
A continuous function $k: \mathbb{R}^d \to \mathbb{R}$ is a valid positive definite kernel if and only if it can be written as:
$$k(\boldsymbol{\tau}) = \int_{\mathbb{R}^d} e^{2\pi i \boldsymbol{\omega}^\top \boldsymbol{\tau}} S(\boldsymbol{\omega}) \, d\boldsymbol{\omega}$$
where $\boldsymbol{\tau} = \mathbf{x} - \mathbf{x}'$ and $S(\boldsymbol{\omega}) \geq 0$ is the spectral density (a non-negative function integrating to $k(0)$).
Interpretation:
The kernel and spectral density form a Fourier transform pair: k(τ) = F⁻¹[S(ω)] and S(ω) = F[k(τ)]. To find the spectrum of a kernel, compute its Fourier transform. To construct a kernel from a desired spectrum, compute the inverse Fourier transform.
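As a quick sanity check of this duality, the following snippet (a minimal sketch; the length-scale and grid sizes are arbitrary choices) computes a discrete Fourier transform of an RBF kernel evaluated on a lag grid and compares it with the closed-form Gaussian spectrum derived later on this page.

```python
import numpy as np

# Numerical check of the Fourier pair for a unit-variance RBF kernel in 1D:
# k(tau) = exp(-tau^2 / (2 ell^2))  <-->  S(omega) = ell * sqrt(2*pi) * exp(-2 pi^2 ell^2 omega^2)
ell = 0.7
n, dt = 4096, 0.01
tau = (np.arange(n) - n // 2) * dt                        # symmetric lag grid
k = np.exp(-tau**2 / (2 * ell**2))

# Riemann-sum approximation of S(omega) = \int k(tau) exp(-2 pi i omega tau) dtau
S_fft = np.fft.fftshift(np.fft.fft(np.fft.ifftshift(k))).real * dt
omega = np.fft.fftshift(np.fft.fftfreq(n, d=dt))          # ordinary (cycles per unit) frequency

S_exact = ell * np.sqrt(2 * np.pi) * np.exp(-2 * np.pi**2 * ell**2 * omega**2)
print("max |S_fft - S_exact| =", np.max(np.abs(S_fft - S_exact)))   # should be tiny
```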
Why the Spectral View Matters
Kernel design: Want a GP that oscillates at specific frequencies? Define $S(\omega)$ with peaks there.
Understanding smoothness: The decay of $S(\omega)$ at high frequencies determines differentiability.
Approximation methods: Random Fourier Features approximate $k$ by sampling from $S$.
Diagnosing fit: Compare data's empirical spectrum to the model's spectral density.
Combining kernels: Kernel addition = spectral density addition (easy to understand in frequency domain).
RBF (Squared Exponential) Kernel
Kernel: $k(r) = \sigma^2 \exp\left(-\frac{r^2}{2\ell^2}\right)$
Spectral density (1D): $S(\omega) = \sigma^2 \ell \sqrt{2\pi} \exp\left(-2\pi^2 \ell^2 \omega^2\right)$
The RBF has a Gaussian spectrum—it decays extremely rapidly at high frequencies. This is why RBF produces infinitely smooth functions: high-frequency components are exponentially suppressed.
Matérn Kernel
The Matérn-$\nu$ kernel in $d$ dimensions has spectral density:
$$S(\omega) = \sigma^2 \frac{2^d \pi^{d/2} \Gamma(\nu + d/2)(2\nu)^\nu}{\Gamma(\nu)\ell^{2\nu}} \left(\frac{2\nu}{\ell^2} + 4\pi^2\omega^2\right)^{-(\nu + d/2)}$$
This is a Student-t shaped spectrum. It decays polynomially rather than exponentially, as $\omega^{-(2\nu + d)}$ at high frequencies ($\omega^{-(2\nu + 1)}$ in 1D), allowing more high-frequency content. The smaller $\nu$, the slower the decay and the rougher the functions.
| Kernel | Spectral Density Shape | High-Frequency Decay | Function Smoothness |
|---|---|---|---|
| RBF | Gaussian | Exponential (very fast) | C^∞ (infinitely differentiable) |
| Matérn-5/2 | Student-t (df=5) | Polynomial ~ ω^(-6) | C^2 (twice differentiable) |
| Matérn-3/2 | Student-t (df=3) | Polynomial ~ ω^(-4) | C^1 (once differentiable) |
| Matérn-1/2 (Exponential) | Cauchy | Polynomial ~ ω^(-2) | C^0 (continuous only) |
| Periodic | Discrete spikes at harmonics | Spike weights decay very fast | C^∞ (infinitely differentiable) |
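The polynomial decay rates in the table can be checked numerically. The sketch below (using the 1D Matérn spectral density from the formula above; the frequency range is an arbitrary choice) fits a log-log slope to the tail of $S(\omega)$ and compares it with the expected exponent $-(2\nu + 1)$.

```python
import numpy as np
from scipy.special import gamma

def matern_spectrum_1d(omega, length_scale=1.0, nu=2.5, sigma_sq=1.0):
    """1D Matérn spectral density (ordinary-frequency convention)."""
    d = 1
    coef = sigma_sq * (2**d * np.pi**(d/2) * gamma(nu + d/2) * (2*nu)**nu) / \
        (gamma(nu) * length_scale**(2*nu))
    return coef * (2*nu / length_scale**2 + 4*np.pi**2 * omega**2)**(-(nu + d/2))

# Log-log slope of the spectral tail; in 1D it should approach -(2*nu + 1).
omega = np.logspace(1, 3, 50)                       # frequencies deep in the tail
for nu in [0.5, 1.5, 2.5]:
    slope = np.polyfit(np.log(omega), np.log(matern_spectrum_1d(omega, nu=nu)), 1)[0]
    print(f"nu = {nu}: fitted slope {slope:.2f}, expected {-(2*nu + 1):.1f}")
```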
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import gamma

def rbf_spectrum(omega, sigma_sq=1.0, length_scale=1.0):
    """Spectral density of RBF kernel (1D)."""
    return sigma_sq * length_scale * np.sqrt(2*np.pi) * \
        np.exp(-2 * np.pi**2 * length_scale**2 * omega**2)

def matern_spectrum(omega, sigma_sq=1.0, length_scale=1.0, nu=2.5):
    """Spectral density of Matérn kernel (1D)."""
    d = 1  # dimension
    coef = sigma_sq * (2**d * np.pi**(d/2) * gamma(nu + d/2) * (2*nu)**nu) / \
        (gamma(nu) * length_scale**(2*nu))
    return coef * (2*nu/length_scale**2 + 4*np.pi**2*omega**2)**(-(nu + d/2))

def visualize_spectra():
    """Compare spectral densities of different kernels."""
    omega = np.linspace(-3, 3, 500)

    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Left: Different kernels, same length-scale
    axes[0].plot(omega, rbf_spectrum(omega, length_scale=0.5), label='RBF', linewidth=2)
    axes[0].plot(omega, matern_spectrum(omega, length_scale=0.5, nu=2.5), label='Matérn-5/2', linewidth=2)
    axes[0].plot(omega, matern_spectrum(omega, length_scale=0.5, nu=1.5), label='Matérn-3/2', linewidth=2)
    axes[0].plot(omega, matern_spectrum(omega, length_scale=0.5, nu=0.5), label='Matérn-1/2', linewidth=2)
    axes[0].set_title('Spectral Densities (ℓ=0.5)')
    axes[0].set_xlabel('Frequency ω')
    axes[0].set_ylabel('S(ω)')
    axes[0].legend()
    axes[0].set_yscale('log')
    axes[0].grid(True, alpha=0.3)

    # Right: RBF with different length-scales
    for ell in [0.3, 0.5, 1.0, 2.0]:
        axes[1].plot(omega, rbf_spectrum(omega, length_scale=ell), label=f'ℓ={ell}', linewidth=2)
    axes[1].set_title('RBF Spectrum vs Length-Scale')
    axes[1].set_xlabel('Frequency ω')
    axes[1].set_ylabel('S(ω)')
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)

    plt.tight_layout()
    return fig

visualize_spectra()
```

There's an inverse relationship: short length-scale ↔ wide spectrum (more high-frequency content, more wiggly), long length-scale ↔ narrow spectrum (concentrated at low frequencies, smoother). This is the uncertainty principle in action.
The Spectral Mixture (SM) Kernel, introduced by Wilson and Adams (2013), leverages the spectral representation to create an extraordinarily flexible kernel that can approximate any stationary kernel.
Key Idea: Model the spectral density as a mixture of Gaussians:
$$S(\omega) = \sum_{q=1}^{Q} w_q \mathcal{N}(\omega; \mu_q, \sigma_q^2)$$
Each Gaussian component contributes a different frequency band (in practice each component is mirrored at $-\mu_q$ so that $S(\omega)$ is symmetric and the resulting kernel is real-valued). Taking the inverse Fourier transform gives the kernel:
$$k(\tau) = \sum_{q=1}^{Q} w_q \exp(-2\pi^2 \sigma_q^2 \tau^2) \cos(2\pi \mu_q \tau)$$
Interpretation of Parameters:
$w_q$: the weight of component $q$, i.e., how much variance that frequency band contributes (the weights sum to $k(0)$).
$\mu_q$: the center frequency; the corresponding pattern repeats with period $1/\mu_q$.
$\sigma_q^2$: the bandwidth; larger values make correlations at that frequency die off faster with lag (shorter effective length-scale), while $\sigma_q^2 \to 0$ gives a pure cosine component. With $\mu_q = 0$, a component reduces to an RBF-like term.
Since any spectral density can be approximated arbitrarily well by a mixture of Gaussians, the SM kernel can approximate any stationary kernel: with enough components $Q$, it can match arbitrary correlation structures. This makes it well suited to automatic pattern discovery.
```python
import numpy as np
import matplotlib.pyplot as plt

def spectral_mixture_kernel(tau, weights, means, variances):
    """
    Compute the Spectral Mixture kernel.

    Parameters:
    -----------
    tau : array-like
        Lag/distance values
    weights : array of shape (Q,)
        Component weights w_q
    means : array of shape (Q,)
        Component center frequencies μ_q
    variances : array of shape (Q,)
        Component bandwidths σ_q²
    """
    tau = np.atleast_1d(tau)
    K = np.zeros((len(tau), len(tau)))

    for w, mu, sigma_sq in zip(weights, means, variances):
        # Each component: w * exp(-2π²σ²τ²) * cos(2πμτ)
        tau_grid = np.subtract.outer(tau, tau)
        decay = np.exp(-2 * np.pi**2 * sigma_sq * tau_grid**2)
        oscillation = np.cos(2 * np.pi * mu * tau_grid)
        K += w * decay * oscillation

    return K

def demo_spectral_mixture():
    """Demonstrate SM kernel flexibility."""
    X = np.linspace(0, 10, 200)
    n = len(X)

    fig, axes = plt.subplots(2, 3, figsize=(15, 10))

    # Example 1: Single low-frequency component (smooth)
    K1 = spectral_mixture_kernel(X, weights=[1.0], means=[0.0], variances=[0.1])
    K1 += 1e-6 * np.eye(n)

    # Example 2: Single high-frequency (oscillatory)
    K2 = spectral_mixture_kernel(X, weights=[1.0], means=[1.0], variances=[0.05])
    K2 += 1e-6 * np.eye(n)

    # Example 3: Two frequencies (complex pattern)
    K3 = spectral_mixture_kernel(X, weights=[0.5, 0.5], means=[0.3, 1.5], variances=[0.02, 0.02])
    K3 += 1e-6 * np.eye(n)

    kernels = [K1, K2, K3]
    titles = ['Low Frequency (μ=0)', 'High Frequency (μ=1)', 'Two Frequencies (μ=0.3, 1.5)']

    np.random.seed(42)
    for col, (K, title) in enumerate(zip(kernels, titles)):
        # Spectral density visualization
        omega = np.linspace(-2, 2, 200)
        if col == 0:
            S = 1.0 * np.exp(-omega**2 / (2*0.1))
        elif col == 1:
            S = 1.0 * np.exp(-(omega-1)**2 / (2*0.05))
        else:
            S = 0.5 * np.exp(-(omega-0.3)**2 / (2*0.02)) + \
                0.5 * np.exp(-(omega-1.5)**2 / (2*0.02))

        axes[0, col].plot(omega, S, 'b-', linewidth=2)
        axes[0, col].fill_between(omega, 0, S, alpha=0.3)
        axes[0, col].set_title(f'Spectrum: {title}')
        axes[0, col].set_xlabel('Frequency ω')
        axes[0, col].set_ylabel('S(ω)')
        axes[0, col].grid(True, alpha=0.3)

        # GP samples
        L = np.linalg.cholesky(K)
        for i in range(3):
            sample = L @ np.random.randn(n)
            axes[1, col].plot(X, sample, alpha=0.7)
        axes[1, col].set_title('GP Samples')
        axes[1, col].set_xlabel('x')
        axes[1, col].set_ylabel('f(x)')
        axes[1, col].grid(True, alpha=0.3)

    plt.suptitle('Spectral Mixture Kernel: Flexible Pattern Discovery', fontsize=14)
    plt.tight_layout()
    return fig

demo_spectral_mixture()
```

Practical Use of SM Kernels
When to Use SM Kernels:
Pattern discovery: the data shows quasi-periodic or multi-scale structure whose form you do not know in advance.
Extrapolation: you want learned periodicities to continue beyond the training range rather than revert to the mean.
Model mismatch: simpler kernels (RBF, Matérn, Periodic) underfit the correlation structure, and you are willing to invest in hyperparameter optimization (the SM marginal likelihood is multimodal, so initialization matters).
A minimal training sketch follows.
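For real data you would typically not hand-roll the SM kernel and its optimization. The sketch below assumes GPyTorch's `SpectralMixtureKernel` (class and method names follow the GPyTorch spectral mixture tutorial; the toy data and settings are illustrative, not from this page) and trains the mixture parameters by maximizing the marginal likelihood.

```python
import torch
import gpytorch

class SMGPModel(gpytorch.models.ExactGP):
    """Exact GP with a spectral mixture kernel (minimal sketch)."""
    def __init__(self, train_x, train_y, likelihood, num_mixtures=4):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.SpectralMixtureKernel(num_mixtures=num_mixtures)
        # Heuristic initialization of weights/frequencies/bandwidths from the data
        self.covar_module.initialize_from_data(train_x, train_y)

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x))

# Toy data with two superimposed frequencies (0.3 and 1.5 cycles per unit)
train_x = torch.linspace(0, 10, 200)
train_y = (torch.sin(2 * torch.pi * 0.3 * train_x)
           + 0.5 * torch.sin(2 * torch.pi * 1.5 * train_x)
           + 0.1 * torch.randn(train_x.size(0)))

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = SMGPModel(train_x, train_y, likelihood)

model.train()
likelihood.train()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

for _ in range(200):                     # maximize the marginal likelihood
    optimizer.zero_grad()
    loss = -mll(model(train_x), train_y)
    loss.backward()
    optimizer.step()
```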
The spectral representation enables efficient approximation of GP kernels via Random Fourier Features (RFF), a technique that reduces the O(n³) cost of exact GPs.
The Key Insight
By Bochner's theorem and the inverse Fourier transform:
$$k(\mathbf{x} - \mathbf{x}') = \int S(\boldsymbol{\omega}) e^{i\boldsymbol{\omega}^\top(\mathbf{x} - \mathbf{x}')} d\boldsymbol{\omega} = \mathbb{E}_{\boldsymbol{\omega} \sim p}\left[\phi_{\boldsymbol{\omega}}(\mathbf{x})^* \phi_{\boldsymbol{\omega}}(\mathbf{x}')\right]$$
where $p(\boldsymbol{\omega}) \propto S(\boldsymbol{\omega})$ and $\phi_{\boldsymbol{\omega}}(\mathbf{x}) = e^{i\boldsymbol{\omega}^\top\mathbf{x}}$. (In this section the $2\pi$ is absorbed into $\boldsymbol{\omega}$, so $\boldsymbol{\omega}$ is an angular frequency; for a unit-variance RBF kernel this makes $p(\boldsymbol{\omega}) = \mathcal{N}(\mathbf{0}, \ell^{-2}\mathbf{I})$, which is why the code below samples standard normals scaled by $1/\ell$.)
Monte Carlo Approximation
Sample $D$ frequencies $\boldsymbol{\omega}_1, ..., \boldsymbol{\omega}_D \sim p(\boldsymbol{\omega})$ and approximate:
$$k(\mathbf{x}, \mathbf{x}') \approx \boldsymbol{\phi}(\mathbf{x})^\top \boldsymbol{\phi}(\mathbf{x}')$$
where $\boldsymbol{\phi}(\mathbf{x}) = \sqrt{\tfrac{2}{D}}\,[\cos(\boldsymbol{\omega}_1^\top\mathbf{x} + b_1), ..., \cos(\boldsymbol{\omega}_D^\top\mathbf{x} + b_D)]^\top$ with random phases $b_j \sim \mathrm{Uniform}(0, 2\pi)$.
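The random phase is what turns complex exponentials into real cosine features: averaging over $b$ cancels the sum-frequency term, leaving exactly the kernel (shown here for a kernel normalized so that $k(\mathbf{0}) = 1$):

$$2\,\mathbb{E}_{b}\!\left[\cos(\boldsymbol{\omega}^\top\mathbf{x} + b)\cos(\boldsymbol{\omega}^\top\mathbf{x}' + b)\right] = \cos\!\left(\boldsymbol{\omega}^\top(\mathbf{x} - \mathbf{x}')\right), \qquad \mathbb{E}_{\boldsymbol{\omega} \sim p}\!\left[\cos\!\left(\boldsymbol{\omega}^\top(\mathbf{x} - \mathbf{x}')\right)\right] = k(\mathbf{x} - \mathbf{x}')$$

so $\boldsymbol{\phi}(\mathbf{x})^\top \boldsymbol{\phi}(\mathbf{x}')$ is an unbiased Monte Carlo estimate of $k(\mathbf{x}, \mathbf{x}')$ whose variance shrinks as $1/D$.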
```python
import numpy as np
import matplotlib.pyplot as plt

class RBFRandomFourierFeatures:
    """
    Random Fourier Features approximation for the RBF kernel.
    """

    def __init__(self, n_features=100, length_scale=1.0, seed=42):
        self.D = n_features
        self.length_scale = length_scale
        np.random.seed(seed)
        # Sample frequencies from the spectral density (Gaussian for RBF)
        self.omega = np.random.randn(n_features) / length_scale
        self.b = np.random.uniform(0, 2*np.pi, n_features)

    def transform(self, X):
        """Map inputs to the random feature space."""
        X = np.atleast_1d(X).reshape(-1, 1)
        # φ(x) = sqrt(2/D) * cos(ωᵀx + b)
        projection = X @ self.omega.reshape(1, -1) + self.b
        return np.sqrt(2/self.D) * np.cos(projection)

    def approximate_kernel(self, X1, X2):
        """Approximate k(X1, X2) ≈ φ(X1)ᵀφ(X2)."""
        phi1 = self.transform(X1)
        phi2 = self.transform(X2)
        return phi1 @ phi2.T

def demo_rff_approximation():
    """Show how the RFF approximation improves with more features."""
    X = np.linspace(-3, 3, 100)

    # True RBF kernel
    sq_dists = np.subtract.outer(X, X)**2
    K_true = np.exp(-sq_dists / 2)

    fig, axes = plt.subplots(1, 4, figsize=(16, 4))
    n_features_list = [10, 50, 200, 1000]

    for ax, D in zip(axes, n_features_list):
        rff = RBFRandomFourierFeatures(n_features=D, length_scale=1.0)
        K_approx = rff.approximate_kernel(X, X)

        error = np.abs(K_true - K_approx)
        max_error = np.max(error)

        im = ax.imshow(K_approx, extent=[-3, 3, 3, -3], cmap='viridis')
        ax.set_title(f'D={D} features\nMax error: {max_error:.4f}')
        ax.set_xlabel("x'")
        ax.set_ylabel('x')

    plt.suptitle('RFF Approximation Quality vs Number of Features', fontsize=14)
    plt.tight_layout()
    return fig

demo_rff_approximation()
```

With RFF, GP regression costs O(nD² + D³) instead of O(n³). For large n and moderate D, this is a huge saving: with D = 1000 features, datasets of millions of points become tractable where exact GPs would be impossible.
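To make the cost claim concrete, here is a minimal sketch of GP regression in the random feature space: Bayesian linear regression on the $D$ cosine features (the helper name `rff_gp_regression` and all settings are illustrative). The expensive steps are forming $\Phi^\top\Phi$ in $O(nD^2)$ and solving the $D \times D$ system in $O(D^3)$; nothing scales cubically in $n$.

```python
import numpy as np

def rff_gp_regression(X_train, y_train, X_test, length_scale=1.0, noise=0.1, D=300, seed=0):
    """Approximate GP regression via Bayesian linear regression on RFF features."""
    rng = np.random.default_rng(seed)
    omega = rng.normal(size=D) / length_scale              # frequencies from the RBF spectrum
    b = rng.uniform(0, 2 * np.pi, size=D)                  # random phases

    def features(X):
        return np.sqrt(2.0 / D) * np.cos(np.outer(X, omega) + b)   # shape (n, D)

    Phi = features(X_train)
    A = Phi.T @ Phi + noise**2 * np.eye(D)                 # O(n D^2)
    w_mean = np.linalg.solve(A, Phi.T @ y_train)           # O(D^3)
    w_cov = noise**2 * np.linalg.inv(A)                    # posterior covariance of the weights

    Phi_test = features(X_test)
    mean = Phi_test @ w_mean
    var = np.sum((Phi_test @ w_cov) * Phi_test, axis=1)    # latent-function variance
    return mean, var

# Usage on toy data
rng = np.random.default_rng(1)
X = np.sort(rng.uniform(-3, 3, 500))
y = np.sin(2 * X) + 0.1 * rng.normal(size=X.size)
mu, var = rff_gp_regression(X, y, np.linspace(-3, 3, 100), length_scale=0.5)
```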
The spectral view provides powerful intuitions for kernel design and diagnosis.
Diagnosing Model-Data Mismatch
Compare the data's empirical spectrum (e.g., a periodogram of the observations or of the residuals) with the spectral density of the fitted kernel. Power at frequencies where the kernel's spectrum is essentially zero signals structure the model cannot represent, a cue to add a periodic or spectral mixture component or to shorten the length-scale.
Designing Custom Kernels
To create a kernel with specific frequency content:
1. Specify a target spectral density $S(\omega)$ with mass at the desired frequencies (e.g., Gaussian bumps at known periodicities).
2. Make sure $S$ is non-negative and symmetric about zero, so the resulting kernel is real-valued and positive definite.
3. Take the inverse Fourier transform, analytically or numerically, to obtain $k(\tau)$; alternatively, fit a Spectral Mixture kernel with components initialized at those frequencies.
A numerical sketch of these steps follows.
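As an illustration of these steps (a sketch; the spectrum shape, grid sizes, and frequencies are arbitrary choices), the snippet below builds a symmetric two-band spectrum, inverse-Fourier-transforms it into a kernel, and confirms that a Gram matrix built from that kernel is positive semi-definite, as Bochner's theorem guarantees.

```python
import numpy as np

# Step 1: a hand-designed, symmetric, non-negative spectrum:
# bumps at +-1.5 cycles per unit plus a broad low-frequency bump
n, dt = 4096, 0.01
omega = np.fft.fftshift(np.fft.fftfreq(n, d=dt))           # ordinary frequencies
S = (np.exp(-(omega - 1.5)**2 / 0.02) + np.exp(-(omega + 1.5)**2 / 0.02)
     + 2.0 * np.exp(-omega**2 / 0.1))
domega = omega[1] - omega[0]
S /= S.sum() * domega                                       # normalize so k(0) = 1

# Step 2: inverse Fourier transform, k(tau) = \int S(omega) exp(2 pi i omega tau) domega
k = np.fft.fftshift(np.fft.ifft(np.fft.ifftshift(S))).real * n * domega
# k[n//2 + j] approximates k(tau = j * dt)

# Step 3: check positive semi-definiteness of a Gram matrix built from k
idx = np.arange(0, 400, 8)                                  # a few input locations on the tau grid
K = k[np.abs(np.subtract.outer(idx, idx)) + n // 2]
print("min eigenvalue:", np.linalg.eigvalsh(K).min())       # >= 0 up to numerical error
```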
What's Next:
We've covered kernel design from multiple angles: individual kernels, composition, and spectral analysis. The final page addresses Automatic Relevance Determination (ARD)—using per-dimension length-scales for feature selection and handling high-dimensional inputs.
You now understand the spectral perspective on kernels: Bochner's theorem, spectral densities of common kernels, the powerful Spectral Mixture Kernel, and Random Fourier Features for scalability. This frequency-domain view completes your kernel design toolkit.