Every stationary kernel has a dual representation in the frequency domain. This spectral perspective reveals what kinds of oscillations and patterns a kernel encodes, provides tools for designing kernels with specific frequency content, and underlies powerful methods like the Spectral Mixture Kernel.
The connection is established by Bochner's theorem: a continuous, stationary kernel is positive definite if and only if it is the Fourier transform of a non-negative measure (the spectral density). This isn't just mathematical elegance—it's a practical tool for understanding and designing kernels.
By the end of this page, you will understand: (1) Bochner's theorem and the kernel-spectrum duality, (2) spectral densities of common kernels (RBF, Matérn), (3) the Spectral Mixture Kernel for flexible modeling, (4) how to interpret and design kernels spectrally, (5) connections to random Fourier features for scalable GPs.
Bochner's Theorem provides a complete characterization of valid stationary kernels through their spectral representation.
Theorem (Bochner, 1932)
A continuous function $k: \mathbb{R}^d \to \mathbb{R}$ is a valid positive definite kernel if and only if it can be written as:
$$k(\boldsymbol{\tau}) = \int_{\mathbb{R}^d} e^{2\pi i \boldsymbol{\omega}^\top \boldsymbol{\tau}} S(\boldsymbol{\omega}) \, d\boldsymbol{\omega}$$
where $\boldsymbol{\tau} = \mathbf{x} - \mathbf{x}'$ and $S(\boldsymbol{\omega}) \geq 0$ is the spectral density (a non-negative function integrating to $k(0)$).
Interpretation:
The kernel and spectral density form a Fourier transform pair: k(τ) = F⁻¹[S(ω)] and S(ω) = F[k(τ)]. To find the spectrum of a kernel, compute its Fourier transform. To construct a kernel from a desired spectrum, compute the inverse Fourier transform.
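As a quick sanity check of this duality, the following snippet (a minimal sketch; the length-scale and grid sizes are arbitrary choices) computes a discrete Fourier transform of an RBF kernel evaluated on a lag grid and compares it with the closed-form Gaussian spectrum derived later on this page.

```python
import numpy as np

# Numerical check of the Fourier pair for a unit-variance RBF kernel in 1D:
# k(tau) = exp(-tau^2 / (2 ell^2))  <-->  S(omega) = ell * sqrt(2*pi) * exp(-2 pi^2 ell^2 omega^2)
ell = 0.7
n, dt = 4096, 0.01
tau = (np.arange(n) - n // 2) * dt                        # symmetric lag grid
k = np.exp(-tau**2 / (2 * ell**2))

# Riemann-sum approximation of S(omega) = \int k(tau) exp(-2 pi i omega tau) dtau
S_fft = np.fft.fftshift(np.fft.fft(np.fft.ifftshift(k))).real * dt
omega = np.fft.fftshift(np.fft.fftfreq(n, d=dt))          # ordinary (cycles per unit) frequency

S_exact = ell * np.sqrt(2 * np.pi) * np.exp(-2 * np.pi**2 * ell**2 * omega**2)
print("max |S_fft - S_exact| =", np.max(np.abs(S_fft - S_exact)))   # should be tiny
```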
Why the Spectral View Matters
Kernel design: Want a GP that oscillates at specific frequencies? Define $S(\omega)$ with peaks there.
Understanding smoothness: The decay of $S(\omega)$ at high frequencies determines differentiability.
Approximation methods: Random Fourier Features approximate $k$ by sampling from $S$.
Diagnosing fit: Compare data's empirical spectrum to the model's spectral density.
Combining kernels: Kernel addition = spectral density addition (easy to understand in frequency domain).
RBF (Squared Exponential) Kernel
Kernel: $k(r) = \sigma^2 \exp\left(-\frac{r^2}{2\ell^2}\right)$
Spectral density (1D): $S(\omega) = \sigma^2 \ell \sqrt{2\pi} \exp\left(-2\pi^2 \ell^2 \omega^2\right)$
The RBF has a Gaussian spectrum—it decays extremely rapidly at high frequencies. This is why RBF produces infinitely smooth functions: high-frequency components are exponentially suppressed.
Matérn Kernel
The Matérn-$\nu$ kernel in $d$ dimensions has spectral density:
$$S(\omega) = \sigma^2 \frac{2^d \pi^{d/2} \Gamma(\nu + d/2)(2\nu)^\nu}{\Gamma(\nu)\ell^{2\nu}} \left(\frac{2\nu}{\ell^2} + 4\pi^2\omega^2\right)^{-(\nu + d/2)}$$
This is a Student-t shaped spectrum. It decays polynomially rather than exponentially, as $\omega^{-(2\nu + d)}$ at high frequencies ($\omega^{-(2\nu + 1)}$ in 1D), allowing more high-frequency content. The smaller $\nu$, the slower the decay and the rougher the functions.
| Kernel | Spectral Density Shape | High-Frequency Decay | Function Smoothness |
|---|---|---|---|
| RBF | Gaussian | Exponential (very fast) | C^∞ (infinitely differentiable) |
| Matérn-5/2 | Student-t (df=5) | Polynomial ~ ω^(-6) | C^2 (twice differentiable) |
| Matérn-3/2 | Student-t (df=3) | Polynomial ~ ω^(-4) | C^1 (once differentiable) |
| Matérn-1/2 (Exponential) | Cauchy | Polynomial ~ ω^(-2) | C^0 (continuous only) |
| Periodic | Discrete spikes at harmonics | Spike weights decay very fast | C^∞ (infinitely differentiable) |
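The polynomial decay rates in the table can be checked numerically. The sketch below (using the 1D Matérn spectral density from the formula above; the frequency range is an arbitrary choice) fits a log-log slope to the tail of $S(\omega)$ and compares it with the expected exponent $-(2\nu + 1)$.

```python
import numpy as np
from scipy.special import gamma

def matern_spectrum_1d(omega, length_scale=1.0, nu=2.5, sigma_sq=1.0):
    """1D Matérn spectral density (ordinary-frequency convention)."""
    d = 1
    coef = sigma_sq * (2**d * np.pi**(d/2) * gamma(nu + d/2) * (2*nu)**nu) / \
        (gamma(nu) * length_scale**(2*nu))
    return coef * (2*nu / length_scale**2 + 4*np.pi**2 * omega**2)**(-(nu + d/2))

# Log-log slope of the spectral tail; in 1D it should approach -(2*nu + 1).
omega = np.logspace(1, 3, 50)                       # frequencies deep in the tail
for nu in [0.5, 1.5, 2.5]:
    slope = np.polyfit(np.log(omega), np.log(matern_spectrum_1d(omega, nu=nu)), 1)[0]
    print(f"nu = {nu}: fitted slope {slope:.2f}, expected {-(2*nu + 1):.1f}")
```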
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import gamma

def rbf_spectrum(omega, sigma_sq=1.0, length_scale=1.0):
    """Spectral density of RBF kernel (1D)."""
    return sigma_sq * length_scale * np.sqrt(2*np.pi) * \
        np.exp(-2 * np.pi**2 * length_scale**2 * omega**2)

def matern_spectrum(omega, sigma_sq=1.0, length_scale=1.0, nu=2.5):
    """Spectral density of Matérn kernel (1D)."""
    d = 1  # dimension
    coef = sigma_sq * (2**d * np.pi**(d/2) * gamma(nu + d/2) * (2*nu)**nu) / \
        (gamma(nu) * length_scale**(2*nu))
    return coef * (2*nu/length_scale**2 + 4*np.pi**2*omega**2)**(-(nu + d/2))

def visualize_spectra():
    """Compare spectral densities of different kernels."""
    omega = np.linspace(-3, 3, 500)

    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Left: Different kernels, same length-scale
    axes[0].plot(omega, rbf_spectrum(omega, length_scale=0.5), label='RBF', linewidth=2)
    axes[0].plot(omega, matern_spectrum(omega, length_scale=0.5, nu=2.5), label='Matérn-5/2', linewidth=2)
    axes[0].plot(omega, matern_spectrum(omega, length_scale=0.5, nu=1.5), label='Matérn-3/2', linewidth=2)
    axes[0].plot(omega, matern_spectrum(omega, length_scale=0.5, nu=0.5), label='Matérn-1/2', linewidth=2)
    axes[0].set_title('Spectral Densities (ℓ=0.5)')
    axes[0].set_xlabel('Frequency ω')
    axes[0].set_ylabel('S(ω)')
    axes[0].legend()
    axes[0].set_yscale('log')
    axes[0].grid(True, alpha=0.3)

    # Right: RBF with different length-scales
    for ell in [0.3, 0.5, 1.0, 2.0]:
        axes[1].plot(omega, rbf_spectrum(omega, length_scale=ell), label=f'ℓ={ell}', linewidth=2)
    axes[1].set_title('RBF Spectrum vs Length-Scale')
    axes[1].set_xlabel('Frequency ω')
    axes[1].set_ylabel('S(ω)')
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)

    plt.tight_layout()
    return fig

visualize_spectra()
```

There's an inverse relationship: short length-scale ↔ wide spectrum (more high-frequency content, more wiggly), long length-scale ↔ narrow spectrum (concentrated at low frequencies, smoother). This is the uncertainty principle in action.
The Spectral Mixture (SM) Kernel, introduced by Wilson and Adams (2013), leverages the spectral representation to create an extraordinarily flexible kernel that can approximate any stationary kernel.
Key Idea: Model the spectral density as a mixture of Gaussians:
$$S(\omega) = \sum_{q=1}^{Q} w_q \mathcal{N}(\omega; \mu_q, \sigma_q^2)$$
Each Gaussian component contributes a different frequency band (in practice each component is mirrored at $-\mu_q$ so that $S(\omega)$ is symmetric and the resulting kernel is real-valued). Taking the inverse Fourier transform gives the kernel:
$$k(\tau) = \sum_{q=1}^{Q} w_q \exp(-2\pi^2 \sigma_q^2 \tau^2) \cos(2\pi \mu_q \tau)$$
Interpretation of Parameters:
$w_q$: the weight of component $q$, i.e., how much variance that frequency band contributes (the weights sum to $k(0)$).
$\mu_q$: the center frequency; the corresponding pattern repeats with period $1/\mu_q$.
$\sigma_q^2$: the bandwidth; larger values make correlations at that frequency die off faster with lag (shorter effective length-scale), while $\sigma_q^2 \to 0$ gives a pure cosine component. With $\mu_q = 0$, a component reduces to an RBF-like term.
Since any spectral density can be approximated arbitrarily well by a mixture of Gaussians, the SM kernel can approximate any stationary kernel: with enough components $Q$, it can match arbitrary correlation structures. This makes it well suited to automatic pattern discovery.
```python
import numpy as np
import matplotlib.pyplot as plt

def spectral_mixture_kernel(tau, weights, means, variances):
    """
    Compute the Spectral Mixture kernel.

    Parameters:
    -----------
    tau : array-like
        Lag/distance values
    weights : array of shape (Q,)
        Component weights w_q
    means : array of shape (Q,)
        Component center frequencies μ_q
    variances : array of shape (Q,)
        Component bandwidths σ_q²
    """
    tau = np.atleast_1d(tau)
    K = np.zeros((len(tau), len(tau)))

    for w, mu, sigma_sq in zip(weights, means, variances):
        # Each component: w * exp(-2π²σ²τ²) * cos(2πμτ)
        tau_grid = np.subtract.outer(tau, tau)
        decay = np.exp(-2 * np.pi**2 * sigma_sq * tau_grid**2)
        oscillation = np.cos(2 * np.pi * mu * tau_grid)
        K += w * decay * oscillation

    return K

def demo_spectral_mixture():
    """Demonstrate SM kernel flexibility."""
    X = np.linspace(0, 10, 200)
    n = len(X)

    fig, axes = plt.subplots(2, 3, figsize=(15, 10))

    # Example 1: Single low-frequency component (smooth)
    K1 = spectral_mixture_kernel(X, weights=[1.0], means=[0.0], variances=[0.1])
    K1 += 1e-6 * np.eye(n)

    # Example 2: Single high-frequency (oscillatory)
    K2 = spectral_mixture_kernel(X, weights=[1.0], means=[1.0], variances=[0.05])
    K2 += 1e-6 * np.eye(n)

    # Example 3: Two frequencies (complex pattern)
    K3 = spectral_mixture_kernel(X, weights=[0.5, 0.5], means=[0.3, 1.5], variances=[0.02, 0.02])
    K3 += 1e-6 * np.eye(n)

    kernels = [K1, K2, K3]
    titles = ['Low Frequency (μ=0)', 'High Frequency (μ=1)', 'Two Frequencies (μ=0.3, 1.5)']

    np.random.seed(42)
    for col, (K, title) in enumerate(zip(kernels, titles)):
        # Spectral density visualization
        omega = np.linspace(-2, 2, 200)
        if col == 0:
            S = 1.0 * np.exp(-omega**2 / (2*0.1))
        elif col == 1:
            S = 1.0 * np.exp(-(omega-1)**2 / (2*0.05))
        else:
            S = 0.5 * np.exp(-(omega-0.3)**2 / (2*0.02)) + \
                0.5 * np.exp(-(omega-1.5)**2 / (2*0.02))

        axes[0, col].plot(omega, S, 'b-', linewidth=2)
        axes[0, col].fill_between(omega, 0, S, alpha=0.3)
        axes[0, col].set_title(f'Spectrum: {title}')
        axes[0, col].set_xlabel('Frequency ω')
        axes[0, col].set_ylabel('S(ω)')
        axes[0, col].grid(True, alpha=0.3)

        # GP samples
        L = np.linalg.cholesky(K)
        for i in range(3):
            sample = L @ np.random.randn(n)
            axes[1, col].plot(X, sample, alpha=0.7)
        axes[1, col].set_title('GP Samples')
        axes[1, col].set_xlabel('x')
        axes[1, col].set_ylabel('f(x)')
        axes[1, col].grid(True, alpha=0.3)

    plt.suptitle('Spectral Mixture Kernel: Flexible Pattern Discovery', fontsize=14)
    plt.tight_layout()
    return fig

demo_spectral_mixture()
```

Practical Use of SM Kernels
When to Use SM Kernels:
Pattern discovery: the data shows quasi-periodic or multi-scale structure whose form you do not know in advance.
Extrapolation: you want learned periodicities to continue beyond the training range rather than revert to the mean.
Model mismatch: simpler kernels (RBF, Matérn, Periodic) underfit the correlation structure, and you are willing to invest in hyperparameter optimization (the SM marginal likelihood is multimodal, so initialization matters).
A minimal training sketch follows.
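For real data you would typically not hand-roll the SM kernel and its optimization. The sketch below assumes GPyTorch's `SpectralMixtureKernel` (class and method names follow the GPyTorch spectral mixture tutorial; the toy data and settings are illustrative, not from this page) and trains the mixture parameters by maximizing the marginal likelihood.

```python
import torch
import gpytorch

class SMGPModel(gpytorch.models.ExactGP):
    """Exact GP with a spectral mixture kernel (minimal sketch)."""
    def __init__(self, train_x, train_y, likelihood, num_mixtures=4):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.SpectralMixtureKernel(num_mixtures=num_mixtures)
        # Heuristic initialization of weights/frequencies/bandwidths from the data
        self.covar_module.initialize_from_data(train_x, train_y)

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x))

# Toy data with two superimposed frequencies (0.3 and 1.5 cycles per unit)
train_x = torch.linspace(0, 10, 200)
train_y = (torch.sin(2 * torch.pi * 0.3 * train_x)
           + 0.5 * torch.sin(2 * torch.pi * 1.5 * train_x)
           + 0.1 * torch.randn(train_x.size(0)))

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = SMGPModel(train_x, train_y, likelihood)

model.train()
likelihood.train()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

for _ in range(200):                     # maximize the marginal likelihood
    optimizer.zero_grad()
    loss = -mll(model(train_x), train_y)
    loss.backward()
    optimizer.step()
```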
The spectral representation enables efficient approximation of GP kernels via Random Fourier Features (RFF), a technique that reduces the O(n³) cost of exact GPs.
The Key Insight
By Bochner's theorem and the inverse Fourier transform:
$$k(\mathbf{x} - \mathbf{x}') = \int S(\boldsymbol{\omega}) e^{i\boldsymbol{\omega}^\top(\mathbf{x} - \mathbf{x}')} d\boldsymbol{\omega} = \mathbb{E}_{\boldsymbol{\omega} \sim p}\left[\phi_{\boldsymbol{\omega}}(\mathbf{x})^* \phi_{\boldsymbol{\omega}}(\mathbf{x}')\right]$$
where $p(\boldsymbol{\omega}) \propto S(\boldsymbol{\omega})$ and $\phi_{\boldsymbol{\omega}}(\mathbf{x}) = e^{i\boldsymbol{\omega}^\top\mathbf{x}}$. (In this section the $2\pi$ is absorbed into $\boldsymbol{\omega}$, so $\boldsymbol{\omega}$ is an angular frequency; for a unit-variance RBF kernel this makes $p(\boldsymbol{\omega}) = \mathcal{N}(\mathbf{0}, \ell^{-2}\mathbf{I})$, which is why the code below samples standard normals scaled by $1/\ell$.)
Monte Carlo Approximation
Sample $D$ frequencies $\boldsymbol{\omega}_1, ..., \boldsymbol{\omega}_D \sim p(\boldsymbol{\omega})$ and approximate:
$$k(\mathbf{x}, \mathbf{x}') \approx \boldsymbol{\phi}(\mathbf{x})^\top \boldsymbol{\phi}(\mathbf{x}')$$
where $\boldsymbol{\phi}(\mathbf{x}) = \sqrt{\tfrac{2}{D}}\,[\cos(\boldsymbol{\omega}_1^\top\mathbf{x} + b_1), ..., \cos(\boldsymbol{\omega}_D^\top\mathbf{x} + b_D)]^\top$ with random phases $b_j \sim \mathrm{Uniform}(0, 2\pi)$.
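The random phase is what turns complex exponentials into real cosine features: averaging over $b$ cancels the sum-frequency term, leaving exactly the kernel (shown here for a kernel normalized so that $k(\mathbf{0}) = 1$):

$$2\,\mathbb{E}_{b}\!\left[\cos(\boldsymbol{\omega}^\top\mathbf{x} + b)\cos(\boldsymbol{\omega}^\top\mathbf{x}' + b)\right] = \cos\!\left(\boldsymbol{\omega}^\top(\mathbf{x} - \mathbf{x}')\right), \qquad \mathbb{E}_{\boldsymbol{\omega} \sim p}\!\left[\cos\!\left(\boldsymbol{\omega}^\top(\mathbf{x} - \mathbf{x}')\right)\right] = k(\mathbf{x} - \mathbf{x}')$$

so $\boldsymbol{\phi}(\mathbf{x})^\top \boldsymbol{\phi}(\mathbf{x}')$ is an unbiased Monte Carlo estimate of $k(\mathbf{x}, \mathbf{x}')$ whose variance shrinks as $1/D$.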
```python
import numpy as np
import matplotlib.pyplot as plt

class RBFRandomFourierFeatures:
    """
    Random Fourier Features approximation for the RBF kernel.
    """

    def __init__(self, n_features=100, length_scale=1.0, seed=42):
        self.D = n_features
        self.length_scale = length_scale
        np.random.seed(seed)
        # Sample frequencies from the spectral density (Gaussian for RBF)
        self.omega = np.random.randn(n_features) / length_scale
        self.b = np.random.uniform(0, 2*np.pi, n_features)

    def transform(self, X):
        """Map inputs to the random feature space."""
        X = np.atleast_1d(X).reshape(-1, 1)
        # φ(x) = sqrt(2/D) * cos(ωᵀx + b)
        projection = X @ self.omega.reshape(1, -1) + self.b
        return np.sqrt(2/self.D) * np.cos(projection)

    def approximate_kernel(self, X1, X2):
        """Approximate k(X1, X2) ≈ φ(X1)ᵀφ(X2)."""
        phi1 = self.transform(X1)
        phi2 = self.transform(X2)
        return phi1 @ phi2.T

def demo_rff_approximation():
    """Show how the RFF approximation improves with more features."""
    X = np.linspace(-3, 3, 100)

    # True RBF kernel
    sq_dists = np.subtract.outer(X, X)**2
    K_true = np.exp(-sq_dists / 2)

    fig, axes = plt.subplots(1, 4, figsize=(16, 4))
    n_features_list = [10, 50, 200, 1000]

    for ax, D in zip(axes, n_features_list):
        rff = RBFRandomFourierFeatures(n_features=D, length_scale=1.0)
        K_approx = rff.approximate_kernel(X, X)

        error = np.abs(K_true - K_approx)
        max_error = np.max(error)

        im = ax.imshow(K_approx, extent=[-3, 3, 3, -3], cmap='viridis')
        ax.set_title(f'D={D} features\nMax error: {max_error:.4f}')
        ax.set_xlabel("x'")
        ax.set_ylabel('x')

    plt.suptitle('RFF Approximation Quality vs Number of Features', fontsize=14)
    plt.tight_layout()
    return fig

demo_rff_approximation()
```

With RFF, GP regression costs O(nD² + D³) instead of O(n³). For large n and moderate D, this is a huge saving: with D = 1000 features, datasets of millions of points become tractable where exact GPs would be impossible.
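To make the cost claim concrete, here is a minimal sketch of GP regression in the random feature space: Bayesian linear regression on the $D$ cosine features (the helper name `rff_gp_regression` and all settings are illustrative). The expensive steps are forming $\Phi^\top\Phi$ in $O(nD^2)$ and solving the $D \times D$ system in $O(D^3)$; nothing scales cubically in $n$.

```python
import numpy as np

def rff_gp_regression(X_train, y_train, X_test, length_scale=1.0, noise=0.1, D=300, seed=0):
    """Approximate GP regression via Bayesian linear regression on RFF features."""
    rng = np.random.default_rng(seed)
    omega = rng.normal(size=D) / length_scale              # frequencies from the RBF spectrum
    b = rng.uniform(0, 2 * np.pi, size=D)                  # random phases

    def features(X):
        return np.sqrt(2.0 / D) * np.cos(np.outer(X, omega) + b)   # shape (n, D)

    Phi = features(X_train)
    A = Phi.T @ Phi + noise**2 * np.eye(D)                 # O(n D^2)
    w_mean = np.linalg.solve(A, Phi.T @ y_train)           # O(D^3)
    w_cov = noise**2 * np.linalg.inv(A)                    # posterior covariance of the weights

    Phi_test = features(X_test)
    mean = Phi_test @ w_mean
    var = np.sum((Phi_test @ w_cov) * Phi_test, axis=1)    # latent-function variance
    return mean, var

# Usage on toy data
rng = np.random.default_rng(1)
X = np.sort(rng.uniform(-3, 3, 500))
y = np.sin(2 * X) + 0.1 * rng.normal(size=X.size)
mu, var = rff_gp_regression(X, y, np.linspace(-3, 3, 100), length_scale=0.5)
```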
The spectral view provides powerful intuitions for kernel design and diagnosis.
Diagnosing Model-Data Mismatch
Compare the data's empirical spectrum (e.g., a periodogram of the observations or of the residuals) with the spectral density of the fitted kernel. Power at frequencies where the kernel's spectrum is essentially zero signals structure the model cannot represent, a cue to add a periodic or spectral mixture component or to shorten the length-scale.
Designing Custom Kernels
To create a kernel with specific frequency content:
1. Specify a target spectral density $S(\omega)$ with mass at the desired frequencies (e.g., Gaussian bumps at known periodicities).
2. Make sure $S$ is non-negative and symmetric about zero, so the resulting kernel is real-valued and positive definite.
3. Take the inverse Fourier transform, analytically or numerically, to obtain $k(\tau)$; alternatively, fit a Spectral Mixture kernel with components initialized at those frequencies.
A numerical sketch of these steps follows.
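As an illustration of these steps (a sketch; the spectrum shape, grid sizes, and frequencies are arbitrary choices), the snippet below builds a symmetric two-band spectrum, inverse-Fourier-transforms it into a kernel, and confirms that a Gram matrix built from that kernel is positive semi-definite, as Bochner's theorem guarantees.

```python
import numpy as np

# Step 1: a hand-designed, symmetric, non-negative spectrum:
# bumps at +-1.5 cycles per unit plus a broad low-frequency bump
n, dt = 4096, 0.01
omega = np.fft.fftshift(np.fft.fftfreq(n, d=dt))           # ordinary frequencies
S = (np.exp(-(omega - 1.5)**2 / 0.02) + np.exp(-(omega + 1.5)**2 / 0.02)
     + 2.0 * np.exp(-omega**2 / 0.1))
domega = omega[1] - omega[0]
S /= S.sum() * domega                                       # normalize so k(0) = 1

# Step 2: inverse Fourier transform, k(tau) = \int S(omega) exp(2 pi i omega tau) domega
k = np.fft.fftshift(np.fft.ifft(np.fft.ifftshift(S))).real * n * domega
# k[n//2 + j] approximates k(tau = j * dt)

# Step 3: check positive semi-definiteness of a Gram matrix built from k
idx = np.arange(0, 400, 8)                                  # a few input locations on the tau grid
K = k[np.abs(np.subtract.outer(idx, idx)) + n // 2]
print("min eigenvalue:", np.linalg.eigvalsh(K).min())       # >= 0 up to numerical error
```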
What's Next:
We've covered kernel design from multiple angles: individual kernels, composition, and spectral analysis. The final page addresses Automatic Relevance Determination (ARD)—using per-dimension length-scales for feature selection and handling high-dimensional inputs.
You now understand the spectral perspective on kernels: Bochner's theorem, spectral densities of common kernels, the powerful Spectral Mixture Kernel, and Random Fourier Features for scalability. This frequency-domain view completes your kernel design toolkit.