We have claimed that kernel functions compute inner products in some feature space. But how do we know such a feature space exists? What conditions must a function satisfy to be a valid kernel?
Mercer's theorem provides the definitive answer. This classical result from functional analysis establishes the precise conditions under which a symmetric function corresponds to an inner product in a Hilbert space. It is the theoretical foundation that justifies all of kernel methods.
Understanding Mercer's theorem deepens our insight into what kernels really are: not arbitrary similarity functions, but structured mathematical objects with rich geometric interpretations.
By the end of this page, you will be able to:
• State Mercer's theorem and understand its conditions
• Verify whether a function satisfies Mercer's conditions
• Understand the eigenfunction expansion of kernels
• Connect Mercer's theorem to reproducing kernel Hilbert spaces (RKHS)
• Apply this theory to analyze kernel properties
Mercer's theorem was proven by James Mercer in 1909, decades before machine learning existed. Mercer was studying integral equations of the form:
$$\int_a^b k(x, t) f(t) \, dt = \lambda f(x)$$
He sought conditions under which the kernel $k(x, t)$ could be expanded as a sum of products of eigenfunctions. His theorem provided these conditions and established a deep connection between symmetric positive definite functions and inner product spaces.
The ML Connection
In the 1990s, researchers working on Support Vector Machines (Boser, Guyon, Vapnik) recognized that Mercer's mathematical framework was exactly what they needed to justify the kernel trick. A function $k$ works as a kernel if and only if it satisfies Mercer's conditions—meaning there exists a feature space where $k$ computes inner products.
This connection transformed kernel methods from clever heuristics into a mathematically rigorous framework.
Mercer's theorem illustrates how mathematics developed for one purpose (integral equations in physics) can revolutionize another field (machine learning) a century later. This is why deep mathematical foundations matter—they provide lasting, transferable insights.
Let us state Mercer's theorem precisely.
Mercer's Theorem (Continuous Case)
Let $\mathcal{X}$ be a compact subset of $\mathbb{R}^d$, and let $k: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ be a continuous, symmetric function. Then $k$ is positive semi-definite (i.e., for any finite set of points $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\} \subset \mathcal{X}$ and any $\mathbf{c} \in \mathbb{R}^n$, we have $\sum_{i,j} c_i c_j k(\mathbf{x}_i, \mathbf{x}_j) \geq 0$) if and only if:
$$k(\mathbf{x}, \mathbf{x}') = \sum_{j=1}^{\infty} \lambda_j \psi_j(\mathbf{x}) \psi_j(\mathbf{x}')$$
where:
• $\lambda_1 \geq \lambda_2 \geq \cdots \geq 0$ are the (non-negative) eigenvalues of the associated integral operator,
• $\{\psi_j\}$ are the corresponding eigenfunctions, orthonormal in $L^2(\mathcal{X})$,
• the series converges absolutely and uniformly on $\mathcal{X} \times \mathcal{X}$.
Mercer's theorem says that every valid kernel can be decomposed as an infinite sum of products of eigenfunctions, weighted by eigenvalues. This decomposition directly gives us the feature map:
$$\phi(\mathbf{x}) = \left( \sqrt{\lambda_1} \psi_1(\mathbf{x}), \sqrt{\lambda_2} \psi_2(\mathbf{x}), \ldots \right)$$
The inner product of feature vectors recovers the kernel!
The Finite-Points Version
For machine learning, we typically need a slightly different formulation. For any finite collection of points, the kernel matrix must be positive semi-definite:
Definition (Positive Semi-Definite Kernel)
A function $k: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ is a positive semi-definite (p.s.d.) kernel if it is symmetric ($k(\mathbf{x}, \mathbf{x}') = k(\mathbf{x}', \mathbf{x})$) and for every $n \in \mathbb{N}$, every set of points $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\} \subset \mathcal{X}$, and every $\mathbf{c} \in \mathbb{R}^n$:
$$\sum_{i=1}^n \sum_{j=1}^n c_i c_j k(\mathbf{x}_i, \mathbf{x}_j) \geq 0$$
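As a quick numerical illustration of this definition (a sketch, not part of the formal statement; the points and coefficients below are arbitrary), we can evaluate the quadratic form $\mathbf{c}^\top K \mathbf{c}$ directly for the RBF kernel:

```python
import numpy as np

# Arbitrary points and coefficients (illustrative values only)
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(5, 2))
c = rng.normal(size=5)

# RBF kernel matrix K[i, j] = exp(-||x_i - x_j||^2)
sq_dists = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
K = np.exp(-sq_dists)

# The quadratic form sum_ij c_i c_j k(x_i, x_j) = c^T K c
quad_form = c @ K @ c
print(f"c^T K c = {quad_form:.6f}")  # non-negative for a p.s.d. kernel
```

Any choice of points and coefficients must yield a non-negative value; a single negative result would disprove validity.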
Theorem (Equivalence)
$k$ is a p.s.d. kernel if and only if there exists a Hilbert space $\mathcal{H}$ and a feature map $\phi: \mathcal{X} \rightarrow \mathcal{H}$ such that: $$k(\mathbf{x}, \mathbf{x}') = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle_{\mathcal{H}}$$
Mercer's theorem tells us that every valid kernel admits an expansion:
$$k(\mathbf{x}, \mathbf{x}') = \sum_{j=1}^{\infty} \lambda_j \psi_j(\mathbf{x}) \psi_j(\mathbf{x}')$$
Let us unpack what this means and how it reveals the feature map.
Eigenfunctions and Eigenvalues
The eigenfunctions $\{\psi_j\}$ and eigenvalues $\{\lambda_j\}$ come from the integral operator $T_k$ defined by:
$$(T_k f)(\mathbf{x}) = \int_{\mathcal{X}} k(\mathbf{x}, \mathbf{x}') f(\mathbf{x}') \, d\mu(\mathbf{x}')$$
where $\mu$ is a measure on $\mathcal{X}$ (often the uniform measure for compact $\mathcal{X}$).
The eigenfunctions satisfy: $$T_k \psi_j = \lambda_j \psi_j$$
Meaning: "smoothing" $\psi_j$ by the kernel returns $\lambda_j$ times $\psi_j$. The eigenfunctions are the fundamental "modes" of the kernel.
Constructing the Feature Map
From the Mercer expansion, we can explicitly construct the feature map:
$$\phi(\mathbf{x}) = \left( \sqrt{\lambda_1} \psi_1(\mathbf{x}), \sqrt{\lambda_2} \psi_2(\mathbf{x}), \sqrt{\lambda_3} \psi_3(\mathbf{x}), \ldots \right)$$
Then: $$\langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle = \sum_{j=1}^{\infty} \sqrt{\lambda_j} \psi_j(\mathbf{x}) \cdot \sqrt{\lambda_j} \psi_j(\mathbf{x}') = \sum_{j=1}^{\infty} \lambda_j \psi_j(\mathbf{x}) \psi_j(\mathbf{x}') = k(\mathbf{x}, \mathbf{x}')$$
The kernel is recovered as the inner product of feature vectors!
Eigenvalue Decay
The eigenvalues $\lambda_j$ decay to zero as $j \rightarrow \infty$ (for continuous kernels on compact domains). The rate of decay characterizes the kernel's smoothness: the smoother the kernel, the faster its eigenvalues decay.
Eigenvalue decay tells us how many dimensions of the feature space "matter." If $\lambda_j$ decreases quickly, only the first few eigenfunctions contribute significantly, and the feature space is effectively low-dimensional. If $\lambda_j$ decreases slowly, many eigenfunctions matter, and the effective dimension is high.
The RBF kernel has rapidly decaying eigenvalues, explaining its smooth interpolation behavior.
```python
import numpy as np
import matplotlib.pyplot as plt

def compute_mercer_expansion(X, kernel_func, n_components=10):
    """
    Compute an approximate Mercer expansion via eigendecomposition
    of the discrete kernel matrix.

    For n points, we get n eigenvalue/eigenvector pairs.
    These approximate the continuous eigenfunctions.
    """
    n = X.shape[0]

    # Compute kernel matrix
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel_func(X[i], X[j])

    # Eigendecomposition (K = V Λ V^T)
    eigenvalues, eigenvectors = np.linalg.eigh(K)

    # Sort by decreasing eigenvalue
    idx = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[idx]
    eigenvectors = eigenvectors[:, idx]

    # The continuous eigenfunctions are orthonormal in L^2; the discrete
    # eigenvectors are orthonormal in the discrete inner product and serve
    # as discrete approximations to them.
    return eigenvalues[:n_components], eigenvectors[:, :n_components]

def rbf_kernel(x, y, gamma=1.0):
    return np.exp(-gamma * np.sum((x - y)**2))

# Sample points
np.random.seed(42)
n = 100
X = np.linspace(-1, 1, n).reshape(-1, 1)

# Compute Mercer expansion for RBF kernel
eigenvalues, eigenvectors = compute_mercer_expansion(
    X, lambda x, y: rbf_kernel(x, y, gamma=1.0), n_components=20)

# Show eigenvalue decay
print("RBF Kernel Eigenvalue Decay (γ = 1.0)")
print("=" * 50)
print(f"{'j':>3} {'λⱼ':>12} {'Cumulative %':>12}")
print("-" * 50)

total = np.sum(eigenvalues)
cumulative = 0
for j, lam in enumerate(eigenvalues):
    cumulative += lam
    pct = 100 * cumulative / total
    print(f"{j+1:>3} {lam:>12.4f} {pct:>11.1f}%")
    if pct > 99.9:
        print("... (remaining eigenvalues < 0.1% of total)")
        break

# Reconstruct kernel from top k eigenfunctions
def reconstruct_kernel(eigenvalues, eigenvectors, x_idx, y_idx):
    """
    k(x, y) ≈ Σⱼ λⱼ ψⱼ(x) ψⱼ(y)
    For the discrete representation: K[i, j] ≈ Σₘ λₘ v[i, m] v[j, m]
    """
    return np.sum(eigenvalues * eigenvectors[x_idx, :] * eigenvectors[y_idx, :])

# Verify reconstruction
i, j = 25, 75
true_k = rbf_kernel(X[i], X[j], gamma=1.0)
approx_k = reconstruct_kernel(eigenvalues, eigenvectors, i, j)
print(f"Kernel reconstruction at indices ({i}, {j}):")
print(f"  True k(x,y):   {true_k:.6f}")
print(f"  Reconstructed: {approx_k:.6f}")
print(f"  Error:         {abs(true_k - approx_k):.6f}")
```

Mercer's theorem is intimately connected to Reproducing Kernel Hilbert Spaces (RKHS)—the natural function spaces associated with kernels. Understanding RKHS provides deep insight into what kernel methods actually optimize.
Definition (RKHS)
A Hilbert space $\mathcal{H}$ of functions $f: \mathcal{X} \rightarrow \mathbb{R}$ is a Reproducing Kernel Hilbert Space if there exists a function $k: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ such that:
Reproducing property: For all $f \in \mathcal{H}$ and $\mathbf{x} \in \mathcal{X}$: $$f(\mathbf{x}) = \langle f, k(\cdot, \mathbf{x}) \rangle_{\mathcal{H}}$$ (The function value at $\mathbf{x}$ is the inner product with the kernel function $k(\cdot, \mathbf{x})$)
Kernel functions are in the space: $k(\cdot, \mathbf{x}) \in \mathcal{H}$ for all $\mathbf{x} \in \mathcal{X}$
The function $k$ is the reproducing kernel of $\mathcal{H}$.
The reproducing property is remarkable: to evaluate $f(\mathbf{x})$, you take the inner product of $f$ with $k(\cdot, \mathbf{x})$—the kernel "centered" at $\mathbf{x}$.
This means the kernel encodes how to extract function values, making evaluation a Hilbert space operation. It's why kernel methods can express solutions as linear combinations of kernel functions.
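A finite-dimensional sketch of the reproducing property (assuming $f$ lies in the span of kernel sections, $f = \sum_i \alpha_i k(\cdot, \mathbf{x}_i)$, and using the standard RKHS inner product $\langle k(\cdot, \mathbf{x}), k(\cdot, \mathbf{x}') \rangle_{\mathcal{H}} = k(\mathbf{x}, \mathbf{x}')$; the coefficients and centers below are arbitrary illustrative choices):

```python
import numpy as np

def rbf(x, y, gamma=1.0):
    return np.exp(-gamma * np.sum((x - y)**2))

# f = sum_i alpha_i k(., x_i): a function in the span of kernel sections
centers = np.array([[-0.5], [0.0], [0.7]])
alpha = np.array([1.0, -2.0, 0.5])

def f(x):
    return sum(a * rbf(x, c) for a, c in zip(alpha, centers))

# Reproducing property: <f, k(., x)>_H = sum_i alpha_i <k(., x_i), k(., x)>_H
#                                      = sum_i alpha_i k(x_i, x) = f(x)
x = np.array([0.3])
inner_product = sum(a * rbf(c, x) for a, c in zip(alpha, centers))
print(f"f(x)         = {f(x):.6f}")
print(f"<f, k(., x)> = {inner_product:.6f}")  # identical by the reproducing property
```

For such finite expansions the two quantities agree exactly, which is the mechanism behind representer-theorem solutions.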
Connections to Kernels
The Moore-Aronszajn theorem states that every p.s.d. kernel uniquely determines an RKHS, and conversely, every RKHS has a unique reproducing kernel. This one-to-one correspondence means that choosing a kernel is equivalent to choosing the function space in which learning takes place.
The RKHS Norm and Smoothness
Functions in the RKHS have an associated norm $\|f\|_{\mathcal{H}}$. This norm penalizes "complexity" in a kernel-specific way: functions that vary more rapidly than the kernel's natural scale have large norm.
When we solve kernel ridge regression: $$\min_f \sum_{i=1}^n (y_i - f(\mathbf{x}_i))^2 + \lambda \|f\|_{\mathcal{H}}^2$$
we are penalizing the RKHS norm, which encourages smooth (in the kernel's sense) solutions.
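By the representer theorem, the minimizer has the form $f(\cdot) = \sum_i \alpha_i k(\cdot, \mathbf{x}_i)$ with $\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}$. A minimal kernel ridge regression sketch (the data and hyperparameters are illustrative, not from the text):

```python
import numpy as np

def rbf(x, y, gamma=10.0):
    return np.exp(-gamma * (x - y)**2)

# Synthetic 1-D regression data (illustrative)
rng = np.random.default_rng(1)
X = np.sort(rng.uniform(-1, 1, 30))
y = np.sin(3 * X) + 0.1 * rng.normal(size=30)

# Kernel matrix and ridge solution alpha = (K + lam I)^{-1} y
K = rbf(X[:, None], X[None, :])
lam = 1e-2
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

# Predictions are kernel expansions: f(x) = sum_i alpha_i k(x, x_i)
X_test = np.linspace(-1, 1, 5)
preds = rbf(X_test[:, None], X[None, :]) @ alpha
print("Predictions:", np.round(preds, 3))

# RKHS norm of the solution: ||f||_H^2 = alpha^T K alpha
rkhs_norm_sq = alpha @ K @ alpha
print(f"||f||_H^2 = {rkhs_norm_sq:.4f}")
```

Increasing `lam` shrinks $\|f\|_{\mathcal{H}}$ and yields smoother fits, directly implementing the penalty above.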
Given a candidate kernel function, how do we verify that it satisfies Mercer's conditions (i.e., is positive semi-definite)?
Method 1: Eigenvalue Check (Numerical)
For a specific set of points, check that the kernel matrix has non-negative eigenvalues:
1. Sample n points from the domain
2. Compute the n × n kernel matrix K
3. Compute eigenvalues of K
4. Verify all eigenvalues ≥ 0 (within numerical tolerance)
This only checks p.s.d. for one point set. A function could pass this test but fail for other points. However, if it fails for any point set, it's definitely not a valid kernel.
Method 2: Known Constructions
Use kernel algebra: start from known valid kernels and combine them using operations that preserve p.s.d., such as:
• sums: $k_1 + k_2$
• products: $k_1 \cdot k_2$
• positive scaling: $\alpha k$ for $\alpha > 0$
• composition with any map: $k(\phi(\mathbf{x}), \phi(\mathbf{x}'))$
• pointwise limits of valid kernels
```python
import numpy as np

def verify_psd(kernel_func, n_samples=100, n_trials=10, tol=1e-10):
    """
    Numerically verify that a kernel function is positive semi-definite
    by checking eigenvalues of kernel matrices for random point sets.

    Returns:
        bool: True if all trials pass, False otherwise
        float: minimum eigenvalue observed
    """
    min_eigenvalue_seen = float('inf')

    for trial in range(n_trials):
        # Generate random points in [-1, 1]^d with a random dimension
        d = np.random.randint(1, 5)
        X = np.random.uniform(-1, 1, size=(n_samples, d))

        # Compute kernel matrix
        K = np.zeros((n_samples, n_samples))
        for i in range(n_samples):
            for j in range(n_samples):
                K[i, j] = kernel_func(X[i], X[j])

        # Check symmetry
        if not np.allclose(K, K.T):
            print(f"  Trial {trial}: FAILED - not symmetric")
            return False, -np.inf

        # Check eigenvalues
        eigenvalues = np.linalg.eigvalsh(K)
        min_eig = np.min(eigenvalues)
        min_eigenvalue_seen = min(min_eigenvalue_seen, min_eig)

        if min_eig < -tol:
            print(f"  Trial {trial}: FAILED - negative eigenvalue: {min_eig}")
            return False, min_eig

    return True, min_eigenvalue_seen

# Test various kernel candidates
print("Testing kernel validity via eigenvalue analysis")
print("=" * 60)

# Valid kernels
def linear_kernel(x, y):
    return np.dot(x, y)

def rbf_kernel(x, y):
    return np.exp(-np.sum((x - y)**2))

def polynomial_kernel(x, y):
    return (np.dot(x, y) + 1)**3

# Invalid "kernels"
def negative_distance(x, y):
    return -np.linalg.norm(x - y)

def cosine_distance(x, y):
    return np.cos(np.linalg.norm(x - y))

def absolute_difference(x, y):
    # The L1 distance ||x - y||_1 is NOT p.s.d.
    return np.sum(np.abs(x - y))

candidates = [
    ("Linear kernel", linear_kernel),
    ("RBF kernel", rbf_kernel),
    ("Polynomial kernel (deg 3)", polynomial_kernel),
    ("-||x - y|| (neg distance)", negative_distance),
    ("cos(||x - y||)", cosine_distance),
    ("||x - y||₁ (L1 norm)", absolute_difference),
]

for name, func in candidates:
    is_valid, min_eig = verify_psd(func, n_samples=50, n_trials=5)
    status = "✓ Valid" if is_valid else "✗ Invalid"
    print(f"{name:30} {status:10} (min eigenvalue: {min_eig:.4f})")
```

Method 3: Direct Proof via Feature Map
The most rigorous method: explicitly construct a feature map $\phi$ such that $k(\mathbf{x}, \mathbf{x}') = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle$. If such a feature map exists, the kernel is automatically valid.
Example: Proving RBF is Valid
The RBF kernel can be derived from the linear kernel through valid operations:
$$k_{\text{RBF}}(\mathbf{x}, \mathbf{x}') = \exp(-\gamma\|\mathbf{x} - \mathbf{x}'\|^2)$$
Expanding: $$= \exp(-\gamma\|\mathbf{x}\|^2) \cdot \exp(2\gamma \mathbf{x}^\top\mathbf{x}') \cdot \exp(-\gamma\|\mathbf{x}'\|^2)$$
Let $c(\mathbf{x}) = \exp(-\gamma\|\mathbf{x}\|^2)$. Then: $$k_{\text{RBF}}(\mathbf{x}, \mathbf{x}') = c(\mathbf{x}) \cdot \exp(2\gamma \mathbf{x}^\top\mathbf{x}') \cdot c(\mathbf{x}')$$
We can show $\exp(\alpha \mathbf{x}^\top\mathbf{x}')$ is a valid kernel (via Taylor expansion of exp, which has all positive coefficients applied to the valid linear kernel). The $c(\mathbf{x})$ terms just rescale the feature map. Thus RBF is valid.
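This decomposition can be checked numerically: truncating the Taylor series of $\exp(2\gamma \mathbf{x}^\top\mathbf{x}')$ and multiplying by the normalizers $c(\mathbf{x})$, $c(\mathbf{x}')$ should recover the RBF value (a sketch with arbitrary illustrative points):

```python
import numpy as np
from math import factorial

gamma = 1.0
x = np.array([0.3, -0.2])
xp = np.array([0.5, 0.1])

# c(x) = exp(-gamma ||x||^2)
c = lambda v: np.exp(-gamma * np.dot(v, v))

# Truncated Taylor expansion of exp(2*gamma*<x, x'>): every term
# (2*gamma)^m <x, x'>^m / m! is a positively scaled power of the
# linear kernel, hence p.s.d.
s = np.dot(x, xp)
taylor = sum((2 * gamma * s)**m / factorial(m) for m in range(20))

approx = c(x) * taylor * c(xp)
exact = np.exp(-gamma * np.sum((x - xp)**2))
print(f"Truncated expansion: {approx:.10f}")
print(f"Exact RBF:           {exact:.10f}")
```

With 20 terms the truncation error is negligible, mirroring the argument that the full series converges to the RBF kernel.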
Watch out for these common mistakes:
• Euclidean distance $\|\mathbf{x} - \mathbf{x}'\|$ is NOT p.s.d.
• Squared distance $\|\mathbf{x} - \mathbf{x}'\|^2$ is NOT p.s.d. (it equals $k(\mathbf{x}, \mathbf{x}) + k(\mathbf{x}', \mathbf{x}') - 2k(\mathbf{x}, \mathbf{x}')$ for the linear kernel $k$; the resulting matrix has zero diagonal and positive off-diagonal entries, so it must have negative eigenvalues)
• Cosine similarity without proper normalization can sometimes fail
• Many graph similarity measures are not valid kernels
Beyond validity, some kernels have special properties that make them particularly useful. Two important classes are characteristic and universal kernels.
Characteristic Kernels
A kernel $k$ is characteristic if the mean embedding $P \mapsto \mathbb{E}_{\mathbf{x} \sim P}[k(\cdot, \mathbf{x})]$ is injective from probability distributions into the RKHS. Intuitively, different distributions have different "embeddings" in the RKHS, so we can always distinguish them.
Why this matters: Characteristic kernels enable two-sample tests ("are these two datasets from the same distribution?") using the Maximum Mean Discrepancy (MMD): $$\text{MMD}(P, Q) = \left\| \mathbb{E}_{\mathbf{x} \sim P}[k(\cdot, \mathbf{x})] - \mathbb{E}_{\mathbf{y} \sim Q}[k(\cdot, \mathbf{y})] \right\|_{\mathcal{H}}$$
For characteristic kernels: $\text{MMD}(P, Q) = 0 \Leftrightarrow P = Q$.
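A minimal sketch of the (biased, V-statistic) empirical MMD with an RBF kernel; the sample sizes and distributions below are illustrative choices:

```python
import numpy as np

def rbf_gram(A, B, gamma=1.0):
    """Cross Gram matrix K[i, j] = exp(-gamma ||a_i - b_j||^2)."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def mmd2_biased(X, Y, gamma=1.0):
    """Biased estimate of MMD^2 = E k(x,x') - 2 E k(x,y) + E k(y,y')."""
    return (rbf_gram(X, X, gamma).mean()
            - 2 * rbf_gram(X, Y, gamma).mean()
            + rbf_gram(Y, Y, gamma).mean())

rng = np.random.default_rng(0)
same = mmd2_biased(rng.normal(0, 1, (500, 1)), rng.normal(0, 1, (500, 1)))
diff = mmd2_biased(rng.normal(0, 1, (500, 1)), rng.normal(1, 1, (500, 1)))
print(f"MMD^2, same distribution:       {same:.4f}")   # near zero
print(f"MMD^2, different distributions: {diff:.4f}")   # clearly positive
```

Because the RBF kernel is characteristic, the population MMD is zero only when the two distributions coincide, which the two estimates above reflect.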
Universal Kernels
A kernel $k$ is universal if its RKHS is dense in the space of continuous functions. That is, for any continuous function $f$ and any $\epsilon > 0$, there exists $g \in \mathcal{H}$ such that $\|f - g\|_\infty < \epsilon$.
Why this matters: Universal kernels can approximate any continuous target function arbitrarily well (with enough data). This is the kernel analog of the universal approximation theorem for neural networks.
| Kernel | Characteristic? | Universal? | RKHS Dimension |
|---|---|---|---|
| Linear | No | No | Finite (= input dim) |
| Polynomial (finite degree) | No | No | Finite |
| Gaussian RBF | Yes | Yes | Infinite |
| Laplacian | Yes | Yes | Infinite |
| Matérn (ν < ∞) | Yes | Yes (on compact) | Infinite |
• Use universal kernels (like Gaussian RBF) when you want maximum flexibility—they can learn any smooth function given enough data.
• Non-universal kernels (like polynomial) impose stronger inductive biases, which can be beneficial if they match the true structure.
• For distribution testing, ensure you use a characteristic kernel; otherwise MMD = 0 doesn't imply equal distributions.
Mercer's theorem provides a spectral decomposition of kernels. Analyzing this spectrum reveals deep properties of the kernel and the functions it can represent.
The Kernel Spectrum
For a kernel on a compact domain with measure $\mu$, the eigenvalues $\{\lambda_j\}$ of the integral operator $T_k$ form the spectrum. Key observations: the eigenvalues are non-negative and can be ordered $\lambda_1 \geq \lambda_2 \geq \cdots \geq 0$, and for continuous Mercer kernels they are summable, with $\sum_j \lambda_j = \int_{\mathcal{X}} k(\mathbf{x}, \mathbf{x}) \, d\mu(\mathbf{x})$.
Eigenvalue Decay and Smoothness
For the Gaussian RBF kernel on $[-1, 1]$ with bandwidth $\gamma$, eigenvalues decay exponentially: $$\lambda_j \sim \exp(-c \cdot j)$$
This rapid decay means only finitely many dimensions effectively contribute, explaining why RBF models are effectively low-dimensional despite infinite-dimensional feature spaces.
For contrast, polynomial kernels have finite rank (finitely many non-zero eigenvalues), while the linear kernel has rank equal to input dimension.
```python
import numpy as np
import matplotlib.pyplot as plt

def kernel_spectrum(X, kernel_func, top_k=None):
    """Compute the eigenvalue spectrum of the kernel matrix."""
    n = X.shape[0]
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel_func(X[i], X[j])
    eigenvalues = np.linalg.eigvalsh(K)
    eigenvalues = np.sort(eigenvalues)[::-1]  # Descending order
    if top_k:
        return eigenvalues[:top_k]
    return eigenvalues

# Compare spectra of different kernels
np.random.seed(42)
n = 200
X = np.linspace(-1, 1, n).reshape(-1, 1)

kernels = {
    'Linear': lambda x, y: np.dot(x, y),
    'Polynomial (d=3)': lambda x, y: (np.dot(x, y) + 1)**3,
    'RBF (γ=1)': lambda x, y: np.exp(-1 * np.sum((x - y)**2)),
    'RBF (γ=10)': lambda x, y: np.exp(-10 * np.sum((x - y)**2)),
    'Laplacian': lambda x, y: np.exp(-1 * np.sum(np.abs(x - y))),
}

print("Kernel Spectral Analysis")
print("=" * 70)
print(f"{'Kernel':25} {'Top-5 eigenvalues':>45}")
print("-" * 70)

spectra = {}
for name, k_func in kernels.items():
    spectrum = kernel_spectrum(X, k_func, top_k=20)
    spectra[name] = spectrum
    top5_str = ", ".join([f"{e:.2f}" for e in spectrum[:5]])
    print(f"{name:25} {top5_str:>45}")

# Effective dimension: number of eigenvalues above 1% of the largest
print("Effective dimensions (eigenvalue > 1% of max):")
for name, spectrum in spectra.items():
    threshold = 0.01 * spectrum[0]
    eff_dim = np.sum(spectrum > threshold)
    energy_90 = np.searchsorted(np.cumsum(spectrum) / np.sum(spectrum), 0.9) + 1
    print(f"  {name:25} eff_dim: {eff_dim:3}, dims for 90% energy: {energy_90:3}")
```

Spectral Interpretation of Learning
The eigenvalue spectrum has profound implications for learning:
Large eigenvalues correspond to "easy" directions in function space—the model learns these first.
Small eigenvalues correspond to "hard" directions—these require more data or will be regularized away.
Regularization $\lambda \|f\|^2_{\mathcal{H}}$ effectively truncates the spectrum: components with $\lambda_j < \lambda$ are suppressed.
Generalization bounds depend on effective dimension—kernels with rapidly decaying spectra generalize better.
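The truncation effect of regularization can be made concrete through the shrinkage factors $\lambda_j / (\lambda_j + \lambda)$ that kernel ridge regression applies along each eigendirection (a sketch using an illustrative, exponentially decaying spectrum):

```python
import numpy as np

# Illustrative exponentially decaying spectrum, lambda_j ~ exp(-j)
j = np.arange(1, 11)
eigs = np.exp(-j)

# Ridge regression shrinks the component along eigenvector j by
# the factor lambda_j / (lambda_j + lam): near 1 when lambda_j >> lam,
# near 0 when lambda_j << lam.
for lam in [1e-1, 1e-3]:
    factors = eigs / (eigs + lam)
    kept = np.sum(factors > 0.5)  # directions that survive regularization
    print(f"lam = {lam:g}: first factors {np.round(factors[:5], 3)}, "
          f"{kept} directions with factor > 0.5")
```

Lowering `lam` keeps more eigendirections active, trading smoothness for flexibility.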
This spectral view connects kernel methods to principal component analysis: both project data onto leading eigendirections, suppressing noise in trailing directions.
Mercer's theorem provides the rigorous mathematical foundation for kernel methods, ensuring that valid kernels correspond to inner products in well-defined feature spaces.
Core Results
• Mercer's theorem: a continuous, symmetric, p.s.d. kernel admits an eigenfunction expansion $k(\mathbf{x}, \mathbf{x}') = \sum_j \lambda_j \psi_j(\mathbf{x}) \psi_j(\mathbf{x}')$, which yields an explicit feature map.
• Equivalence: $k$ is a valid kernel if and only if it computes an inner product in some Hilbert space.
• Moore-Aronszajn: every p.s.d. kernel determines a unique RKHS, and every RKHS has a unique reproducing kernel.
• Eigenvalue decay governs the kernel's effective dimension, smoothness, and generalization behavior.
You now understand the mathematical foundations of valid kernels through Mercer's theorem. In the final page of this module, we will survey common kernels—the specific kernel functions used in practice and their properties.