We have claimed that kernel functions compute inner products in some feature space. But how do we know such a feature space exists? What conditions must a function satisfy to be a valid kernel?
Mercer's theorem provides the definitive answer. This classical result from functional analysis establishes the precise conditions under which a symmetric function corresponds to an inner product in a Hilbert space. It is the theoretical foundation that justifies all of kernel methods.
Understanding Mercer's theorem deepens our insight into what kernels really are: not arbitrary similarity functions, but structured mathematical objects with rich geometric interpretations.
By the end of this page, you will be able to:
• State Mercer's theorem and understand its conditions
• Verify whether a function satisfies Mercer's conditions
• Understand the eigenfunction expansion of kernels
• Connect Mercer's theorem to reproducing kernel Hilbert spaces (RKHS)
• Apply this theory to analyze kernel properties
Mercer's theorem was proven by James Mercer in 1909, decades before machine learning existed. Mercer was studying integral equations of the form:
$$\int_a^b k(x, t) f(t) \, dt = \lambda f(x)$$
He sought conditions under which the kernel $k(x, t)$ could be expanded as a sum of products of eigenfunctions. His theorem provided these conditions and established a deep connection between symmetric positive definite functions and inner product spaces.
The ML Connection
In the 1990s, researchers working on Support Vector Machines (Boser, Guyon, Vapnik) recognized that Mercer's mathematical framework was exactly what they needed to justify the kernel trick. A function $k$ works as a kernel if and only if it satisfies Mercer's conditions—meaning there exists a feature space where $k$ computes inner products.
This connection transformed kernel methods from clever heuristics into a mathematically rigorous framework.
Mercer's theorem illustrates how mathematics developed for one purpose (integral equations in physics) can revolutionize another field (machine learning) a century later. This is why deep mathematical foundations matter—they provide lasting, transferable insights.
Let us state Mercer's theorem precisely.
Mercer's Theorem (Continuous Case)
Let $\mathcal{X}$ be a compact subset of $\mathbb{R}^d$, and let $k: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ be a continuous, symmetric function. Then $k$ is positive semi-definite (i.e., for any finite set of points $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\} \subset \mathcal{X}$ and any $\mathbf{c} \in \mathbb{R}^n$, we have $\sum_{i,j} c_i c_j k(\mathbf{x}_i, \mathbf{x}_j) \geq 0$) if and only if:
$$k(\mathbf{x}, \mathbf{x}') = \sum_{j=1}^{\infty} \lambda_j \psi_j(\mathbf{x}) \psi_j(\mathbf{x}')$$
where:
• $\lambda_1 \geq \lambda_2 \geq \cdots \geq 0$ are the (non-negative) eigenvalues of the associated integral operator,
• $\{\psi_j\}$ are the corresponding eigenfunctions, orthonormal in $L^2(\mathcal{X})$,
• the series converges absolutely and uniformly on $\mathcal{X} \times \mathcal{X}$.
Mercer's theorem says that every valid kernel can be decomposed as an infinite sum of products of eigenfunctions, weighted by eigenvalues. This decomposition directly gives us the feature map:
$$\phi(\mathbf{x}) = \left( \sqrt{\lambda_1} \psi_1(\mathbf{x}), \sqrt{\lambda_2} \psi_2(\mathbf{x}), \ldots \right)$$
The inner product of feature vectors recovers the kernel!
The Finite-Points Version
For machine learning, we typically need a slightly different formulation. For any finite collection of points, the kernel matrix must be positive semi-definite:
Definition (Positive Semi-Definite Kernel)
A function $k: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ is a positive semi-definite (p.s.d.) kernel if it is symmetric ($k(\mathbf{x}, \mathbf{x}') = k(\mathbf{x}', \mathbf{x})$) and for every $n \in \mathbb{N}$, every set of points $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\} \subset \mathcal{X}$, and every $\mathbf{c} \in \mathbb{R}^n$:
$$\sum_{i=1}^n \sum_{j=1}^n c_i c_j k(\mathbf{x}_i, \mathbf{x}_j) \geq 0$$
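As a quick numerical illustration of this definition (a sketch, not part of the formal statement; the points and coefficients below are arbitrary), we can evaluate the quadratic form $\mathbf{c}^\top K \mathbf{c}$ directly for the RBF kernel:

```python
import numpy as np

# Arbitrary points and coefficients (illustrative values only)
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(5, 2))
c = rng.normal(size=5)

# RBF kernel matrix K[i, j] = exp(-||x_i - x_j||^2)
sq_dists = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
K = np.exp(-sq_dists)

# The quadratic form sum_ij c_i c_j k(x_i, x_j) = c^T K c
quad_form = c @ K @ c
print(f"c^T K c = {quad_form:.6f}")  # non-negative for a p.s.d. kernel
```

Any choice of points and coefficients must yield a non-negative value; a single negative result would disprove validity.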
Theorem (Equivalence)
$k$ is a p.s.d. kernel if and only if there exists a Hilbert space $\mathcal{H}$ and a feature map $\phi: \mathcal{X} \rightarrow \mathcal{H}$ such that: $$k(\mathbf{x}, \mathbf{x}') = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle_{\mathcal{H}}$$
Mercer's theorem tells us that every valid kernel admits an expansion:
$$k(\mathbf{x}, \mathbf{x}') = \sum_{j=1}^{\infty} \lambda_j \psi_j(\mathbf{x}) \psi_j(\mathbf{x}')$$
Let us unpack what this means and how it reveals the feature map.
Eigenfunctions and Eigenvalues
The eigenfunctions $\{\psi_j\}$ and eigenvalues $\{\lambda_j\}$ come from the integral operator $T_k$ defined by:
$$(T_k f)(\mathbf{x}) = \int_{\mathcal{X}} k(\mathbf{x}, \mathbf{x}') f(\mathbf{x}') \, d\mu(\mathbf{x}')$$
where $\mu$ is a measure on $\mathcal{X}$ (often the uniform measure for compact $\mathcal{X}$).
The eigenfunctions satisfy: $$T_k \psi_j = \lambda_j \psi_j$$
Meaning: "smoothing" $\psi_j$ by the kernel returns $\lambda_j$ times $\psi_j$. The eigenfunctions are the fundamental "modes" of the kernel.
Constructing the Feature Map
From the Mercer expansion, we can explicitly construct the feature map:
$$\phi(\mathbf{x}) = \left( \sqrt{\lambda_1} \psi_1(\mathbf{x}), \sqrt{\lambda_2} \psi_2(\mathbf{x}), \sqrt{\lambda_3} \psi_3(\mathbf{x}), \ldots \right)$$
Then: $$\langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle = \sum_{j=1}^{\infty} \sqrt{\lambda_j} \psi_j(\mathbf{x}) \cdot \sqrt{\lambda_j} \psi_j(\mathbf{x}') = \sum_{j=1}^{\infty} \lambda_j \psi_j(\mathbf{x}) \psi_j(\mathbf{x}') = k(\mathbf{x}, \mathbf{x}')$$
The kernel is recovered as the inner product of feature vectors!
Eigenvalue Decay
The eigenvalues $\lambda_j$ decay to zero as $j \rightarrow \infty$ (for continuous kernels on compact domains). The rate of decay characterizes the kernel's smoothness: the smoother the kernel, the faster its eigenvalues decay.
Eigenvalue decay tells us how many dimensions of the feature space "matter." If $\lambda_j$ decreases quickly, only the first few eigenfunctions contribute significantly, and the feature space is effectively low-dimensional. If $\lambda_j$ decreases slowly, many eigenfunctions matter, and the effective dimension is high.
The RBF kernel has rapidly decaying eigenvalues, explaining its smooth interpolation behavior.
```python
import numpy as np
import matplotlib.pyplot as plt

def compute_mercer_expansion(X, kernel_func, n_components=10):
    """
    Compute an approximate Mercer expansion via eigendecomposition
    of the discrete kernel matrix.

    For n points, we get n eigenvalue/eigenvector pairs.
    These approximate the continuous eigenfunctions.
    """
    n = X.shape[0]

    # Compute kernel matrix
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel_func(X[i], X[j])

    # Eigendecomposition (K = V Λ V^T)
    eigenvalues, eigenvectors = np.linalg.eigh(K)

    # Sort by decreasing eigenvalue
    idx = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[idx]
    eigenvectors = eigenvectors[:, idx]

    # The continuous eigenfunctions are orthonormal in L^2; the discrete
    # eigenvectors are orthonormal in the discrete inner product and serve
    # as discrete approximations to them.
    return eigenvalues[:n_components], eigenvectors[:, :n_components]

def rbf_kernel(x, y, gamma=1.0):
    return np.exp(-gamma * np.sum((x - y)**2))

# Sample points
np.random.seed(42)
n = 100
X = np.linspace(-1, 1, n).reshape(-1, 1)

# Compute Mercer expansion for RBF kernel
eigenvalues, eigenvectors = compute_mercer_expansion(
    X, lambda x, y: rbf_kernel(x, y, gamma=1.0), n_components=20)

# Show eigenvalue decay
print("RBF Kernel Eigenvalue Decay (γ = 1.0)")
print("=" * 50)
print(f"{'j':>3} {'λⱼ':>12} {'Cumulative %':>12}")
print("-" * 50)

total = np.sum(eigenvalues)
cumulative = 0
for j, lam in enumerate(eigenvalues):
    cumulative += lam
    pct = 100 * cumulative / total
    print(f"{j+1:>3} {lam:>12.4f} {pct:>11.1f}%")
    if pct > 99.9:
        print("... (remaining eigenvalues < 0.1% of total)")
        break

# Reconstruct kernel from top k eigenfunctions
def reconstruct_kernel(eigenvalues, eigenvectors, x_idx, y_idx):
    """
    k(x, y) ≈ Σⱼ λⱼ ψⱼ(x) ψⱼ(y)
    For the discrete representation: K[i, j] ≈ Σₘ λₘ v[i, m] v[j, m]
    """
    return np.sum(eigenvalues * eigenvectors[x_idx, :] * eigenvectors[y_idx, :])

# Verify reconstruction
i, j = 25, 75
true_k = rbf_kernel(X[i], X[j], gamma=1.0)
approx_k = reconstruct_kernel(eigenvalues, eigenvectors, i, j)
print(f"Kernel reconstruction at indices ({i}, {j}):")
print(f"  True k(x,y):   {true_k:.6f}")
print(f"  Reconstructed: {approx_k:.6f}")
print(f"  Error:         {abs(true_k - approx_k):.6f}")
```

Mercer's theorem is intimately connected to Reproducing Kernel Hilbert Spaces (RKHS)—the natural function spaces associated with kernels. Understanding RKHS provides deep insight into what kernel methods actually optimize.
Definition (RKHS)
A Hilbert space $\mathcal{H}$ of functions $f: \mathcal{X} \rightarrow \mathbb{R}$ is a Reproducing Kernel Hilbert Space if there exists a function $k: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ such that:
Reproducing property: For all $f \in \mathcal{H}$ and $\mathbf{x} \in \mathcal{X}$: $$f(\mathbf{x}) = \langle f, k(\cdot, \mathbf{x}) \rangle_{\mathcal{H}}$$ (The function value at $\mathbf{x}$ is the inner product with the kernel function $k(\cdot, \mathbf{x})$)
Kernel functions are in the space: $k(\cdot, \mathbf{x}) \in \mathcal{H}$ for all $\mathbf{x} \in \mathcal{X}$
The function $k$ is the reproducing kernel of $\mathcal{H}$.
The reproducing property is remarkable: to evaluate $f(\mathbf{x})$, you take the inner product of $f$ with $k(\cdot, \mathbf{x})$—the kernel "centered" at $\mathbf{x}$.
This means the kernel encodes how to extract function values, making evaluation a Hilbert space operation. It's why kernel methods can express solutions as linear combinations of kernel functions.
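A finite-dimensional sketch of the reproducing property (assuming $f$ lies in the span of kernel sections, $f = \sum_i \alpha_i k(\cdot, \mathbf{x}_i)$, and using the standard RKHS inner product $\langle k(\cdot, \mathbf{x}), k(\cdot, \mathbf{x}') \rangle_{\mathcal{H}} = k(\mathbf{x}, \mathbf{x}')$; the coefficients and centers below are arbitrary illustrative choices):

```python
import numpy as np

def rbf(x, y, gamma=1.0):
    return np.exp(-gamma * np.sum((x - y)**2))

# f = sum_i alpha_i k(., x_i): a function in the span of kernel sections
centers = np.array([[-0.5], [0.0], [0.7]])
alpha = np.array([1.0, -2.0, 0.5])

def f(x):
    return sum(a * rbf(x, c) for a, c in zip(alpha, centers))

# Reproducing property: <f, k(., x)>_H = sum_i alpha_i <k(., x_i), k(., x)>_H
#                                      = sum_i alpha_i k(x_i, x) = f(x)
x = np.array([0.3])
inner_product = sum(a * rbf(c, x) for a, c in zip(alpha, centers))
print(f"f(x)         = {f(x):.6f}")
print(f"<f, k(., x)> = {inner_product:.6f}")  # identical by the reproducing property
```

For such finite expansions the two quantities agree exactly, which is the mechanism behind representer-theorem solutions.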
Connections to Kernels
The Moore-Aronszajn theorem states that every p.s.d. kernel uniquely determines an RKHS, and conversely, every RKHS has a unique reproducing kernel. This one-to-one correspondence means that choosing a kernel is equivalent to choosing the function space in which learning takes place.
The RKHS Norm and Smoothness
Functions in the RKHS have an associated norm $\|f\|_{\mathcal{H}}$. This norm penalizes "complexity" in a kernel-specific way: functions that vary more rapidly than the kernel's natural scale have large norm.
When we solve kernel ridge regression: $$\min_f \sum_{i=1}^n (y_i - f(\mathbf{x}_i))^2 + \lambda \|f\|_{\mathcal{H}}^2$$
we are penalizing the RKHS norm, which encourages smooth (in the kernel's sense) solutions.
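By the representer theorem, the minimizer has the form $f(\cdot) = \sum_i \alpha_i k(\cdot, \mathbf{x}_i)$ with $\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}$. A minimal kernel ridge regression sketch (the data and hyperparameters are illustrative, not from the text):

```python
import numpy as np

def rbf(x, y, gamma=10.0):
    return np.exp(-gamma * (x - y)**2)

# Synthetic 1-D regression data (illustrative)
rng = np.random.default_rng(1)
X = np.sort(rng.uniform(-1, 1, 30))
y = np.sin(3 * X) + 0.1 * rng.normal(size=30)

# Kernel matrix and ridge solution alpha = (K + lam I)^{-1} y
K = rbf(X[:, None], X[None, :])
lam = 1e-2
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

# Predictions are kernel expansions: f(x) = sum_i alpha_i k(x, x_i)
X_test = np.linspace(-1, 1, 5)
preds = rbf(X_test[:, None], X[None, :]) @ alpha
print("Predictions:", np.round(preds, 3))

# RKHS norm of the solution: ||f||_H^2 = alpha^T K alpha
rkhs_norm_sq = alpha @ K @ alpha
print(f"||f||_H^2 = {rkhs_norm_sq:.4f}")
```

Increasing `lam` shrinks $\|f\|_{\mathcal{H}}$ and yields smoother fits, directly implementing the penalty above.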
Given a candidate kernel function, how do we verify that it satisfies Mercer's conditions (i.e., is positive semi-definite)?
Method 1: Eigenvalue Check (Numerical)
For a specific set of points, check that the kernel matrix has non-negative eigenvalues:
1. Sample n points from the domain
2. Compute the n × n kernel matrix K
3. Compute eigenvalues of K
4. Verify all eigenvalues ≥ 0 (within numerical tolerance)
This only checks p.s.d. for one point set. A function could pass this test but fail for other points. However, if it fails for any point set, it's definitely not a valid kernel.
Method 2: Known Constructions
Use kernel algebra: start from known valid kernels and combine them using operations that preserve p.s.d., such as:
• sums: $k_1 + k_2$
• products: $k_1 \cdot k_2$
• positive scaling: $\alpha k$ for $\alpha > 0$
• composition with any map: $k(\phi(\mathbf{x}), \phi(\mathbf{x}'))$
• pointwise limits of valid kernels
```python
import numpy as np

def verify_psd(kernel_func, n_samples=100, n_trials=10, tol=1e-10):
    """
    Numerically verify that a kernel function is positive semi-definite
    by checking eigenvalues of kernel matrices for random point sets.

    Returns:
        bool: True if all trials pass, False otherwise
        float: minimum eigenvalue observed
    """
    min_eigenvalue_seen = float('inf')

    for trial in range(n_trials):
        # Generate random points in [-1, 1]^d with a random dimension
        d = np.random.randint(1, 5)
        X = np.random.uniform(-1, 1, size=(n_samples, d))

        # Compute kernel matrix
        K = np.zeros((n_samples, n_samples))
        for i in range(n_samples):
            for j in range(n_samples):
                K[i, j] = kernel_func(X[i], X[j])

        # Check symmetry
        if not np.allclose(K, K.T):
            print(f"  Trial {trial}: FAILED - not symmetric")
            return False, -np.inf

        # Check eigenvalues
        eigenvalues = np.linalg.eigvalsh(K)
        min_eig = np.min(eigenvalues)
        min_eigenvalue_seen = min(min_eigenvalue_seen, min_eig)

        if min_eig < -tol:
            print(f"  Trial {trial}: FAILED - negative eigenvalue: {min_eig}")
            return False, min_eig

    return True, min_eigenvalue_seen

# Test various kernel candidates
print("Testing kernel validity via eigenvalue analysis")
print("=" * 60)

# Valid kernels
def linear_kernel(x, y):
    return np.dot(x, y)

def rbf_kernel(x, y):
    return np.exp(-np.sum((x - y)**2))

def polynomial_kernel(x, y):
    return (np.dot(x, y) + 1)**3

# Invalid "kernels"
def negative_distance(x, y):
    return -np.linalg.norm(x - y)

def cosine_distance(x, y):
    return np.cos(np.linalg.norm(x - y))

def absolute_difference(x, y):
    # The L1 distance ||x - y||_1 is NOT p.s.d.
    return np.sum(np.abs(x - y))

candidates = [
    ("Linear kernel", linear_kernel),
    ("RBF kernel", rbf_kernel),
    ("Polynomial kernel (deg 3)", polynomial_kernel),
    ("-||x - y|| (neg distance)", negative_distance),
    ("cos(||x - y||)", cosine_distance),
    ("||x - y||₁ (L1 norm)", absolute_difference),
]

for name, func in candidates:
    is_valid, min_eig = verify_psd(func, n_samples=50, n_trials=5)
    status = "✓ Valid" if is_valid else "✗ Invalid"
    print(f"{name:30} {status:10} (min eigenvalue: {min_eig:.4f})")
```

Method 3: Direct Proof via Feature Map
The most rigorous method: explicitly construct a feature map $\phi$ such that $k(\mathbf{x}, \mathbf{x}') = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle$. If such a feature map exists, the kernel is automatically valid.
Example: Proving RBF is Valid
The RBF kernel can be derived from the linear kernel through valid operations:
$$k_{\text{RBF}}(\mathbf{x}, \mathbf{x}') = \exp(-\gamma\|\mathbf{x} - \mathbf{x}'\|^2)$$
Expanding: $$= \exp(-\gamma\|\mathbf{x}\|^2) \cdot \exp(2\gamma \mathbf{x}^\top\mathbf{x}') \cdot \exp(-\gamma\|\mathbf{x}'\|^2)$$
Let $c(\mathbf{x}) = \exp(-\gamma\|\mathbf{x}\|^2)$. Then: $$k_{\text{RBF}}(\mathbf{x}, \mathbf{x}') = c(\mathbf{x}) \cdot \exp(2\gamma \mathbf{x}^\top\mathbf{x}') \cdot c(\mathbf{x}')$$
We can show $\exp(\alpha \mathbf{x}^\top\mathbf{x}')$ is a valid kernel (via Taylor expansion of exp, which has all positive coefficients applied to the valid linear kernel). The $c(\mathbf{x})$ terms just rescale the feature map. Thus RBF is valid.
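This decomposition can be checked numerically: truncating the Taylor series of $\exp(2\gamma \mathbf{x}^\top\mathbf{x}')$ and multiplying by the normalizers $c(\mathbf{x})$, $c(\mathbf{x}')$ should recover the RBF value (a sketch with arbitrary illustrative points):

```python
import numpy as np
from math import factorial

gamma = 1.0
x = np.array([0.3, -0.2])
xp = np.array([0.5, 0.1])

# c(x) = exp(-gamma ||x||^2)
c = lambda v: np.exp(-gamma * np.dot(v, v))

# Truncated Taylor expansion of exp(2*gamma*<x, x'>): every term
# (2*gamma)^m <x, x'>^m / m! is a positively scaled power of the
# linear kernel, hence p.s.d.
s = np.dot(x, xp)
taylor = sum((2 * gamma * s)**m / factorial(m) for m in range(20))

approx = c(x) * taylor * c(xp)
exact = np.exp(-gamma * np.sum((x - xp)**2))
print(f"Truncated expansion: {approx:.10f}")
print(f"Exact RBF:           {exact:.10f}")
```

With 20 terms the truncation error is negligible, mirroring the argument that the full series converges to the RBF kernel.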
Watch out for these common mistakes:
• Euclidean distance $\|\mathbf{x} - \mathbf{x}'\|$ is NOT p.s.d.
• Squared distance $\|\mathbf{x} - \mathbf{x}'\|^2$ is NOT p.s.d. (it equals $k(\mathbf{x}, \mathbf{x}) + k(\mathbf{x}', \mathbf{x}') - 2k(\mathbf{x}, \mathbf{x}')$ for the linear kernel $k$; the resulting matrix has zero diagonal and positive off-diagonal entries, so it must have negative eigenvalues)
• Cosine similarity without proper normalization can sometimes fail
• Many graph similarity measures are not valid kernels
Beyond validity, some kernels have special properties that make them particularly useful. Two important classes are characteristic and universal kernels.
Characteristic Kernels
A kernel $k$ is characteristic if the mean embedding $P \mapsto \mathbb{E}_{\mathbf{x} \sim P}[k(\cdot, \mathbf{x})]$ is injective from probability distributions into the RKHS. Intuitively, different distributions have different "embeddings" in the RKHS, so we can always distinguish them.
Why this matters: Characteristic kernels enable two-sample tests ("are these two datasets from the same distribution?") using the Maximum Mean Discrepancy (MMD): $$\text{MMD}(P, Q) = \left\| \mathbb{E}_{\mathbf{x} \sim P}[k(\cdot, \mathbf{x})] - \mathbb{E}_{\mathbf{y} \sim Q}[k(\cdot, \mathbf{y})] \right\|_{\mathcal{H}}$$
For characteristic kernels: $\text{MMD}(P, Q) = 0 \Leftrightarrow P = Q$.
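A minimal sketch of the (biased, V-statistic) empirical MMD with an RBF kernel; the sample sizes and distributions below are illustrative choices:

```python
import numpy as np

def rbf_gram(A, B, gamma=1.0):
    """Cross Gram matrix K[i, j] = exp(-gamma ||a_i - b_j||^2)."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def mmd2_biased(X, Y, gamma=1.0):
    """Biased estimate of MMD^2 = E k(x,x') - 2 E k(x,y) + E k(y,y')."""
    return (rbf_gram(X, X, gamma).mean()
            - 2 * rbf_gram(X, Y, gamma).mean()
            + rbf_gram(Y, Y, gamma).mean())

rng = np.random.default_rng(0)
same = mmd2_biased(rng.normal(0, 1, (500, 1)), rng.normal(0, 1, (500, 1)))
diff = mmd2_biased(rng.normal(0, 1, (500, 1)), rng.normal(1, 1, (500, 1)))
print(f"MMD^2, same distribution:       {same:.4f}")   # near zero
print(f"MMD^2, different distributions: {diff:.4f}")   # clearly positive
```

Because the RBF kernel is characteristic, the population MMD is zero only when the two distributions coincide, which the two estimates above reflect.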
Universal Kernels
A kernel $k$ is universal if its RKHS is dense in the space of continuous functions. That is, for any continuous function $f$ and any $\epsilon > 0$, there exists $g \in \mathcal{H}$ such that $\|f - g\|_\infty < \epsilon$.
Why this matters: Universal kernels can approximate any continuous target function arbitrarily well (with enough data). This is the kernel analog of the universal approximation theorem for neural networks.
| Kernel | Characteristic? | Universal? | RKHS Dimension |
|---|---|---|---|
| Linear | No | No | Finite (= input dim) |
| Polynomial (finite degree) | No | No | Finite |
| Gaussian RBF | Yes | Yes | Infinite |
| Laplacian | Yes | Yes | Infinite |
| Matérn (ν < ∞) | Yes | Yes (on compact) | Infinite |
• Use universal kernels (like Gaussian RBF) when you want maximum flexibility—they can learn any smooth function given enough data.
• Non-universal kernels (like polynomial) impose stronger inductive biases, which can be beneficial if they match the true structure.
• For distribution testing, ensure you use a characteristic kernel; otherwise MMD = 0 doesn't imply equal distributions.
Mercer's theorem provides a spectral decomposition of kernels. Analyzing this spectrum reveals deep properties of the kernel and the functions it can represent.
The Kernel Spectrum
For a kernel on a compact domain with measure $\mu$, the eigenvalues $\{\lambda_j\}$ of the integral operator $T_k$ form the spectrum. Key observations: the eigenvalues are non-negative and can be ordered $\lambda_1 \geq \lambda_2 \geq \cdots \geq 0$, and for continuous Mercer kernels they are summable, with $\sum_j \lambda_j = \int_{\mathcal{X}} k(\mathbf{x}, \mathbf{x}) \, d\mu(\mathbf{x})$.
Eigenvalue Decay and Smoothness
For the Gaussian RBF kernel on $[-1, 1]$ with bandwidth $\gamma$, eigenvalues decay exponentially: $$\lambda_j \sim \exp(-c \cdot j)$$
This rapid decay means only finitely many dimensions effectively contribute, explaining why RBF models are effectively low-dimensional despite infinite-dimensional feature spaces.
For contrast, polynomial kernels have finite rank (finitely many non-zero eigenvalues), while the linear kernel has rank equal to input dimension.
```python
import numpy as np
import matplotlib.pyplot as plt

def kernel_spectrum(X, kernel_func, top_k=None):
    """Compute the eigenvalue spectrum of the kernel matrix."""
    n = X.shape[0]
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel_func(X[i], X[j])
    eigenvalues = np.linalg.eigvalsh(K)
    eigenvalues = np.sort(eigenvalues)[::-1]  # Descending order
    if top_k:
        return eigenvalues[:top_k]
    return eigenvalues

# Compare spectra of different kernels
np.random.seed(42)
n = 200
X = np.linspace(-1, 1, n).reshape(-1, 1)

kernels = {
    'Linear': lambda x, y: np.dot(x, y),
    'Polynomial (d=3)': lambda x, y: (np.dot(x, y) + 1)**3,
    'RBF (γ=1)': lambda x, y: np.exp(-1 * np.sum((x - y)**2)),
    'RBF (γ=10)': lambda x, y: np.exp(-10 * np.sum((x - y)**2)),
    'Laplacian': lambda x, y: np.exp(-1 * np.sum(np.abs(x - y))),
}

print("Kernel Spectral Analysis")
print("=" * 70)
print(f"{'Kernel':25} {'Top-5 eigenvalues':>45}")
print("-" * 70)

spectra = {}
for name, k_func in kernels.items():
    spectrum = kernel_spectrum(X, k_func, top_k=20)
    spectra[name] = spectrum
    top5_str = ", ".join([f"{e:.2f}" for e in spectrum[:5]])
    print(f"{name:25} {top5_str:>45}")

# Effective dimension: number of eigenvalues above 1% of the largest
print("Effective dimensions (eigenvalue > 1% of max):")
for name, spectrum in spectra.items():
    threshold = 0.01 * spectrum[0]
    eff_dim = np.sum(spectrum > threshold)
    energy_90 = np.searchsorted(np.cumsum(spectrum) / np.sum(spectrum), 0.9) + 1
    print(f"  {name:25} eff_dim: {eff_dim:3}, dims for 90% energy: {energy_90:3}")
```

Spectral Interpretation of Learning
The eigenvalue spectrum has profound implications for learning:
Large eigenvalues correspond to "easy" directions in function space—the model learns these first.
Small eigenvalues correspond to "hard" directions—these require more data or will be regularized away.
Regularization $\lambda \|f\|^2_{\mathcal{H}}$ effectively truncates the spectrum: components with $\lambda_j < \lambda$ are suppressed.
Generalization bounds depend on effective dimension—kernels with rapidly decaying spectra generalize better.
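The truncation effect of regularization can be made concrete through the shrinkage factors $\lambda_j / (\lambda_j + \lambda)$ that kernel ridge regression applies along each eigendirection (a sketch using an illustrative, exponentially decaying spectrum):

```python
import numpy as np

# Illustrative exponentially decaying spectrum, lambda_j ~ exp(-j)
j = np.arange(1, 11)
eigs = np.exp(-j)

# Ridge regression shrinks the component along eigenvector j by
# the factor lambda_j / (lambda_j + lam): near 1 when lambda_j >> lam,
# near 0 when lambda_j << lam.
for lam in [1e-1, 1e-3]:
    factors = eigs / (eigs + lam)
    kept = np.sum(factors > 0.5)  # directions that survive regularization
    print(f"lam = {lam:g}: first factors {np.round(factors[:5], 3)}, "
          f"{kept} directions with factor > 0.5")
```

Lowering `lam` keeps more eigendirections active, trading smoothness for flexibility.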
This spectral view connects kernel methods to principal component analysis: both project data onto leading eigendirections, suppressing noise in trailing directions.
Mercer's theorem provides the rigorous mathematical foundation for kernel methods, ensuring that valid kernels correspond to inner products in well-defined feature spaces.
Core Results
• Mercer's theorem: a continuous, symmetric, p.s.d. kernel admits an eigenfunction expansion $k(\mathbf{x}, \mathbf{x}') = \sum_j \lambda_j \psi_j(\mathbf{x}) \psi_j(\mathbf{x}')$, which yields an explicit feature map.
• Equivalence: $k$ is a valid kernel if and only if it computes an inner product in some Hilbert space.
• Moore-Aronszajn: every p.s.d. kernel determines a unique RKHS, and every RKHS has a unique reproducing kernel.
• Eigenvalue decay governs the kernel's effective dimension, smoothness, and generalization behavior.
You now understand the mathematical foundations of valid kernels through Mercer's theorem. In the final page of this module, we will survey common kernels—the specific kernel functions used in practice and their properties.