In the previous page, we discovered a tantalizing possibility: computing inner products in feature space without explicitly constructing the feature vectors. This seemingly magical capability is made possible by kernel functions.
A kernel function is far more than a computational shortcut. It is a fundamental mathematical object that encapsulates our assumptions about similarity, structure, and the nature of functions we believe explain our data. Kernels are the bridge between the finite world of our observations and the infinite-dimensional spaces where linear methods become universally powerful.
In this page, we will develop a rigorous understanding of what kernel functions are, why they work, and how they transform machine learning algorithms.
By the end of this page, you will be able to:
• Define kernel functions mathematically and intuitively
• Verify whether a given function is a valid kernel
• Understand the relationship between kernels and feature maps
• Construct new kernels from existing ones using kernel algebra
• Appreciate why kernels are central to nonparametric machine learning
Let us begin with a precise mathematical definition.
Definition (Kernel Function)
A kernel function (or simply kernel) is a function $k: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ that can be expressed as an inner product in some feature space. That is, there exists a feature map $\phi: \mathcal{X} \rightarrow \mathcal{H}$ into a Hilbert space $\mathcal{H}$ such that:
$$k(\mathbf{x}, \mathbf{x}') = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle_{\mathcal{H}}$$
for all $\mathbf{x}, \mathbf{x}' \in \mathcal{X}$.
The space $\mathcal{X}$ is the input space (often $\mathbb{R}^d$, but can be any set). The space $\mathcal{H}$ is the feature space or reproducing kernel Hilbert space (RKHS) associated with the kernel.
Key Properties
From this definition, several properties follow immediately:
• Symmetry: $k(\mathbf{x}, \mathbf{x}') = k(\mathbf{x}', \mathbf{x})$, because inner products are symmetric.
• Non-negative diagonal: $k(\mathbf{x}, \mathbf{x}) = |\phi(\mathbf{x})|^2 \geq 0$.
• Cauchy–Schwarz: $k(\mathbf{x}, \mathbf{x}')^2 \leq k(\mathbf{x}, \mathbf{x}) \, k(\mathbf{x}', \mathbf{x}')$.
The Kernel as a Similarity Function
While the mathematical definition involves feature maps and inner products, the intuitive interpretation is simpler:
A kernel function measures similarity between inputs in a way that is compatible with linear methods.
When $k(\mathbf{x}, \mathbf{x}')$ is large, the points are "similar" according to the kernel. When it is small or zero, they are "dissimilar." Different kernels encode different notions of similarity—some care about angles, some about distances, some about local neighborhoods.
Why do we specifically require inner products? Because many machine learning algorithms—ridge regression, SVM, PCA—can be formulated entirely in terms of inner products between data points. If we replace every inner product $\langle \mathbf{x}_i, \mathbf{x}_j \rangle$ with a kernel evaluation $k(\mathbf{x}_i, \mathbf{x}_j)$, the algorithm implicitly operates in the feature space defined by the kernel.
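To make this concrete, here is a minimal sketch (illustrative code, not from any particular library) of kernelizing one such quantity: the squared distance in feature space, which follows from expanding $|\phi(\mathbf{x}) - \phi(\mathbf{x}')|^2 = k(\mathbf{x}, \mathbf{x}) - 2k(\mathbf{x}, \mathbf{x}') + k(\mathbf{x}', \mathbf{x}')$. The choice of the RBF kernel and the sample points is arbitrary.

```python
import numpy as np

def rbf(x, y, gamma=0.5):
    """RBF kernel, used here only as an example kernel (gamma chosen arbitrarily)."""
    return np.exp(-gamma * np.sum((x - y)**2))

def feature_space_sq_distance(x, y, k):
    """||phi(x) - phi(y)||^2 computed from kernel evaluations only."""
    return k(x, x) - 2 * k(x, y) + k(y, y)

x = np.array([1.0, 2.0])
y = np.array([0.0, -1.0])
# Distance in the (infinite-dimensional) RBF feature space, at O(d) cost:
print(feature_space_sq_distance(x, y, rbf))
```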
Given a set of $n$ data points $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$, the kernel matrix (also called the Gram matrix) $\mathbf{K}$ is the $n \times n$ matrix of all pairwise kernel evaluations:
$$\mathbf{K}_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$$
In matrix form:
$$\mathbf{K} = \begin{pmatrix} k(\mathbf{x}_1, \mathbf{x}_1) & k(\mathbf{x}_1, \mathbf{x}_2) & \cdots & k(\mathbf{x}_1, \mathbf{x}_n) \\ k(\mathbf{x}_2, \mathbf{x}_1) & k(\mathbf{x}_2, \mathbf{x}_2) & \cdots & k(\mathbf{x}_2, \mathbf{x}_n) \\ \vdots & \vdots & \ddots & \vdots \\ k(\mathbf{x}_n, \mathbf{x}_1) & k(\mathbf{x}_n, \mathbf{x}_2) & \cdots & k(\mathbf{x}_n, \mathbf{x}_n) \end{pmatrix}$$
Connection to Feature Space
If $\Phi$ is the $n \times D$ feature matrix with rows $\phi(\mathbf{x}_i)^\top$, then:
$$\mathbf{K} = \Phi \Phi^\top$$
The kernel matrix is the Gram matrix of the feature vectors. This relationship is fundamental: the kernel matrix encodes all the inner products we need, without requiring us to know $\Phi$ explicitly.
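As a quick sanity check (a minimal sketch on arbitrary random data), the linear kernel makes this identity transparent, because its feature map is the identity and therefore $\Phi = \mathbf{X}$:

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(5, 3)  # n = 5 points in d = 3 dimensions

# Gram matrix via the feature matrix (phi(x) = x for the linear kernel)
K_from_features = X @ X.T
# Gram matrix via entrywise kernel evaluations k(x_i, x_j) = x_i . x_j
K_entrywise = np.array([[xi @ xj for xj in X] for xi in X])

print(np.allclose(K_from_features, K_entrywise))  # True
```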
The kernel matrix inherits properties from its inner product structure:
• Symmetric: $\mathbf{K} = \mathbf{K}^\top$ (because $k$ is symmetric)
• Positive semi-definite: $\mathbf{z}^\top \mathbf{K} \mathbf{z} \geq 0$ for all $\mathbf{z} \in \mathbb{R}^n$
• Rank: $\text{rank}(\mathbf{K}) \leq \min(n, D)$, where $D$ is the feature dimension
The positive semi-definiteness is crucial—it characterizes valid kernels.
```python
import numpy as np

def compute_kernel_matrix(X, kernel_func):
    """
    Compute the kernel matrix for a dataset.

    Parameters:
    - X: (n_samples, n_features) input data
    - kernel_func: function(x_i, x_j) -> scalar

    Returns:
    - K: (n_samples, n_samples) kernel matrix
    """
    n = X.shape[0]
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):  # Exploit symmetry
            K[i, j] = kernel_func(X[i], X[j])
            K[j, i] = K[i, j]  # Symmetric
    return K

# Example kernels
def linear_kernel(x, y):
    """k(x, y) = x · y"""
    return np.dot(x, y)

def polynomial_kernel(x, y, degree=2, c=1):
    """k(x, y) = (x · y + c)^d"""
    return (np.dot(x, y) + c) ** degree

def rbf_kernel(x, y, gamma=1.0):
    """k(x, y) = exp(-γ ||x - y||²)"""
    return np.exp(-gamma * np.sum((x - y)**2))

# Generate sample data
np.random.seed(42)
X = np.random.randn(5, 3)  # 5 points in 3D

# Compute kernel matrices
K_linear = compute_kernel_matrix(X, linear_kernel)
K_poly = compute_kernel_matrix(X, lambda x, y: polynomial_kernel(x, y, 2, 1))
K_rbf = compute_kernel_matrix(X, lambda x, y: rbf_kernel(x, y, 0.5))

print("Linear Kernel Matrix:")
print(np.round(K_linear, 3))
print(f"\nEigenvalues: {np.round(np.linalg.eigvalsh(K_linear), 4)}")
print(f"Positive semi-definite: {np.all(np.linalg.eigvalsh(K_linear) >= -1e-10)}")

print("\nRBF Kernel Matrix (diagonal is always 1):")
print(np.round(K_rbf, 3))
```

The Gram Matrix in Algorithms
The kernel matrix $\mathbf{K}$ becomes the central data structure in kernel methods: algorithms receive $\mathbf{K}$ in place of the design matrix $\mathbf{X}$.
The computational complexity shifts from depending on feature dimension $D$ to depending on sample size $n$. When $D$ is very large or infinite, but $n$ is manageable, this is a massive win.
The power of the kernel trick lies in computing $k(\mathbf{x}, \mathbf{x}')$ more efficiently than explicitly computing $\langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle$. Let us examine this for several important kernels.
Linear Kernel
The simplest kernel is the linear kernel:
$$k_{\text{linear}}(\mathbf{x}, \mathbf{x}') = \mathbf{x}^\top \mathbf{x}'$$
Here, the feature map is the identity: $\phi(\mathbf{x}) = \mathbf{x}$. The kernel trivially equals the standard inner product.
Computational cost: $O(d)$ where $d$ is input dimension.
Polynomial Kernel
The polynomial kernel of degree $p$ is:
$$k_{\text{poly}}(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^\top \mathbf{x}' + c)^p$$
where $c \geq 0$ is a constant (often 0 or 1).
Computational cost: $O(d)$ — one inner product plus exponentiation.
Feature space dimension: $\binom{d + p}{p}$ — roughly $d^p / p!$ when $d \gg p$, i.e., it grows exponentially with the degree.
We compute the kernel in $O(d)$, but the implicit feature space has dimension exponential in the degree!
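To get a feel for how quickly the implicit dimension grows, here is a small sketch using the standard-library function `math.comb`; the particular values of $d$ and $p$ are arbitrary.

```python
from math import comb

# Implicit feature-space dimension binom(d + p, p) of the polynomial kernel
# for a few illustrative input dimensions d and degrees p.
for d in [10, 100]:
    for p in [2, 3, 5]:
        print(f"d={d:3d}, p={p}: feature dimension = {comb(d + p, p):,}")
```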
The polynomial kernel $(\mathbf{x}^\top \mathbf{x}' + c)^p$ expands via the binomial/multinomial theorem into a sum of monomials. Each monomial corresponds to a component of the implicit feature vector. The remarkable fact is that the sum of products (inner product in feature space) equals a simple power of the original inner product.
Derivation: Degree-2 Polynomial Kernel
Let us verify this for $p = 2$ with $c = 0$:
$$k(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^\top \mathbf{x}')^2 = \left( \sum_{i=1}^d x_i x'_i \right)^2$$
Expanding:
$$= \sum_{i=1}^d \sum_{j=1}^d x_i x_j x'_i x'_j = \sum_{i=1}^d \sum_{j=1}^d (x_i x_j)(x'_i x'_j)$$
This is an inner product between vectors: $$\phi(\mathbf{x}) = (x_1 x_1, x_1 x_2, \ldots, x_1 x_d, x_2 x_1, \ldots, x_d x_d)$$
The feature map produces all products $x_i x_j$ for $i, j \in \{1, \ldots, d\}$, giving $d^2$ features. (With appropriate scaling, we can obtain the symmetric form with $\binom{d+1}{2}$ unique features.)
```python
import numpy as np

def explicit_degree2_features(x):
    """
    Explicit feature map for (x·x')² kernel.
    Returns all products x_i * x_j.
    """
    d = len(x)
    phi = []
    for i in range(d):
        for j in range(d):
            phi.append(x[i] * x[j])
    return np.array(phi)

def polynomial_kernel_d2(x, y):
    """Compute (x · y)² directly"""
    return np.dot(x, y) ** 2

# Test: verify kernel equals feature inner product
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

# Method 1: Explicit features
phi_x = explicit_degree2_features(x)
phi_y = explicit_degree2_features(y)
inner_product_explicit = np.dot(phi_x, phi_y)

# Method 2: Kernel trick
kernel_value = polynomial_kernel_d2(x, y)

print(f"Dimension of x: {len(x)}")
print(f"Dimension of φ(x): {len(phi_x)}")
print(f"")
print(f"x · y = {np.dot(x, y)}")
print(f"(x · y)² = {kernel_value}")
print(f"")
print(f"<φ(x), φ(y)> = {inner_product_explicit}")
print(f"")
print(f"Equal? {np.allclose(kernel_value, inner_product_explicit)}")

# The kernel computation is O(d) = O(3)
# The explicit computation is O(d²) = O(9)
# For degree p, it would be O(d^p) vs O(d)!
```

Gaussian (RBF) Kernel
The Gaussian or Radial Basis Function (RBF) kernel is:
$$k_{\text{RBF}}(\mathbf{x}, \mathbf{x}') = \exp\left( -\gamma |\mathbf{x} - \mathbf{x}'|^2 \right) = \exp\left( -\frac{|\mathbf{x} - \mathbf{x}'|^2}{2\sigma^2} \right)$$
where $\gamma = \frac{1}{2\sigma^2}$ is the bandwidth parameter: larger $\gamma$ (smaller $\sigma$) concentrates similarity on points that are close together.
Computational cost: $O(d)$ — compute squared distance and exponentiate.
Feature space dimension: $\infty$!
The Gaussian kernel corresponds to an infinite-dimensional feature space. We can verify this through the Taylor expansion of the exponential, but the key point is: we compute finite-cost operations on infinite-dimensional inner products.
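The sketch below makes this concrete in one dimension: truncating the Taylor expansion of $\exp(2\gamma x x')$ yields an explicit feature map $\phi_n(x) = e^{-\gamma x^2} \sqrt{(2\gamma)^n / n!}\, x^n$ whose inner product converges to the RBF kernel value as more terms are kept. This particular feature map and the sample points are illustrative assumptions, not a construction used elsewhere on this page.

```python
import numpy as np
from math import factorial

def rbf_taylor_features(x, gamma=1.0, order=20):
    """Truncated feature map for the 1-D RBF kernel:
    phi_n(x) = exp(-gamma * x^2) * sqrt((2*gamma)^n / n!) * x^n, n = 0..order-1."""
    return np.array([
        np.exp(-gamma * x**2) * np.sqrt((2 * gamma)**n / factorial(n)) * x**n
        for n in range(order)
    ])

x, x_prime, gamma = 0.7, -0.3, 1.0
exact = np.exp(-gamma * (x - x_prime)**2)
approx = rbf_taylor_features(x, gamma) @ rbf_taylor_features(x_prime, gamma)
print(exact, approx)  # the truncated inner product matches the kernel closely
```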
| Kernel | Kernel Cost | Feature Dimension D | Explicit Cost |
|---|---|---|---|
| Linear | O(d) | d | O(d) |
| Polynomial (degree p) | O(d) | $\binom{d+p}{p}$ | O(D) |
| Polynomial (degree 3, d=100) | O(100) | 176,851 | O(176,851) |
| RBF/Gaussian | O(d) | ∞ | Impossible |
Not every function $k(\mathbf{x}, \mathbf{x}'): \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ is a valid kernel. The defining characteristic of a valid kernel is positive semi-definiteness.
Definition (Positive Semi-Definite Kernel)
A symmetric function $k: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ is positive semi-definite (p.s.d.) if for any finite set of points $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\} \subset \mathcal{X}$ and any vector $\mathbf{c} = (c_1, \ldots, c_n)^\top \in \mathbb{R}^n$:
$$\sum_{i=1}^n \sum_{j=1}^n c_i c_j k(\mathbf{x}_i, \mathbf{x}_j) \geq 0$$
Equivalently, the kernel matrix $\mathbf{K}$ for any finite subset of $\mathcal{X}$ must be positive semi-definite: $\mathbf{c}^\top \mathbf{K} \mathbf{c} \geq 0$ for all $\mathbf{c}$.
Why This Condition?
The p.s.d. condition is precisely what makes a function behave like an inner product. Consider:
$$\sum_{i,j} c_i c_j \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle = \left\langle \sum_i c_i \phi(\mathbf{x}_i), \sum_j c_j \phi(\mathbf{x}_j) \right\rangle = \left| \sum_i c_i \phi(\mathbf{x}_i) \right|^2 \geq 0$$
Any kernel derived from a feature map automatically satisfies p.s.d. The remarkable converse—that every p.s.d. function corresponds to some feature map—is Mercer's theorem, which we explore in the next page.
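The sketch below checks this chain of equalities numerically for a kernel whose feature map we can write down explicitly (the degree-2 polynomial features from earlier); the random data and coefficient vector are arbitrary.

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(6, 3)   # arbitrary data
c = np.random.randn(6)      # arbitrary coefficients

phi = lambda v: np.outer(v, v).ravel()   # explicit feature map for k(x, y) = (x . y)^2
Phi = np.array([phi(x) for x in X])      # n x D feature matrix
K = Phi @ Phi.T                          # Gram matrix

quad_form = c @ K @ c                    # sum_ij c_i c_j k(x_i, x_j)
norm_sq = np.sum((Phi.T @ c)**2)         # || sum_i c_i phi(x_i) ||^2
print(np.isclose(quad_form, norm_sq), quad_form >= 0)  # True True
```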
Many natural-seeming similarity functions fail the p.s.d. condition:
• $k(\mathbf{x}, \mathbf{x}') = -|\mathbf{x} - \mathbf{x}'|$ (negative distance)
• $k(\mathbf{x}, \mathbf{x}') = \cos(|\mathbf{x} - \mathbf{x}'|)$ (oscillates with distance)
• Some graph similarity measures
Using non-p.s.d. functions as "kernels" can cause algorithms to fail (e.g., optimization becomes non-convex, covariance matrices become invalid).
```python
import numpy as np

def is_positive_semidefinite(K, tol=1e-10):
    """
    Check if a matrix is positive semi-definite.
    A matrix is p.s.d. if all eigenvalues are non-negative.
    """
    eigenvalues = np.linalg.eigvalsh(K)  # Real eigenvalues (K is symmetric)
    min_eigenvalue = np.min(eigenvalues)
    return min_eigenvalue >= -tol, min_eigenvalue, eigenvalues

# Generate sample data
np.random.seed(42)
X = np.random.randn(10, 3)

# Valid kernel: RBF
def build_rbf_kernel(X, gamma=1.0):
    n = X.shape[0]
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = np.exp(-gamma * np.sum((X[i] - X[j])**2))
    return K

# Invalid "kernel": negative distance
def build_neg_distance_matrix(X):
    n = X.shape[0]
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = -np.linalg.norm(X[i] - X[j])
    return K

K_rbf = build_rbf_kernel(X)
K_bad = build_neg_distance_matrix(X)

is_psd_rbf, min_eig_rbf, _ = is_positive_semidefinite(K_rbf)
is_psd_bad, min_eig_bad, eigs_bad = is_positive_semidefinite(K_bad)

print("RBF Kernel Matrix:")
print(f"  Is p.s.d.? {is_psd_rbf}")
print(f"  Min eigenvalue: {min_eig_rbf:.6f}")

print("\nNegative Distance 'Kernel':")
print(f"  Is p.s.d.? {is_psd_bad}")
print(f"  Min eigenvalue: {min_eig_bad:.4f}")
print(f"  Eigenvalues: {np.round(eigs_bad, 3)}")

# The negative distance matrix has negative eigenvalues!
# Using it as a kernel would cause algorithms to fail.
```

Practical Implications of PSD Failures
When an algorithm assumes a p.s.d. kernel matrix but receives one that isn't:
• Optimization problems that rely on convexity (such as the SVM dual) become non-convex, and solvers may diverge or return meaningless solutions.
• Matrix factorizations that require p.s.d. structure (e.g., Cholesky) fail.
• Quantities that should be non-negative, such as predictive variances in Gaussian process models, can come out negative.
Always verify that your kernel is p.s.d. before using it in algorithms that assume this property.
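If a kernel matrix does fail the check, one common practical remedy is to project it onto the p.s.d. cone by clipping negative eigenvalues. The sketch below is only illustrative; `project_to_psd` is a hypothetical helper, not a function of any specific library.

```python
import numpy as np

def project_to_psd(K, floor=0.0):
    """Hypothetical helper: clip negative eigenvalues to obtain the nearest
    (in Frobenius norm) symmetric p.s.d. matrix."""
    K_sym = (K + K.T) / 2                    # enforce symmetry first
    eigvals, eigvecs = np.linalg.eigh(K_sym)
    eigvals = np.clip(eigvals, floor, None)  # remove negative eigenvalues
    return eigvecs @ np.diag(eigvals) @ eigvecs.T

# Repair the negative-distance matrix from the previous snippet
np.random.seed(42)
X = np.random.randn(10, 3)
K_bad = -np.array([[np.linalg.norm(a - b) for b in X] for a in X])
K_fixed = project_to_psd(K_bad)
print(np.min(np.linalg.eigvalsh(K_fixed)) >= -1e-10)  # True: now p.s.d.
```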
One of the most powerful aspects of kernels is that they can be combined to create new kernels. The set of p.s.d. kernels is closed under several operations, allowing us to construct complex kernels from simple building blocks.
Kernel Closure Properties
If $k_1$ and $k_2$ are valid (p.s.d.) kernels on $\mathcal{X}$, then the following are also valid kernels:
• $c \, k_1(\mathbf{x}, \mathbf{x}')$ for any constant $c \geq 0$
• $k_1(\mathbf{x}, \mathbf{x}') + k_2(\mathbf{x}, \mathbf{x}')$
• $k_1(\mathbf{x}, \mathbf{x}') \cdot k_2(\mathbf{x}, \mathbf{x}')$
• $f(\mathbf{x}) \, k_1(\mathbf{x}, \mathbf{x}') \, f(\mathbf{x}')$ for any function $f: \mathcal{X} \rightarrow \mathbb{R}$
• $p(k_1(\mathbf{x}, \mathbf{x}'))$ for any polynomial $p$ with non-negative coefficients
• $\exp(k_1(\mathbf{x}, \mathbf{x}'))$
Each algebraic operation has a feature space interpretation (the first two are verified numerically in the sketch below):
• Addition: concatenates feature spaces: $\phi(\mathbf{x}) = [\phi_1(\mathbf{x}); \phi_2(\mathbf{x})]$
• Product: takes the tensor product of features: $\phi(\mathbf{x}) = \phi_1(\mathbf{x}) \otimes \phi_2(\mathbf{x})$
• Polynomial/Exponential: implicitly creates higher-order feature interactions
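Here is a small sketch that verifies the addition and product interpretations for kernels whose feature maps are explicit (the linear kernel and the degree-2 polynomial kernel with $c = 0$); the specific vectors are arbitrary.

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

k1 = x @ y           # linear kernel, feature map phi1(v) = v
k2 = (x @ y) ** 2    # degree-2 polynomial kernel (c = 0), phi2(v) = all products v_i v_j

phi1 = lambda v: v
phi2 = lambda v: np.outer(v, v).ravel()

# Addition <-> concatenation of feature vectors
phi_sum = lambda v: np.concatenate([phi1(v), phi2(v)])
print(np.isclose(k1 + k2, phi_sum(x) @ phi_sum(y)))    # True

# Product <-> tensor (outer) product of feature vectors
phi_prod = lambda v: np.outer(phi1(v), phi2(v)).ravel()
print(np.isclose(k1 * k2, phi_prod(x) @ phi_prod(y)))  # True
```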
Deriving the RBF Kernel
Kernel algebra provides a beautiful derivation of the Gaussian RBF kernel from the linear kernel. Starting with the squared exponential:
$$k_{\text{RBF}}(\mathbf{x}, \mathbf{x}') = \exp\left( -\gamma |\mathbf{x} - \mathbf{x}'|^2 \right)$$
Expanding the squared norm:
$$= \exp\left( -\gamma (|\mathbf{x}|^2 - 2\mathbf{x}^\top\mathbf{x}' + |\mathbf{x}'|^2) \right)$$
$$= \exp(-\gamma |\mathbf{x}|^2) \cdot \exp(2\gamma \mathbf{x}^\top\mathbf{x}') \cdot \exp(-\gamma |\mathbf{x}'|^2)$$
The middle term, $\exp(2\gamma \mathbf{x}^\top\mathbf{x}')$, is an exponential of the linear kernel—valid by the exponential rule. The outer terms are constant factors for each point (they can be absorbed into a normalized feature map).
Thus, the RBF kernel inherits its validity from the linear kernel through the exponential operation!
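A quick numerical check of this factorization, with an arbitrary $\gamma$ and arbitrary vectors (a sketch, not part of the derivation itself):

```python
import numpy as np

gamma = 0.5
x = np.array([1.0, -2.0, 0.5])
y = np.array([0.3, 0.7, -1.2])

# exp(-gamma ||x - y||^2) = f(x) * exp(2 gamma x.y) * f(y), with f(v) = exp(-gamma ||v||^2)
f = lambda v: np.exp(-gamma * (v @ v))
lhs = np.exp(-gamma * np.sum((x - y)**2))
rhs = f(x) * np.exp(2 * gamma * (x @ y)) * f(y)
print(np.isclose(lhs, rhs))  # True
```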
```python
import numpy as np

# Base kernels
def linear_kernel(x, y):
    return np.dot(x, y)

def poly_kernel(x, y, d=2, c=1):
    return (np.dot(x, y) + c) ** d

def rbf_kernel(x, y, gamma=1.0):
    return np.exp(-gamma * np.sum((x - y)**2))

# Kernel algebra: combining kernels
def kernel_sum(k1, k2):
    """k(x,y) = k1(x,y) + k2(x,y)"""
    return lambda x, y: k1(x, y) + k2(x, y)

def kernel_product(k1, k2):
    """k(x,y) = k1(x,y) * k2(x,y)"""
    return lambda x, y: k1(x, y) * k2(x, y)

def kernel_scale(k, c):
    """k(x,y) = c * k(x,y)"""
    return lambda x, y: c * k(x, y)

# Example: Create a hybrid kernel
# k(x,y) = 0.5 * rbf(x,y) + 0.3 * poly(x,y) + 0.2 * linear(x,y)
hybrid = kernel_sum(
    kernel_scale(rbf_kernel, 0.5),
    kernel_sum(
        kernel_scale(lambda x, y: poly_kernel(x, y, 2, 1), 0.3),
        kernel_scale(linear_kernel, 0.2)
    )
)

# Verify on sample data
np.random.seed(42)
X = np.random.randn(5, 3)

# Build hybrid kernel matrix
n = X.shape[0]
K_hybrid = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        K_hybrid[i, j] = hybrid(X[i], X[j])

# Verify p.s.d.
eigenvalues = np.linalg.eigvalsh(K_hybrid)
print("Hybrid Kernel Matrix:")
print(np.round(K_hybrid, 4))
print(f"\nEigenvalues: {np.round(eigenvalues, 4)}")
print(f"All eigenvalues ≥ 0? {np.all(eigenvalues >= -1e-10)}")
print("\nThe hybrid kernel is valid (p.s.d.)!")
```

The Representer Theorem is the mathematical result that justifies kernel methods. It tells us that even when optimizing over infinite-dimensional function spaces, the optimal solution can always be expressed in terms of the training data.
Statement (Simplified)
Consider minimizing a regularized empirical risk over a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$:
$$\min_{f \in \mathcal{H}} \left[ \sum_{i=1}^n L(y_i, f(\mathbf{x}_i)) + \lambda |f|_{\mathcal{H}}^2 \right]$$
where $L$ is any loss function and $|f|_{\mathcal{H}}$ is the RKHS norm. Then the minimizer has the form:
$$f^*(\mathbf{x}) = \sum_{i=1}^n \alpha_i k(\mathbf{x}_i, \mathbf{x})$$
for some coefficients $\alpha_1, \ldots, \alpha_n \in \mathbb{R}$.
Implications
This is remarkable: the optimal function, which could in principle be any element of an infinite-dimensional space, is actually a finite linear combination of the kernel evaluated at training points.
The Representer Theorem converts an infinite-dimensional optimization problem into a finite-dimensional one. Instead of searching over all possible functions in the RKHS, we only need to find $n$ coefficients $\alpha_1, \ldots, \alpha_n$. This makes kernel methods computationally tractable.
Intuition for the Result
Why should the optimal function be expressible in terms of training points? Consider these observations:
Function values at training points determine the loss: The empirical risk only depends on $f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n)$.
The regularizer penalizes "unnecessary complexity": Adding components orthogonal to the training data increases $|f|_{\mathcal{H}}^2$ without improving the loss.
The optimal balance: The minimizer uses exactly the components needed to fit the training data, without extra complexity.
Formally, we can decompose any function $f \in \mathcal{H}$ as: $$f = f_\parallel + f_\perp$$ where $f_\parallel$ lies in the span of $\{k(\cdot, \mathbf{x}_i)\}_{i=1}^n$ and $f_\perp$ is orthogonal to this span.
Since $f_\perp(\mathbf{x}_i) = \langle f_\perp, k(\cdot, \mathbf{x}_i) \rangle_{\mathcal{H}} = 0$ for all training points (by orthogonality and the reproducing property), $f_\perp$ doesn't affect the loss. But $|f|^2 = |f_\parallel|^2 + |f_\perp|^2 \geq |f_\parallel|^2$, so the regularizer is strictly worse when $f_\perp \neq 0$.
Therefore, the optimal $f^*$ has $f^*_\perp = 0$, meaning $f^* = f^*_\parallel \in \text{span}\{k(\cdot, \mathbf{x}_i)\}_{i=1}^n$.
```python
import numpy as np

def kernel_ridge_regression(K, y, lambda_reg):
    """
    Solve kernel ridge regression.

    By the Representer Theorem, the solution is:
        f(x) = sum_i alpha_i * k(x_i, x)
    where alpha = (K + λI)^{-1} y

    Parameters:
    - K: (n, n) kernel matrix
    - y: (n,) targets
    - lambda_reg: regularization parameter

    Returns:
    - alpha: (n,) coefficients
    """
    n = K.shape[0]
    alpha = np.linalg.solve(K + lambda_reg * np.eye(n), y)
    return alpha

def predict(K_test, alpha):
    """
    Make predictions using learned coefficients.

    K_test[i, j] = k(x_test_i, x_train_j)
    prediction[i] = sum_j alpha[j] * k(x_test_i, x_train_j)
                  = K_test[i, :] @ alpha
    """
    return K_test @ alpha

# Example: nonlinear regression with RBF kernel
np.random.seed(42)

# Training data
X_train = np.linspace(0, 4*np.pi, 20).reshape(-1, 1)
y_train = np.sin(X_train.ravel()) + 0.1 * np.random.randn(20)

# Test data
X_test = np.linspace(0, 4*np.pi, 100).reshape(-1, 1)
y_test_true = np.sin(X_test.ravel())

# Compute kernel matrices
gamma = 1.0

def rbf_kernel_matrix(X1, X2, gamma):
    K = np.zeros((X1.shape[0], X2.shape[0]))
    for i in range(X1.shape[0]):
        for j in range(X2.shape[0]):
            K[i, j] = np.exp(-gamma * np.sum((X1[i] - X2[j])**2))
    return K

K_train = rbf_kernel_matrix(X_train, X_train, gamma)
K_test = rbf_kernel_matrix(X_test, X_train, gamma)

# Solve kernel ridge regression
lambda_reg = 0.01
alpha = kernel_ridge_regression(K_train, y_train, lambda_reg)

# Predictions
y_pred = predict(K_test, alpha)

print(f"Number of training points (n): {len(X_train)}")
print(f"Number of coefficients (alpha): {len(alpha)}")
print(f"")
print(f"By the Representer Theorem, f(x) = Σᵢ αᵢ k(xᵢ, x)")
print(f"")
print(f"Test MSE: {np.mean((y_pred - y_test_true)**2):.6f}")
```

Beyond the formal mathematics, kernels can be understood intuitively as similarity functions with special properties. This perspective is valuable for designing kernels and understanding what assumptions they encode.
The Similarity Interpretation
When we choose a kernel $k$, we are implicitly stating: "I believe two inputs $\mathbf{x}$ and $\mathbf{x}'$ should contribute similarly to predictions if $k(\mathbf{x}, \mathbf{x}')$ is large."
Different kernels encode different notions of "similar":
| Kernel | What Makes x and x' Similar? |
|---|---|
| Linear: $\mathbf{x}^\top \mathbf{x}'$ | Correlated components, similar direction in feature space |
| Polynomial: $(\mathbf{x}^\top \mathbf{x}' + c)^p$ | Similar direction + matching interaction patterns |
| RBF: $\exp(-\gamma|\mathbf{x}-\mathbf{x}'|^2)$ | Close in Euclidean distance (locality) |
| Laplacian: $\exp(-\gamma|\mathbf{x}-\mathbf{x}'|_1)$ | Close in Manhattan distance (more robust to outliers) |
| Sigmoid: $\tanh(a \mathbf{x}^\top \mathbf{x}' + c)$ | Neural network-like response patterns |
The RBF Kernel's Locality
The RBF kernel $k(\mathbf{x}, \mathbf{x}') = \exp(-\gamma|\mathbf{x}-\mathbf{x}'|^2)$ has a particularly intuitive interpretation: it equals $1$ when $\mathbf{x} = \mathbf{x}'$ and decays smoothly toward $0$ as the points move apart, at a rate controlled by $\gamma$.
This local behavior makes RBF kernels excellent for interpolating smooth functions: predictions at a point $\mathbf{x}$ are dominated by nearby training points.
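The decay is easy to see numerically; this small sketch prints RBF similarity as a function of distance for a few illustrative values of $\gamma$:

```python
import numpy as np

# RBF similarity exp(-gamma * r^2) at several distances r, for a few gammas
distances = np.array([0.0, 0.5, 1.0, 2.0, 4.0])
for gamma in [0.1, 1.0, 10.0]:
    print(f"gamma={gamma:5.1f}:", np.round(np.exp(-gamma * distances**2), 4))
```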
Choosing a kernel is analogous to choosing a prior in Bayesian inference. The kernel encodes our beliefs about the function space—how smooth functions should be, whether periodicity is expected, whether interactions matter. Poor kernel choice leads to poor generalization, just as a misspecified prior leads to poor posterior inference.
Normalized Kernels
For interpretation as pure similarity (not scaled by point "magnitudes"), we often use normalized kernels:
$$\tilde{k}(\mathbf{x}, \mathbf{x}') = \frac{k(\mathbf{x}, \mathbf{x}')}{\sqrt{k(\mathbf{x}, \mathbf{x}) \cdot k(\mathbf{x}', \mathbf{x}')}}$$
The normalized kernel satisfies:
• $\tilde{k}(\mathbf{x}, \mathbf{x}) = 1$ for all $\mathbf{x}$
• $|\tilde{k}(\mathbf{x}, \mathbf{x}')| \leq 1$ (by the Cauchy–Schwarz inequality)
• It is still a valid p.s.d. kernel, since it has the form $f(\mathbf{x})\, k(\mathbf{x}, \mathbf{x}')\, f(\mathbf{x}')$ with $f(\mathbf{x}) = 1/\sqrt{k(\mathbf{x}, \mathbf{x})}$
The RBF kernel is already normalized: $k(\mathbf{x}, \mathbf{x}) = 1$ for all $\mathbf{x}$. The linear kernel is not, and its normalized version is exactly cosine similarity.
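A short check of this fact, with arbitrary vectors (a sketch):

```python
import numpy as np

x = np.array([3.0, 1.0, -2.0])
y = np.array([0.5, 4.0, 1.0])

k = lambda a, b: a @ b                               # linear kernel
k_normalized = k(x, y) / np.sqrt(k(x, x) * k(y, y))  # normalized linear kernel
cosine = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(np.isclose(k_normalized, cosine))  # True
```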
We have developed a comprehensive understanding of kernel functions—the mathematical objects that make the kernel trick possible.
Core Concepts
• A kernel is a function $k(\mathbf{x}, \mathbf{x}')$ that equals an inner product $\langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle_{\mathcal{H}}$ in some feature space, and it can be evaluated without constructing $\phi$ explicitly.
• The kernel (Gram) matrix collects all pairwise evaluations and is always symmetric and positive semi-definite.
• Positive semi-definiteness is exactly the condition that characterizes valid kernels.
• Valid kernels can be combined (sums, products, scaling, exponentials) to build new kernels.
• The Representer Theorem guarantees that the optimal function is a finite combination $\sum_i \alpha_i k(\mathbf{x}_i, \cdot)$ of kernels centered at the training points.
You now have a solid foundation in kernel functions. In the next page, we will explore the computational advantage of kernels—how they enable algorithms to work efficiently in very high or infinite-dimensional feature spaces.