In the previous page, we discovered a tantalizing possibility: computing inner products in feature space without explicitly constructing the feature vectors. This seemingly magical capability is made possible by kernel functions.
A kernel function is far more than a computational shortcut. It is a fundamental mathematical object that encapsulates our assumptions about similarity, structure, and the nature of functions we believe explain our data. Kernels are the bridge between the finite world of our observations and the infinite-dimensional spaces where linear methods become universally powerful.
In this page, we will develop a rigorous understanding of what kernel functions are, why they work, and how they transform machine learning algorithms.
By the end of this page, you will be able to:
• Define kernel functions mathematically and intuitively
• Verify whether a given function is a valid kernel
• Understand the relationship between kernels and feature maps
• Construct new kernels from existing ones using kernel algebra
• Appreciate why kernels are central to nonparametric machine learning
Let us begin with a precise mathematical definition.
Definition (Kernel Function)
A kernel function (or simply kernel) is a function $k: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ that can be expressed as an inner product in some feature space. That is, there exists a feature map $\phi: \mathcal{X} \rightarrow \mathcal{H}$ into a Hilbert space $\mathcal{H}$ such that:
$$k(\mathbf{x}, \mathbf{x}') = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle_{\mathcal{H}}$$
for all $\mathbf{x}, \mathbf{x}' \in \mathcal{X}$.
The space $\mathcal{X}$ is the input space (often $\mathbb{R}^d$, but can be any set). The space $\mathcal{H}$ is the feature space or reproducing kernel Hilbert space (RKHS) associated with the kernel.
Key Properties
From this definition, several properties follow immediately:
• Symmetry: $k(\mathbf{x}, \mathbf{x}') = k(\mathbf{x}', \mathbf{x})$, because inner products are symmetric.
• Non-negative diagonal: $k(\mathbf{x}, \mathbf{x}) = |\phi(\mathbf{x})|^2 \geq 0$.
• Cauchy–Schwarz: $k(\mathbf{x}, \mathbf{x}')^2 \leq k(\mathbf{x}, \mathbf{x}) \, k(\mathbf{x}', \mathbf{x}')$.
The Kernel as a Similarity Function
While the mathematical definition involves feature maps and inner products, the intuitive interpretation is simpler:
A kernel function measures similarity between inputs in a way that is compatible with linear methods.
When $k(\mathbf{x}, \mathbf{x}')$ is large, the points are "similar" according to the kernel. When it is small or zero, they are "dissimilar." Different kernels encode different notions of similarity—some care about angles, some about distances, some about local neighborhoods.
Why do we specifically require inner products? Because many machine learning algorithms—ridge regression, SVM, PCA—can be formulated entirely in terms of inner products between data points. If we replace every inner product $\langle \mathbf{x}_i, \mathbf{x}_j \rangle$ with a kernel evaluation $k(\mathbf{x}_i, \mathbf{x}_j)$, the algorithm implicitly operates in the feature space defined by the kernel.
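To make this concrete, here is a minimal sketch (illustrative code, not from any particular library) of kernelizing one such quantity: the squared distance in feature space, which follows from expanding $|\phi(\mathbf{x}) - \phi(\mathbf{x}')|^2 = k(\mathbf{x}, \mathbf{x}) - 2k(\mathbf{x}, \mathbf{x}') + k(\mathbf{x}', \mathbf{x}')$. The choice of the RBF kernel and the sample points is arbitrary.

```python
import numpy as np

def rbf(x, y, gamma=0.5):
    """RBF kernel, used here only as an example kernel (gamma chosen arbitrarily)."""
    return np.exp(-gamma * np.sum((x - y)**2))

def feature_space_sq_distance(x, y, k):
    """||phi(x) - phi(y)||^2 computed from kernel evaluations only."""
    return k(x, x) - 2 * k(x, y) + k(y, y)

x = np.array([1.0, 2.0])
y = np.array([0.0, -1.0])
# Distance in the (infinite-dimensional) RBF feature space, at O(d) cost:
print(feature_space_sq_distance(x, y, rbf))
```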
Given a set of $n$ data points $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$, the kernel matrix (also called the Gram matrix) $\mathbf{K}$ is the $n \times n$ matrix of all pairwise kernel evaluations:
$$\mathbf{K}_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$$
In matrix form:
$$\mathbf{K} = \begin{pmatrix} k(\mathbf{x}_1, \mathbf{x}_1) & k(\mathbf{x}_1, \mathbf{x}_2) & \cdots & k(\mathbf{x}_1, \mathbf{x}_n) \\ k(\mathbf{x}_2, \mathbf{x}_1) & k(\mathbf{x}_2, \mathbf{x}_2) & \cdots & k(\mathbf{x}_2, \mathbf{x}_n) \\ \vdots & \vdots & \ddots & \vdots \\ k(\mathbf{x}_n, \mathbf{x}_1) & k(\mathbf{x}_n, \mathbf{x}_2) & \cdots & k(\mathbf{x}_n, \mathbf{x}_n) \end{pmatrix}$$
Connection to Feature Space
If $\Phi$ is the $n \times D$ feature matrix with rows $\phi(\mathbf{x}_i)^\top$, then:
$$\mathbf{K} = \Phi \Phi^\top$$
The kernel matrix is the Gram matrix of the feature vectors. This relationship is fundamental: the kernel matrix encodes all the inner products we need, without requiring us to know $\Phi$ explicitly.
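As a quick sanity check (a minimal sketch on arbitrary random data), the linear kernel makes this identity transparent, because its feature map is the identity and therefore $\Phi = \mathbf{X}$:

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(5, 3)  # n = 5 points in d = 3 dimensions

# Gram matrix via the feature matrix (phi(x) = x for the linear kernel)
K_from_features = X @ X.T
# Gram matrix via entrywise kernel evaluations k(x_i, x_j) = x_i . x_j
K_entrywise = np.array([[xi @ xj for xj in X] for xi in X])

print(np.allclose(K_from_features, K_entrywise))  # True
```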
The kernel matrix inherits properties from its inner product structure:
• Symmetric: $\mathbf{K} = \mathbf{K}^\top$ (because $k$ is symmetric)
• Positive semi-definite: $\mathbf{z}^\top \mathbf{K} \mathbf{z} \geq 0$ for all $\mathbf{z} \in \mathbb{R}^n$
• Rank: $\text{rank}(\mathbf{K}) \leq \min(n, D)$, where $D$ is the feature dimension
The positive semi-definiteness is crucial—it characterizes valid kernels.
```python
import numpy as np

def compute_kernel_matrix(X, kernel_func):
    """
    Compute the kernel matrix for a dataset.

    Parameters:
    - X: (n_samples, n_features) input data
    - kernel_func: function(x_i, x_j) -> scalar

    Returns:
    - K: (n_samples, n_samples) kernel matrix
    """
    n = X.shape[0]
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):  # Exploit symmetry
            K[i, j] = kernel_func(X[i], X[j])
            K[j, i] = K[i, j]  # Symmetric
    return K

# Example kernels
def linear_kernel(x, y):
    """k(x, y) = x · y"""
    return np.dot(x, y)

def polynomial_kernel(x, y, degree=2, c=1):
    """k(x, y) = (x · y + c)^d"""
    return (np.dot(x, y) + c) ** degree

def rbf_kernel(x, y, gamma=1.0):
    """k(x, y) = exp(-γ ||x - y||²)"""
    return np.exp(-gamma * np.sum((x - y)**2))

# Generate sample data
np.random.seed(42)
X = np.random.randn(5, 3)  # 5 points in 3D

# Compute kernel matrices
K_linear = compute_kernel_matrix(X, linear_kernel)
K_poly = compute_kernel_matrix(X, lambda x, y: polynomial_kernel(x, y, 2, 1))
K_rbf = compute_kernel_matrix(X, lambda x, y: rbf_kernel(x, y, 0.5))

print("Linear Kernel Matrix:")
print(np.round(K_linear, 3))
print(f"\nEigenvalues: {np.round(np.linalg.eigvalsh(K_linear), 4)}")
print(f"Positive semi-definite: {np.all(np.linalg.eigvalsh(K_linear) >= -1e-10)}")

print("\nRBF Kernel Matrix (diagonal is always 1):")
print(np.round(K_rbf, 3))
```

The Gram Matrix in Algorithms
The kernel matrix $\mathbf{K}$ becomes the central data structure in kernel methods: algorithms receive $\mathbf{K}$ in place of the design matrix $\mathbf{X}$.
The computational complexity shifts from depending on feature dimension $D$ to depending on sample size $n$. When $D$ is very large or infinite, but $n$ is manageable, this is a massive win.
The power of the kernel trick lies in computing $k(\mathbf{x}, \mathbf{x}')$ more efficiently than explicitly computing $\langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle$. Let us examine this for several important kernels.
Linear Kernel
The simplest kernel is the linear kernel:
$$k_{\text{linear}}(\mathbf{x}, \mathbf{x}') = \mathbf{x}^\top \mathbf{x}'$$
Here, the feature map is the identity: $\phi(\mathbf{x}) = \mathbf{x}$. The kernel trivially equals the standard inner product.
Computational cost: $O(d)$ where $d$ is input dimension.
Polynomial Kernel
The polynomial kernel of degree $p$ is:
$$k_{\text{poly}}(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^\top \mathbf{x}' + c)^p$$
where $c \geq 0$ is a constant (often 0 or 1).
Computational cost: $O(d)$ — one inner product plus exponentiation.
Feature space dimension: $\binom{d + p}{p}$ — roughly $d^p / p!$ when $d \gg p$, i.e., it grows exponentially with the degree.
We compute the kernel in $O(d)$, but the implicit feature space has dimension exponential in the degree!
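To get a feel for how quickly the implicit dimension grows, here is a small sketch using the standard-library function `math.comb`; the particular values of $d$ and $p$ are arbitrary.

```python
from math import comb

# Implicit feature-space dimension binom(d + p, p) of the polynomial kernel
# for a few illustrative input dimensions d and degrees p.
for d in [10, 100]:
    for p in [2, 3, 5]:
        print(f"d={d:3d}, p={p}: feature dimension = {comb(d + p, p):,}")
```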
The polynomial kernel $(\mathbf{x}^\top \mathbf{x}' + c)^p$ expands via the binomial/multinomial theorem into a sum of monomials. Each monomial corresponds to a component of the implicit feature vector. The remarkable fact is that the sum of products (inner product in feature space) equals a simple power of the original inner product.
Derivation: Degree-2 Polynomial Kernel
Let us verify this for $p = 2$ with $c = 0$:
$$k(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^\top \mathbf{x}')^2 = \left( \sum_{i=1}^d x_i x'_i \right)^2$$
Expanding:
$$= \sum_{i=1}^d \sum_{j=1}^d x_i x_j x'_i x'_j = \sum_{i=1}^d \sum_{j=1}^d (x_i x_j)(x'_i x'_j)$$
This is an inner product between vectors: $$\phi(\mathbf{x}) = (x_1 x_1, x_1 x_2, \ldots, x_1 x_d, x_2 x_1, \ldots, x_d x_d)$$
The feature map produces all products $x_i x_j$ for $i, j \in \{1, \ldots, d\}$, giving $d^2$ features. (With appropriate scaling, we can obtain the symmetric form with $\binom{d+1}{2}$ unique features.)
```python
import numpy as np

def explicit_degree2_features(x):
    """
    Explicit feature map for (x·x')² kernel.
    Returns all products x_i * x_j.
    """
    d = len(x)
    phi = []
    for i in range(d):
        for j in range(d):
            phi.append(x[i] * x[j])
    return np.array(phi)

def polynomial_kernel_d2(x, y):
    """Compute (x · y)² directly"""
    return np.dot(x, y) ** 2

# Test: verify kernel equals feature inner product
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

# Method 1: Explicit features
phi_x = explicit_degree2_features(x)
phi_y = explicit_degree2_features(y)
inner_product_explicit = np.dot(phi_x, phi_y)

# Method 2: Kernel trick
kernel_value = polynomial_kernel_d2(x, y)

print(f"Dimension of x: {len(x)}")
print(f"Dimension of φ(x): {len(phi_x)}")
print(f"")
print(f"x · y = {np.dot(x, y)}")
print(f"(x · y)² = {kernel_value}")
print(f"")
print(f"<φ(x), φ(y)> = {inner_product_explicit}")
print(f"")
print(f"Equal? {np.allclose(kernel_value, inner_product_explicit)}")

# The kernel computation is O(d) = O(3)
# The explicit computation is O(d²) = O(9)
# For degree p, it would be O(d^p) vs O(d)!
```

Gaussian (RBF) Kernel
The Gaussian or Radial Basis Function (RBF) kernel is:
$$k_{\text{RBF}}(\mathbf{x}, \mathbf{x}') = \exp\left( -\gamma |\mathbf{x} - \mathbf{x}'|^2 \right) = \exp\left( -\frac{|\mathbf{x} - \mathbf{x}'|^2}{2\sigma^2} \right)$$
where $\gamma = \frac{1}{2\sigma^2}$ is the bandwidth parameter: larger $\gamma$ (smaller $\sigma$) concentrates similarity on points that are close together.
Computational cost: $O(d)$ — compute squared distance and exponentiate.
Feature space dimension: $\infty$!
The Gaussian kernel corresponds to an infinite-dimensional feature space. We can verify this through the Taylor expansion of the exponential, but the key point is: we compute finite-cost operations on infinite-dimensional inner products.
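The sketch below makes this concrete in one dimension: truncating the Taylor expansion of $\exp(2\gamma x x')$ yields an explicit feature map $\phi_n(x) = e^{-\gamma x^2} \sqrt{(2\gamma)^n / n!}\, x^n$ whose inner product converges to the RBF kernel value as more terms are kept. This particular feature map and the sample points are illustrative assumptions, not a construction used elsewhere on this page.

```python
import numpy as np
from math import factorial

def rbf_taylor_features(x, gamma=1.0, order=20):
    """Truncated feature map for the 1-D RBF kernel:
    phi_n(x) = exp(-gamma * x^2) * sqrt((2*gamma)^n / n!) * x^n, n = 0..order-1."""
    return np.array([
        np.exp(-gamma * x**2) * np.sqrt((2 * gamma)**n / factorial(n)) * x**n
        for n in range(order)
    ])

x, x_prime, gamma = 0.7, -0.3, 1.0
exact = np.exp(-gamma * (x - x_prime)**2)
approx = rbf_taylor_features(x, gamma) @ rbf_taylor_features(x_prime, gamma)
print(exact, approx)  # the truncated inner product matches the kernel closely
```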
| Kernel | Kernel Cost | Feature Dimension D | Explicit Cost |
|---|---|---|---|
| Linear | O(d) | d | O(d) |
| Polynomial (degree p) | O(d) | $\binom{d+p}{p}$ | O(D) |
| Polynomial (degree 3, d=100) | O(100) | 176,851 | O(176,851) |
| RBF/Gaussian | O(d) | ∞ | Impossible |
Not every function $k(\mathbf{x}, \mathbf{x}'): \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ is a valid kernel. The defining characteristic of a valid kernel is positive semi-definiteness.
Definition (Positive Semi-Definite Kernel)
A symmetric function $k: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ is positive semi-definite (p.s.d.) if for any finite set of points $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\} \subset \mathcal{X}$ and any vector $\mathbf{c} = (c_1, \ldots, c_n)^\top \in \mathbb{R}^n$:
$$\sum_{i=1}^n \sum_{j=1}^n c_i c_j k(\mathbf{x}_i, \mathbf{x}_j) \geq 0$$
Equivalently, the kernel matrix $\mathbf{K}$ for any finite subset of $\mathcal{X}$ must be positive semi-definite: $\mathbf{c}^\top \mathbf{K} \mathbf{c} \geq 0$ for all $\mathbf{c}$.
Why This Condition?
The p.s.d. condition is precisely what makes a function behave like an inner product. Consider:
$$\sum_{i,j} c_i c_j \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle = \left\langle \sum_i c_i \phi(\mathbf{x}_i), \sum_j c_j \phi(\mathbf{x}_j) \right\rangle = \left| \sum_i c_i \phi(\mathbf{x}_i) \right|^2 \geq 0$$
Any kernel derived from a feature map automatically satisfies p.s.d. The remarkable converse—that every p.s.d. function corresponds to some feature map—is Mercer's theorem, which we explore in the next page.
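The sketch below checks this chain of equalities numerically for a kernel whose feature map we can write down explicitly (the degree-2 polynomial features from earlier); the random data and coefficient vector are arbitrary.

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(6, 3)   # arbitrary data
c = np.random.randn(6)      # arbitrary coefficients

phi = lambda v: np.outer(v, v).ravel()   # explicit feature map for k(x, y) = (x . y)^2
Phi = np.array([phi(x) for x in X])      # n x D feature matrix
K = Phi @ Phi.T                          # Gram matrix

quad_form = c @ K @ c                    # sum_ij c_i c_j k(x_i, x_j)
norm_sq = np.sum((Phi.T @ c)**2)         # || sum_i c_i phi(x_i) ||^2
print(np.isclose(quad_form, norm_sq), quad_form >= 0)  # True True
```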
Many natural-seeming similarity functions fail the p.s.d. condition:
• $k(\mathbf{x}, \mathbf{x}') = -|\mathbf{x} - \mathbf{x}'|$ (negative distance)
• $k(\mathbf{x}, \mathbf{x}') = \cos(|\mathbf{x} - \mathbf{x}'|)$ (oscillates with distance)
• Some graph similarity measures
Using non-p.s.d. functions as "kernels" can cause algorithms to fail (e.g., optimization becomes non-convex, covariance matrices become invalid).
```python
import numpy as np

def is_positive_semidefinite(K, tol=1e-10):
    """
    Check if a matrix is positive semi-definite.
    A matrix is p.s.d. if all eigenvalues are non-negative.
    """
    eigenvalues = np.linalg.eigvalsh(K)  # Real eigenvalues (K is symmetric)
    min_eigenvalue = np.min(eigenvalues)
    return min_eigenvalue >= -tol, min_eigenvalue, eigenvalues

# Generate sample data
np.random.seed(42)
X = np.random.randn(10, 3)

# Valid kernel: RBF
def build_rbf_kernel(X, gamma=1.0):
    n = X.shape[0]
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = np.exp(-gamma * np.sum((X[i] - X[j])**2))
    return K

# Invalid "kernel": negative distance
def build_neg_distance_matrix(X):
    n = X.shape[0]
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = -np.linalg.norm(X[i] - X[j])
    return K

K_rbf = build_rbf_kernel(X)
K_bad = build_neg_distance_matrix(X)

is_psd_rbf, min_eig_rbf, _ = is_positive_semidefinite(K_rbf)
is_psd_bad, min_eig_bad, eigs_bad = is_positive_semidefinite(K_bad)

print("RBF Kernel Matrix:")
print(f"  Is p.s.d.? {is_psd_rbf}")
print(f"  Min eigenvalue: {min_eig_rbf:.6f}")

print("\nNegative Distance 'Kernel':")
print(f"  Is p.s.d.? {is_psd_bad}")
print(f"  Min eigenvalue: {min_eig_bad:.4f}")
print(f"  Eigenvalues: {np.round(eigs_bad, 3)}")

# The negative distance matrix has negative eigenvalues!
# Using it as a kernel would cause algorithms to fail.
```

Practical Implications of PSD Failures
When an algorithm assumes a p.s.d. kernel matrix but receives one that isn't:
• Optimization problems that rely on convexity (such as the SVM dual) become non-convex, and solvers may diverge or return meaningless solutions.
• Matrix factorizations that require p.s.d. structure (e.g., Cholesky) fail.
• Quantities that should be non-negative, such as predictive variances in Gaussian process models, can come out negative.
Always verify that your kernel is p.s.d. before using it in algorithms that assume this property.
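If a kernel matrix does fail the check, one common practical remedy is to project it onto the p.s.d. cone by clipping negative eigenvalues. The sketch below is only illustrative; `project_to_psd` is a hypothetical helper, not a function of any specific library.

```python
import numpy as np

def project_to_psd(K, floor=0.0):
    """Hypothetical helper: clip negative eigenvalues to obtain the nearest
    (in Frobenius norm) symmetric p.s.d. matrix."""
    K_sym = (K + K.T) / 2                    # enforce symmetry first
    eigvals, eigvecs = np.linalg.eigh(K_sym)
    eigvals = np.clip(eigvals, floor, None)  # remove negative eigenvalues
    return eigvecs @ np.diag(eigvals) @ eigvecs.T

# Repair the negative-distance matrix from the previous snippet
np.random.seed(42)
X = np.random.randn(10, 3)
K_bad = -np.array([[np.linalg.norm(a - b) for b in X] for a in X])
K_fixed = project_to_psd(K_bad)
print(np.min(np.linalg.eigvalsh(K_fixed)) >= -1e-10)  # True: now p.s.d.
```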
One of the most powerful aspects of kernels is that they can be combined to create new kernels. The set of p.s.d. kernels is closed under several operations, allowing us to construct complex kernels from simple building blocks.
Kernel Closure Properties
If $k_1$ and $k_2$ are valid (p.s.d.) kernels on $\mathcal{X}$, then the following are also valid kernels:
• $c \, k_1(\mathbf{x}, \mathbf{x}')$ for any constant $c \geq 0$
• $k_1(\mathbf{x}, \mathbf{x}') + k_2(\mathbf{x}, \mathbf{x}')$
• $k_1(\mathbf{x}, \mathbf{x}') \cdot k_2(\mathbf{x}, \mathbf{x}')$
• $f(\mathbf{x}) \, k_1(\mathbf{x}, \mathbf{x}') \, f(\mathbf{x}')$ for any function $f: \mathcal{X} \rightarrow \mathbb{R}$
• $p(k_1(\mathbf{x}, \mathbf{x}'))$ for any polynomial $p$ with non-negative coefficients
• $\exp(k_1(\mathbf{x}, \mathbf{x}'))$
Each algebraic operation has a feature space interpretation (the first two are verified numerically in the sketch below):
• Addition: concatenates feature spaces: $\phi(\mathbf{x}) = [\phi_1(\mathbf{x}); \phi_2(\mathbf{x})]$
• Product: takes the tensor product of features: $\phi(\mathbf{x}) = \phi_1(\mathbf{x}) \otimes \phi_2(\mathbf{x})$
• Polynomial/Exponential: implicitly creates higher-order feature interactions
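Here is a small sketch that verifies the addition and product interpretations for kernels whose feature maps are explicit (the linear kernel and the degree-2 polynomial kernel with $c = 0$); the specific vectors are arbitrary.

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

k1 = x @ y           # linear kernel, feature map phi1(v) = v
k2 = (x @ y) ** 2    # degree-2 polynomial kernel (c = 0), phi2(v) = all products v_i v_j

phi1 = lambda v: v
phi2 = lambda v: np.outer(v, v).ravel()

# Addition <-> concatenation of feature vectors
phi_sum = lambda v: np.concatenate([phi1(v), phi2(v)])
print(np.isclose(k1 + k2, phi_sum(x) @ phi_sum(y)))    # True

# Product <-> tensor (outer) product of feature vectors
phi_prod = lambda v: np.outer(phi1(v), phi2(v)).ravel()
print(np.isclose(k1 * k2, phi_prod(x) @ phi_prod(y)))  # True
```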
Deriving the RBF Kernel
Kernel algebra provides a beautiful derivation of the Gaussian RBF kernel from the linear kernel. Starting with the squared exponential:
$$k_{\text{RBF}}(\mathbf{x}, \mathbf{x}') = \exp\left( -\gamma |\mathbf{x} - \mathbf{x}'|^2 \right)$$
Expanding the squared norm:
$$= \exp\left( -\gamma (|\mathbf{x}|^2 - 2\mathbf{x}^\top\mathbf{x}' + |\mathbf{x}'|^2) \right)$$
$$= \exp(-\gamma |\mathbf{x}|^2) \cdot \exp(2\gamma \mathbf{x}^\top\mathbf{x}') \cdot \exp(-\gamma |\mathbf{x}'|^2)$$
The middle term, $\exp(2\gamma \mathbf{x}^\top\mathbf{x}')$, is an exponential of the linear kernel—valid by the exponential rule. The outer terms are constant factors for each point (they can be absorbed into a normalized feature map).
Thus, the RBF kernel inherits its validity from the linear kernel through the exponential operation!
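A quick numerical check of this factorization, with an arbitrary $\gamma$ and arbitrary vectors (a sketch, not part of the derivation itself):

```python
import numpy as np

gamma = 0.5
x = np.array([1.0, -2.0, 0.5])
y = np.array([0.3, 0.7, -1.2])

# exp(-gamma ||x - y||^2) = f(x) * exp(2 gamma x.y) * f(y), with f(v) = exp(-gamma ||v||^2)
f = lambda v: np.exp(-gamma * (v @ v))
lhs = np.exp(-gamma * np.sum((x - y)**2))
rhs = f(x) * np.exp(2 * gamma * (x @ y)) * f(y)
print(np.isclose(lhs, rhs))  # True
```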
```python
import numpy as np

# Base kernels
def linear_kernel(x, y):
    return np.dot(x, y)

def poly_kernel(x, y, d=2, c=1):
    return (np.dot(x, y) + c) ** d

def rbf_kernel(x, y, gamma=1.0):
    return np.exp(-gamma * np.sum((x - y)**2))

# Kernel algebra: combining kernels
def kernel_sum(k1, k2):
    """k(x,y) = k1(x,y) + k2(x,y)"""
    return lambda x, y: k1(x, y) + k2(x, y)

def kernel_product(k1, k2):
    """k(x,y) = k1(x,y) * k2(x,y)"""
    return lambda x, y: k1(x, y) * k2(x, y)

def kernel_scale(k, c):
    """k(x,y) = c * k(x,y)"""
    return lambda x, y: c * k(x, y)

# Example: Create a hybrid kernel
# k(x,y) = 0.5 * rbf(x,y) + 0.3 * poly(x,y) + 0.2 * linear(x,y)
hybrid = kernel_sum(
    kernel_scale(rbf_kernel, 0.5),
    kernel_sum(
        kernel_scale(lambda x, y: poly_kernel(x, y, 2, 1), 0.3),
        kernel_scale(linear_kernel, 0.2)
    )
)

# Verify on sample data
np.random.seed(42)
X = np.random.randn(5, 3)

# Build hybrid kernel matrix
n = X.shape[0]
K_hybrid = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        K_hybrid[i, j] = hybrid(X[i], X[j])

# Verify p.s.d.
eigenvalues = np.linalg.eigvalsh(K_hybrid)
print("Hybrid Kernel Matrix:")
print(np.round(K_hybrid, 4))
print(f"\nEigenvalues: {np.round(eigenvalues, 4)}")
print(f"All eigenvalues ≥ 0? {np.all(eigenvalues >= -1e-10)}")
print("\nThe hybrid kernel is valid (p.s.d.)!")
```

The Representer Theorem is the mathematical result that justifies kernel methods. It tells us that even when optimizing over infinite-dimensional function spaces, the optimal solution can always be expressed in terms of the training data.
Statement (Simplified)
Consider minimizing a regularized empirical risk over a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$:
$$\min_{f \in \mathcal{H}} \left[ \sum_{i=1}^n L(y_i, f(\mathbf{x}_i)) + \lambda |f|_{\mathcal{H}}^2 \right]$$
where $L$ is any loss function and $|f|_{\mathcal{H}}$ is the RKHS norm. Then the minimizer has the form:
$$f^*(\mathbf{x}) = \sum_{i=1}^n \alpha_i k(\mathbf{x}_i, \mathbf{x})$$
for some coefficients $\alpha_1, \ldots, \alpha_n \in \mathbb{R}$.
Implications
This is remarkable: the optimal function, which could in principle be any element of an infinite-dimensional space, is actually a finite linear combination of the kernel evaluated at training points.
The Representer Theorem converts an infinite-dimensional optimization problem into a finite-dimensional one. Instead of searching over all possible functions in the RKHS, we only need to find $n$ coefficients $\alpha_1, \ldots, \alpha_n$. This makes kernel methods computationally tractable.
Intuition for the Result
Why should the optimal function be expressible in terms of training points? Consider these observations:
Function values at training points determine the loss: The empirical risk only depends on $f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n)$.
The regularizer penalizes "unnecessary complexity": Adding components orthogonal to the training data increases $|f|_{\mathcal{H}}^2$ without improving the loss.
The optimal balance: The minimizer uses exactly the components needed to fit the training data, without extra complexity.
Formally, we can decompose any function $f \in \mathcal{H}$ as: $$f = f_\parallel + f_\perp$$ where $f_\parallel$ lies in the span of $\{k(\cdot, \mathbf{x}_i)\}_{i=1}^n$ and $f_\perp$ is orthogonal to this span.
Since $f_\perp(\mathbf{x}_i) = \langle f_\perp, k(\cdot, \mathbf{x}_i) \rangle_{\mathcal{H}} = 0$ for all training points (by orthogonality and the reproducing property), $f_\perp$ doesn't affect the loss. But $|f|^2 = |f_\parallel|^2 + |f_\perp|^2 \geq |f_\parallel|^2$, so the regularizer is strictly worse when $f_\perp \neq 0$.
Therefore, the optimal $f^*$ has $f^*_\perp = 0$, meaning $f^* = f^*_\parallel \in \text{span}\{k(\cdot, \mathbf{x}_i)\}_{i=1}^n$.
```python
import numpy as np

def kernel_ridge_regression(K, y, lambda_reg):
    """
    Solve kernel ridge regression.

    By the Representer Theorem, the solution is:
        f(x) = sum_i alpha_i * k(x_i, x)
    where alpha = (K + λI)^{-1} y

    Parameters:
    - K: (n, n) kernel matrix
    - y: (n,) targets
    - lambda_reg: regularization parameter

    Returns:
    - alpha: (n,) coefficients
    """
    n = K.shape[0]
    alpha = np.linalg.solve(K + lambda_reg * np.eye(n), y)
    return alpha

def predict(K_test, alpha):
    """
    Make predictions using learned coefficients.

    K_test[i, j] = k(x_test_i, x_train_j)
    prediction[i] = sum_j alpha[j] * k(x_test_i, x_train_j)
                  = K_test[i, :] @ alpha
    """
    return K_test @ alpha

# Example: nonlinear regression with RBF kernel
np.random.seed(42)

# Training data
X_train = np.linspace(0, 4*np.pi, 20).reshape(-1, 1)
y_train = np.sin(X_train.ravel()) + 0.1 * np.random.randn(20)

# Test data
X_test = np.linspace(0, 4*np.pi, 100).reshape(-1, 1)
y_test_true = np.sin(X_test.ravel())

# Compute kernel matrices
gamma = 1.0

def rbf_kernel_matrix(X1, X2, gamma):
    K = np.zeros((X1.shape[0], X2.shape[0]))
    for i in range(X1.shape[0]):
        for j in range(X2.shape[0]):
            K[i, j] = np.exp(-gamma * np.sum((X1[i] - X2[j])**2))
    return K

K_train = rbf_kernel_matrix(X_train, X_train, gamma)
K_test = rbf_kernel_matrix(X_test, X_train, gamma)

# Solve kernel ridge regression
lambda_reg = 0.01
alpha = kernel_ridge_regression(K_train, y_train, lambda_reg)

# Predictions
y_pred = predict(K_test, alpha)

print(f"Number of training points (n): {len(X_train)}")
print(f"Number of coefficients (alpha): {len(alpha)}")
print(f"")
print(f"By the Representer Theorem, f(x) = Σᵢ αᵢ k(xᵢ, x)")
print(f"")
print(f"Test MSE: {np.mean((y_pred - y_test_true)**2):.6f}")
```

Beyond the formal mathematics, kernels can be understood intuitively as similarity functions with special properties. This perspective is valuable for designing kernels and understanding what assumptions they encode.
The Similarity Interpretation
When we choose a kernel $k$, we are implicitly stating: "I believe two inputs $\mathbf{x}$ and $\mathbf{x}'$ should contribute similarly to predictions if $k(\mathbf{x}, \mathbf{x}')$ is large."
Different kernels encode different notions of "similar":
| Kernel | What Makes x and x' Similar? |
|---|---|
| Linear: $\mathbf{x}^\top \mathbf{x}'$ | Correlated components, similar direction in feature space |
| Polynomial: $(\mathbf{x}^\top \mathbf{x}' + c)^p$ | Similar direction + matching interaction patterns |
| RBF: $\exp(-\gamma|\mathbf{x}-\mathbf{x}'|^2)$ | Close in Euclidean distance (locality) |
| Laplacian: $\exp(-\gamma|\mathbf{x}-\mathbf{x}'|_1)$ | Close in Manhattan distance (more robust to outliers) |
| Sigmoid: $\tanh(a \mathbf{x}^\top \mathbf{x}' + c)$ | Neural network-like response patterns |
The RBF Kernel's Locality
The RBF kernel $k(\mathbf{x}, \mathbf{x}') = \exp(-\gamma|\mathbf{x}-\mathbf{x}'|^2)$ has a particularly intuitive interpretation: it equals $1$ when $\mathbf{x} = \mathbf{x}'$ and decays smoothly toward $0$ as the points move apart, at a rate controlled by $\gamma$.
This local behavior makes RBF kernels excellent for interpolating smooth functions: predictions at a point $\mathbf{x}$ are dominated by nearby training points.
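The decay is easy to see numerically; this small sketch prints RBF similarity as a function of distance for a few illustrative values of $\gamma$:

```python
import numpy as np

# RBF similarity exp(-gamma * r^2) at several distances r, for a few gammas
distances = np.array([0.0, 0.5, 1.0, 2.0, 4.0])
for gamma in [0.1, 1.0, 10.0]:
    print(f"gamma={gamma:5.1f}:", np.round(np.exp(-gamma * distances**2), 4))
```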
Choosing a kernel is analogous to choosing a prior in Bayesian inference. The kernel encodes our beliefs about the function space—how smooth functions should be, whether periodicity is expected, whether interactions matter. Poor kernel choice leads to poor generalization, just as a misspecified prior leads to poor posterior inference.
Normalized Kernels
For interpretation as pure similarity (not scaled by point "magnitudes"), we often use normalized kernels:
$$\tilde{k}(\mathbf{x}, \mathbf{x}') = \frac{k(\mathbf{x}, \mathbf{x}')}{\sqrt{k(\mathbf{x}, \mathbf{x}) \cdot k(\mathbf{x}', \mathbf{x}')}}$$
The normalized kernel satisfies:
• $\tilde{k}(\mathbf{x}, \mathbf{x}) = 1$ for all $\mathbf{x}$
• $|\tilde{k}(\mathbf{x}, \mathbf{x}')| \leq 1$ (by the Cauchy–Schwarz inequality)
• It is still a valid p.s.d. kernel, since it has the form $f(\mathbf{x})\, k(\mathbf{x}, \mathbf{x}')\, f(\mathbf{x}')$ with $f(\mathbf{x}) = 1/\sqrt{k(\mathbf{x}, \mathbf{x})}$
The RBF kernel is already normalized: $k(\mathbf{x}, \mathbf{x}) = 1$ for all $\mathbf{x}$. The linear kernel is not, and its normalized version is exactly cosine similarity.
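A short check of this fact, with arbitrary vectors (a sketch):

```python
import numpy as np

x = np.array([3.0, 1.0, -2.0])
y = np.array([0.5, 4.0, 1.0])

k = lambda a, b: a @ b                               # linear kernel
k_normalized = k(x, y) / np.sqrt(k(x, x) * k(y, y))  # normalized linear kernel
cosine = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(np.isclose(k_normalized, cosine))  # True
```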
We have developed a comprehensive understanding of kernel functions—the mathematical objects that make the kernel trick possible.
Core Concepts
• A kernel is a function $k(\mathbf{x}, \mathbf{x}')$ that equals an inner product $\langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle_{\mathcal{H}}$ in some feature space, and it can be evaluated without constructing $\phi$ explicitly.
• The kernel (Gram) matrix collects all pairwise evaluations and is always symmetric and positive semi-definite.
• Positive semi-definiteness is exactly the condition that characterizes valid kernels.
• Valid kernels can be combined (sums, products, scaling, exponentials) to build new kernels.
• The Representer Theorem guarantees that the optimal function is a finite combination $\sum_i \alpha_i k(\mathbf{x}_i, \cdot)$ of kernels centered at the training points.
You now have a solid foundation in kernel functions. In the next page, we will explore the computational advantage of kernels—how they enable algorithms to work efficiently in very high or infinite-dimensional feature spaces.