With a solid understanding of the kernel trick, feature spaces, and Mercer's theorem, we now turn to the practical question: What kernels should we actually use?
Over decades of research and application, certain kernel families have emerged as particularly useful. Each kernel encodes specific assumptions about the structure of the target function—its smoothness, periodicity, locality, or correlation patterns. Choosing the right kernel is both a science and an art: the science lies in understanding kernel properties, while the art involves matching these properties to domain knowledge.
This page provides a comprehensive survey of the most important kernel functions, their mathematical properties, and practical guidance for selection.
By the end of this page, you will be able to:

• Describe the properties and use cases for major kernel families
• Understand geometric and probabilistic interpretations of each kernel
• Make informed kernel selections based on problem characteristics
• Tune kernel hyperparameters effectively
• Combine kernels to capture complex structure
The simplest kernel is the linear kernel, whose feature map is simply the identity: it corresponds to no feature transformation at all.
Definition
$$k_{\text{linear}}(\mathbf{x}, \mathbf{x}') = \mathbf{x}^\top \mathbf{x}'$$
Or with optional constant and scaling: $$k_{\text{linear}}(\mathbf{x}, \mathbf{x}') = c + \gamma \mathbf{x}^\top \mathbf{x}'$$
Feature Map: The identity $\phi(\mathbf{x}) = \mathbf{x}$ (for the basic form).
Properties
| Property | Value |
|---|---|
| Feature dimension | $d$ (input dimension) |
| Computational cost | $O(d)$ |
| Stationarity | Non-stationary |
| Universal | No |
| Characteristic | No |
The linear kernel is appropriate when:
• The relationship between features and target is genuinely linear
• Feature dimension $d$ is very high (e.g., text with bag-of-words) where explicit features are already rich
• You want to match the expressiveness of linear regression/SVM
• Computational efficiency is paramount
• Interpretability of weights is important
In high-dimensional sparse settings (NLP, genomics), linear kernels often work surprisingly well.
Geometric Interpretation
The linear kernel measures the cosine of the angle between vectors (when both have unit norm) or, more generally, the projection of one vector onto another scaled by their magnitudes.

Two vectors $\mathbf{x}$ and $\mathbf{x}'$ satisfy $k_{\text{linear}}(\mathbf{x}, \mathbf{x}') = \|\mathbf{x}\| \, \|\mathbf{x}'\| \cos\theta$, where $\theta$ is the angle between them: the kernel is large for long, aligned vectors, zero for orthogonal vectors, and negative when they point in opposing directions. A small numerical check appears below.
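To make this concrete, here is a minimal sketch (arbitrary example vectors) showing that the linear kernel of unit-normalized inputs is exactly their cosine similarity.

```python
import numpy as np

# Minimal sketch (arbitrary example vectors): the linear kernel of unit-norm
# inputs equals their cosine similarity.
x = np.array([3.0, 4.0])
x_prime = np.array([1.0, 2.0])

k = x @ x_prime                                          # linear kernel x^T x'
cos_sim = k / (np.linalg.norm(x) * np.linalg.norm(x_prime))

x_unit = x / np.linalg.norm(x)
xp_unit = x_prime / np.linalg.norm(x_prime)

print(f"k(x, x')              = {k:.4f}")
print(f"cosine similarity     = {cos_sim:.4f}")
print(f"k(x/||x||, x'/||x'||) = {x_unit @ xp_unit:.4f}  (matches cosine similarity)")
```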
Connection to Linear Models
Using the linear kernel in kernel ridge regression gives: $$\boldsymbol{\alpha} = (\mathbf{X}\mathbf{X}^\top + \lambda \mathbf{I})^{-1} \mathbf{y}$$
which is equivalent to standard ridge regression (using the Woodbury identity). The kernel formulation offers no computational advantage here—it's included for completeness and to highlight that linear models are a special case of kernel methods.
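A quick numerical check (a minimal sketch on synthetic data) confirms this equivalence: the dual coefficients obtained with the linear kernel give the same predictions as primal ridge regression.

```python
import numpy as np

# Minimal sketch with synthetic data: kernel ridge regression with the linear
# kernel reproduces primal ridge regression predictions exactly.
rng = np.random.default_rng(0)
n, d, lam = 50, 5, 0.1
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Dual (kernel) form: alpha = (X X^T + lam I)^{-1} y
K = X @ X.T
alpha = np.linalg.solve(K + lam * np.eye(n), y)

# Primal form: w = (X^T X + lam I)^{-1} X^T y
w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

X_test = rng.standard_normal((5, d))
pred_dual = (X_test @ X.T) @ alpha   # predictions via kernel evaluations
pred_primal = X_test @ w             # predictions via explicit weights

print("Max prediction difference:", np.max(np.abs(pred_dual - pred_primal)))
```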
Polynomial kernels extend linear kernels to capture nonlinear relationships while maintaining a finite-dimensional feature space.
Definition
$$k_{\text{poly}}(\mathbf{x}, \mathbf{x}') = (\gamma \mathbf{x}^\top \mathbf{x}' + c)^p$$
where:

• $p \in \mathbb{N}$ is the polynomial degree
• $\gamma > 0$ is a scale parameter
• $c \geq 0$ is a constant that controls the weight of lower-order terms

Common variants:

• Homogeneous polynomial kernel ($c = 0$): $(\gamma \mathbf{x}^\top \mathbf{x}')^p$, containing only degree-$p$ monomials
• Inhomogeneous polynomial kernel ($c > 0$): contains all monomials up to degree $p$
• Quadratic kernel ($p = 2$): the most common choice for capturing pairwise feature interactions
Feature Map
The feature map consists of all monomials up to degree $p$. For example, with $p = 2$, $c = 1$, $d = 2$:
$$\phi(x_1, x_2) = (1, \sqrt{2}x_1, \sqrt{2}x_2, x_1^2, \sqrt{2}x_1 x_2, x_2^2)$$
Feature dimension: $\binom{d + p}{p}$ for inhomogeneous; $\binom{d + p - 1}{p}$ for homogeneous.
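To verify the feature map above, here is a minimal sketch (arbitrary 2-D example points) checking that the inner product of the explicit degree-2 features matches the kernel value $(1 + \mathbf{x}^\top \mathbf{x}')^2$ computed directly.

```python
import numpy as np

# Minimal sketch: explicit degree-2 feature map vs. the polynomial kernel
# (p=2, c=1, gamma=1), using arbitrary 2-D example points.
def phi(x):
    """Explicit feature map for (1 + x.y)^2 in 2 dimensions."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, np.sqrt(2) * x1 * x2, x2**2])

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])

explicit = phi(x) @ phi(y)       # inner product in feature space
implicit = (1 + x @ y) ** 2      # kernel trick: no feature map needed

print(f"phi(x).phi(y) = {explicit:.4f}")
print(f"(1 + x.y)^2   = {implicit:.4f}")   # identical values
```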
| Property | Value |
|---|---|
| Feature dimension ($c > 0$) | $\binom{d + p}{p}$ |
| Computational cost | $O(d)$ |
| Stationarity | Non-stationary |
| Universal | No (for finite $p$) |
| Captures interactions | Yes, up to order $p$ |
Degree Selection

Low degrees ($p = 2$ or $3$) capture pairwise and three-way feature interactions and are usually sufficient; higher degrees rarely help and increase the risk of overfitting, as noted in the practical advice below.

Numerical Stability

High-degree polynomial kernels can suffer from numerical issues:

• Kernel values grow (or shrink) exponentially with $p$, so entries of the kernel matrix can span many orders of magnitude
• The resulting kernel matrix can become ill-conditioned, making training numerically unstable
• Normalizing or standardizing inputs (so that $\gamma \mathbf{x}^\top \mathbf{x}' + c$ stays close to 1) mitigates both problems
Practical Advice: Polynomial kernels work well for NLP (natural language processing) where the input is already high-dimensional and sparse. Degree 2 or 3 is usually sufficient.
```python
import numpy as np
from math import comb

def polynomial_kernel(x, y, degree=3, gamma=1.0, coef0=1.0):
    """
    Compute polynomial kernel: (γ x·y + c)^d

    Parameters:
        degree: polynomial degree p
        gamma: scale parameter γ
        coef0: constant term c
    """
    return (gamma * np.dot(x, y) + coef0) ** degree

# Feature dimension for different degrees
print("Polynomial Feature Dimensions")
print("=" * 50)
print(f"{'d (input dim)':>12} {'degree':>8} {'D (feature dim)':>15}")
print("-" * 50)

for d in [10, 50, 100, 1000]:
    for p in [2, 3, 4, 5]:
        D = comb(d + p, p)
        print(f"{d:>12} {p:>8} {D:>15,}")

# Kernel values for different choices
print("\nKernel Value Sensitivity")
print("=" * 50)
x = np.array([1.0, 2.0, 3.0])
y = np.array([0.5, 1.5, 2.5])

inner = np.dot(x, y)
print(f"x · y = {inner}")
print(f"")

for degree in [1, 2, 3, 4, 5]:
    for coef0 in [0, 1]:
        k = polynomial_kernel(x, y, degree=degree, coef0=coef0)
        label = f"d={degree}, c={coef0}"
        print(f" {label:15} k(x,y) = {k:>15.2f}")
```

The Gaussian Radial Basis Function (RBF) kernel is arguably the most important kernel in practice. It provides a universal, infinitely smooth similarity measure based on Euclidean distance.
Definition
$$k_{\text{RBF}}(\mathbf{x}, \mathbf{x}') = \exp\left( -\gamma \|\mathbf{x} - \mathbf{x}'\|^2 \right) = \exp\left( -\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\sigma^2} \right)$$
Two equivalent parameterizations:

• $\gamma$ form: larger $\gamma$ means faster decay with distance (narrower bandwidth)
• $\sigma$ (length-scale) form: related by $\gamma = \frac{1}{2\sigma^2}$; larger $\sigma$ means smoother, more global similarity
Key Properties
| Property | Value |
|---|---|
| Feature dimension | Infinite |
| Computational cost | $O(d)$ |
| Stationarity | Stationary (translation-invariant) |
| Universal | Yes |
| Characteristic | Yes |
| Self-similarity | $k(\mathbf{x}, \mathbf{x}) = 1$ always |
| Range | $(0, 1]$ |
The RBF kernel is an excellent default choice when:
• You have no strong prior knowledge about the function structure
• You expect smooth, continuous relationships
• Local patterns matter (nearby points should have similar outputs)
• Sample size is moderate (≤ 10,000 for exact kernel methods)
• You're willing to tune the bandwidth parameter
"When in doubt, try RBF" is reasonable advice for kernel methods.
The Bandwidth Parameter $\gamma$ (or $\sigma$)
The bandwidth controls the locality of the kernel—how quickly similarity decays with distance:

• Large $\gamma$ (small $\sigma$): only very close points look similar; the model can fit fine detail but risks overfitting
• Small $\gamma$ (large $\sigma$): distant points still look similar; the model is smoother but risks underfitting
• A common starting point is the median heuristic, $\gamma = 1 / \left(2 \cdot \text{median}(\|\mathbf{x}_i - \mathbf{x}_j\|^2)\right)$, demonstrated in the code below
Geometric Interpretation
The RBF kernel value decreases as a Gaussian function of distance: points at distance $\sigma$ have similarity $e^{-1/2} \approx 0.61$, points at $2\sigma$ have $e^{-2} \approx 0.14$, and points beyond $3\sigma$ have negligible influence—the kernel is effectively local.
```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    """
    Gaussian RBF kernel: exp(-γ ||x - y||²)
    """
    sq_dist = np.sum((x - y)**2)
    return np.exp(-gamma * sq_dist)

def rbf_kernel_matrix(X, gamma=1.0):
    """
    Compute RBF kernel matrix efficiently using broadcasting.
    """
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    return np.exp(-gamma * sq_dists)

# Demonstrate effect of gamma
np.random.seed(42)
n = 5
X = np.random.randn(n, 2)

print("RBF Kernel Matrices for Different γ Values")
print("=" * 60)

for gamma in [0.1, 1.0, 10.0, 100.0]:
    K = rbf_kernel_matrix(X, gamma)
    print(f"\nγ = {gamma}")
    print(f"  Kernel matrix (first 3x3):")
    print(f"  {K[:3, :3].round(4)}")
    print(f"  Off-diagonal range: [{K[K != 1].min():.4f}, {K[K != 1].max():.4f}]")

    # Effective neighborhood size
    threshold = 0.01  # Consider neighbors with k > 0.01
    avg_neighbors = np.mean(np.sum(K > threshold, axis=1))
    print(f"  Avg neighbors (k > 0.01): {avg_neighbors:.1f}")

# Heuristic for gamma selection
print("\nGamma Selection Heuristics")
print("=" * 60)
print(f"Median heuristic: γ = 1 / (2 × median(||xᵢ - xⱼ||²))")
pairwise_sq_dists = np.sum((X[:, None, :] - X[None, :, :])**2, axis=2)
median_sq_dist = np.median(pairwise_sq_dists[pairwise_sq_dists > 0])
gamma_median = 1 / (2 * median_sq_dist)
print(f"  For this data: γ = {gamma_median:.4f}")
```

The Matérn family of kernels generalizes both the Gaussian RBF and Laplacian kernels, providing control over the smoothness of the target function.
Matérn Kernel
$$k_{\text{Matérn}}(\mathbf{x}, \mathbf{x}') = \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}\, \|\mathbf{x} - \mathbf{x}'\|}{\ell} \right)^\nu K_\nu\!\left( \frac{\sqrt{2\nu}\, \|\mathbf{x} - \mathbf{x}'\|}{\ell} \right)$$

where:

• $\nu > 0$ is the smoothness parameter
• $\ell > 0$ is the length scale
• $\Gamma$ is the gamma function
• $K_\nu$ is the modified Bessel function of the second kind
Special Cases
| $\nu$ | Kernel Name | Differentiability | Formula |
|---|---|---|---|
| $\frac{1}{2}$ | Laplacian / Exponential | Not differentiable | $\exp(-r/\ell)$ |
| $\frac{3}{2}$ | Matérn 3/2 | Once differentiable | $(1 + \sqrt{3}r/\ell) \exp(-\sqrt{3}r/\ell)$ |
| $\frac{5}{2}$ | Matérn 5/2 | Twice differentiable | $(1 + \sqrt{5}r/\ell + 5r^2/(3\ell^2)) \exp(-\sqrt{5}r/\ell)$ |
| $\infty$ | Gaussian RBF | Infinitely differentiable | $\exp(-r^2/2\ell^2)$ |
where $r = \|\mathbf{x} - \mathbf{x}'\|$.
The smoothness parameter $\nu$ controls how differentiable functions in the RKHS are:
• Functions in a Matérn-$\nu$ RKHS are $\lceil \nu \rceil - 1$ times differentiable
• Gaussian RBF ($\nu = \infty$) gives infinitely smooth functions—sometimes unrealistically smooth!
• For real-world data with some roughness, Matérn-3/2 or 5/2 often works better than RBF
The Laplacian Kernel (Matérn-1/2)
$$k_{\text{Laplacian}}(\mathbf{x}, \mathbf{x}') = \exp\left( -\gamma \|\mathbf{x} - \mathbf{x}'\| \right)$$

Note: the exponent uses the Euclidean (L2) norm itself, not the squared norm as in the RBF kernel.
Properties:

• Stationary, universal, and characteristic, like the RBF kernel
• Heavier tails: similarity decays more slowly with distance than the Gaussian
• Corresponds to rough (non-differentiable) functions in the RKHS

When to Use:

• The target function has kinks, abrupt changes, or sharp local variation
• The RBF kernel appears to oversmooth the data
• Infinite smoothness is physically unrealistic for the phenomenon being modeled
```python
import numpy as np

def laplacian_kernel(x, y, gamma=1.0):
    """Laplacian kernel: exp(-γ ||x - y||)"""
    dist = np.sqrt(np.sum((x - y)**2))
    return np.exp(-gamma * dist)

def matern_12(r, length_scale=1.0):
    """Matérn ν=1/2 (Laplacian)"""
    return np.exp(-r / length_scale)

def matern_32(r, length_scale=1.0):
    """Matérn ν=3/2"""
    scaled = np.sqrt(3) * r / length_scale
    return (1 + scaled) * np.exp(-scaled)

def matern_52(r, length_scale=1.0):
    """Matérn ν=5/2"""
    scaled = np.sqrt(5) * r / length_scale
    return (1 + scaled + scaled**2 / 3) * np.exp(-scaled)

def rbf_from_r(r, length_scale=1.0):
    """Gaussian RBF (Matérn ν=∞)"""
    return np.exp(-0.5 * (r / length_scale)**2)

# Compare kernel shapes
print("Matérn Family Comparison (length_scale = 1)")
print("=" * 60)
print(f"{'Distance r':>12} {'ν=1/2':>10} {'ν=3/2':>10} {'ν=5/2':>10} {'ν=∞(RBF)':>10}")
print("-" * 60)

for r in [0, 0.5, 1.0, 1.5, 2.0, 3.0, 5.0]:
    k12 = matern_12(r)
    k32 = matern_32(r)
    k52 = matern_52(r)
    krbf = rbf_from_r(r)
    print(f"{r:>12.1f} {k12:>10.4f} {k32:>10.4f} {k52:>10.4f} {krbf:>10.4f}")

print("\nObservation: Smaller ν → heavier tails (slower decay)")
print("Smaller ν → rougher functions in RKHS")
```

Beyond the standard kernels, specialized kernels encode domain-specific structure.
Periodic Kernel
$$k_{\text{periodic}}(x, x') = \exp\left( -\frac{2 \sin^2(\pi |x - x'| / p)}{\ell^2} \right)$$
where $p$ is the period and $\ell$ is the length scale.
Properties:

• Exactly periodic: $k(x, x + p) = k(x, x)$ for every $x$
• Stationary (depends only on $x - x'$)
• Useful for seasonal or cyclical signals; in practice it is often multiplied by an RBF kernel so the periodic pattern can decay over time (the "locally periodic" kernel in the code below)
Rational Quadratic Kernel
$$k_{\text{RQ}}(\mathbf{x}, \mathbf{x}') = \left( 1 + \frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\alpha\ell^2} \right)^{-\alpha}$$

The RQ kernel is an infinite mixture of RBF kernels with different length scales. The parameter $\alpha$ controls the mixture:

• Small $\alpha$: a broad mixture of length scales, giving heavier tails (similarity persists at large distances)
• Large $\alpha$: the mixture concentrates, and as $\alpha \to \infty$ the RQ kernel converges to the Gaussian RBF with length scale $\ell$
The sigmoid kernel $\tanh(\gamma \mathbf{x}^\top \mathbf{x}' + c)$ is NOT positive semi-definite for all parameter values! It is only valid for certain ranges of $\gamma$ and $c$ depending on the data. Use with extreme caution or prefer true neural network models instead.
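To see why, here is a minimal diagnostic sketch (synthetic data, arbitrary parameter settings) that computes the smallest eigenvalue of the tanh Gram matrix for a few $(\gamma, c)$ choices; a negative value means the matrix is not positive semi-definite.

```python
import numpy as np

# Minimal diagnostic sketch (synthetic data, arbitrary parameters): check whether
# the sigmoid "kernel" tanh(γ x·y + c) yields a positive semi-definite Gram matrix.
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 3))

print(f"{'gamma':>8} {'coef0':>8} {'min eigenvalue':>16}")
for gamma, coef0 in [(0.1, 0.0), (1.0, 1.0), (1.0, -1.0), (2.0, -2.0)]:
    K = np.tanh(gamma * X @ X.T + coef0)
    min_eig = np.min(np.linalg.eigvalsh(K))  # K is symmetric, so eigvalsh applies
    verdict = "not PSD" if min_eig < -1e-10 else "PSD for this sample"
    print(f"{gamma:>8.1f} {coef0:>8.1f} {min_eig:>16.6f}   {verdict}")
```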
```python
import numpy as np

def periodic_kernel(x, y, length_scale=1.0, period=1.0):
    """Periodic kernel for cyclical patterns."""
    dist = np.abs(x - y)
    return np.exp(-2 * np.sin(np.pi * dist / period)**2 / length_scale**2)

def rational_quadratic_kernel(x, y, alpha=1.0, length_scale=1.0):
    """Rational quadratic kernel (infinite mixture of RBFs)."""
    sq_dist = np.sum((x - y)**2)
    return (1 + sq_dist / (2 * alpha * length_scale**2))**(-alpha)

def locally_periodic_kernel(x, y, length_scale=1.0, period=1.0, decay=1.0):
    """
    Locally periodic: periodic × RBF
    Captures decaying periodic patterns.
    """
    periodic = periodic_kernel(x, y, length_scale, period)
    rbf = np.exp(-np.sum((x - y)**2) / (2 * decay**2))
    return periodic * rbf

# Demonstrate periodic kernel (scalar inputs, so formatted printing works)
print("Periodic Kernel Demo (period=2π)")
print("=" * 50)

period = 2 * np.pi
for delta in [0, np.pi/4, np.pi/2, np.pi, 3*np.pi/2, 2*np.pi, 5*np.pi/2]:
    x, y = 0.0, delta
    k = periodic_kernel(x, y, length_scale=1.0, period=period)
    print(f"  |x - y| = {delta/np.pi:.2f}π → k(x, y) = {k:.4f}")

print("\nNote: k(0, 2π) = k(0, 0) = 1 (exact period)")

# Compare RQ kernel for different alpha
print("\nRational Quadratic vs RBF Comparison")
print("=" * 50)
print(f"{'Distance':>10} {'RBF':>10} {'RQ α=1':>10} {'RQ α=10':>10} {'RQ α=0.1':>10}")
print("-" * 50)

for r in [0, 0.5, 1.0, 2.0, 5.0]:
    x, y = 0.0, float(r)
    rbf = np.exp(-r**2 / 2)
    rq1 = rational_quadratic_kernel(x, y, alpha=1.0)
    rq10 = rational_quadratic_kernel(x, y, alpha=10.0)
    rq01 = rational_quadratic_kernel(x, y, alpha=0.1)
    print(f"{r:>10.1f} {rbf:>10.4f} {rq1:>10.4f} {rq10:>10.4f} {rq01:>10.4f}")

print("\nSmall α: heavier tails (retains similarity at large distances)")
```

Complex real-world phenomena often exhibit multiple types of structure: smooth trends, periodic oscillations, and local variations. Kernel composition allows us to build kernels that capture these composite patterns.
Kernel Algebra Recap
Valid kernels can be combined to form new valid kernels:
Sum: $k(\mathbf{x}, \mathbf{x}') = k_1(\mathbf{x}, \mathbf{x}') + k_2(\mathbf{x}, \mathbf{x}')$
Product: $k(\mathbf{x}, \mathbf{x}') = k_1(\mathbf{x}, \mathbf{x}') \cdot k_2(\mathbf{x}, \mathbf{x}')$
Tensor/Direct Sum (for multi-dimensional inputs): split the input into blocks $\mathbf{x} = (\mathbf{x}_a, \mathbf{x}_b)$ and combine a kernel on each block, either additively, $k(\mathbf{x}, \mathbf{x}') = k_a(\mathbf{x}_a, \mathbf{x}_a') + k_b(\mathbf{x}_b, \mathbf{x}_b')$ (direct sum), or multiplicatively, $k(\mathbf{x}, \mathbf{x}') = k_a(\mathbf{x}_a, \mathbf{x}_a') \, k_b(\mathbf{x}_b, \mathbf{x}_b')$ (tensor product); a minimal sketch follows.
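Here is that sketch (hypothetical feature split, illustrative parameters): an RBF kernel on one block of features, a linear kernel on another, combined by direct sum and tensor product.

```python
import numpy as np

# Minimal sketch (hypothetical feature split and parameters): combine kernels
# over different feature blocks via direct sum and tensor product.
def rbf(a, b, length_scale=1.0):
    return np.exp(-np.sum((a - b)**2) / (2 * length_scale**2))

def linear(a, b):
    return float(np.dot(a, b))

def block_kernels(x, x_prime, split=2):
    """Split the input into two blocks and kernelize each block separately."""
    xa, xb = x[:split], x[split:]
    xa_p, xb_p = x_prime[:split], x_prime[split:]
    k_a = rbf(xa, xa_p, length_scale=1.0)   # smooth similarity on block a
    k_b = linear(xb, xb_p)                  # linear similarity on block b
    return k_a, k_b

x = np.array([0.2, -0.5, 1.0, 2.0])
x_prime = np.array([0.3, -0.4, 0.5, 1.5])

k_a, k_b = block_kernels(x, x_prime)
print(f"Direct sum:     k_a + k_b = {k_a + k_b:.4f}")
print(f"Tensor product: k_a * k_b = {k_a * k_b:.4f}")
```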
A common pattern for time series:
$k = k_{\text{trend}} + k_{\text{seasonal}} + k_{\text{noise}}$
$= \text{RBF}(\text{long length-scale}) + \text{Periodic} \times \text{RBF}(\text{decay}) + \text{White noise}$
This automatically decomposes the signal into interpretable components!
```python
import numpy as np

class CompositeKernel:
    """
    Build composite kernels from basic kernels.
    """
    def __init__(self, kernel_func):
        self.kernel_func = kernel_func

    def __call__(self, x, y):
        return self.kernel_func(x, y)

    def __add__(self, other):
        """Sum of kernels: k1 + k2"""
        return CompositeKernel(lambda x, y: self(x, y) + other(x, y))

    def __mul__(self, other):
        """Product of kernels: k1 * k2"""
        if isinstance(other, (int, float)):
            return CompositeKernel(lambda x, y: other * self(x, y))
        return CompositeKernel(lambda x, y: self(x, y) * other(x, y))

    def __rmul__(self, other):
        return self.__mul__(other)

# Define base kernels
def make_rbf(length_scale):
    return CompositeKernel(
        lambda x, y: np.exp(-np.sum((x - y)**2) / (2 * length_scale**2))
    )

def make_periodic(period, length_scale):
    return CompositeKernel(
        lambda x, y: np.exp(-2 * np.sin(np.pi * np.abs(x - y) / period)**2 / length_scale**2)
    )

def make_linear():
    return CompositeKernel(lambda x, y: np.dot(x, y))

def make_white_noise(variance):
    return CompositeKernel(
        lambda x, y: variance if np.allclose(x, y) else 0
    )

# Build a composite kernel for time series
# Pattern: long-term trend + decaying seasonality + noise
trend = make_rbf(length_scale=10.0)   # Long-range smooth trend
seasonal = make_periodic(period=1.0, length_scale=0.5) * make_rbf(length_scale=5.0)
noise = make_white_noise(variance=0.1)

time_series_kernel = 1.0 * trend + 0.5 * seasonal + noise

# Evaluate the composite kernel
print("Composite Time Series Kernel")
print("=" * 60)
print("k = 1.0 × RBF(ℓ=10) + 0.5 × (Periodic(p=1) × RBF(ℓ=5)) + WhiteNoise(σ²=0.1)")
print("")

# Kernel matrix for scalar sample time points
t = np.linspace(0, 5, 10)
K = np.zeros((10, 10))
for i in range(10):
    for j in range(10):
        K[i, j] = time_series_kernel(t[i], t[j])

print("Sample kernel matrix (first 5×5):")
print(K[:5, :5].round(3))
```

Choosing the right kernel is a crucial modeling decision. Here we distill practical guidelines based on problem characteristics.
Decision Framework
| Problem Characteristic | Recommended Kernel(s) | Reasoning |
|---|---|---|
| Genuinely linear relationship | Linear | Simplest, most interpretable |
| Unknown smooth function | Gaussian RBF | Universal, flexible, good default |
| Function with abrupt changes | Laplacian or Matérn-1/2 | Non-differentiable functions |
| Moderately rough function | Matérn-3/2 or 5/2 | Finite differentiability |
| Periodic patterns | Periodic × RBF | Captures cyclical behavior with decay |
| Multi-scale patterns | Rational Quadratic | Infinite mixture of length scales |
| High-dimensional sparse data | Linear or low-degree Polynomial | Avoids curse of dimensionality |
| Feature interactions matter | Polynomial (degree 2-3) | Explicit interaction terms |
| Text/sequence data | String kernels | Structure-aware comparison |
| Unknown structure | Multiple kernel learning | Learn the right combination |
Common hyperparameter defaults and tuning ranges:
• RBF $\gamma$: start with $1/(d \cdot \text{Var}(X))$, search $[10^{-4}, 10^4]$ on a log scale
• Polynomial degree: try 2 or 3, rarely higher
• Regularization $\lambda$: search $[10^{-8}, 10^2]$ on a log scale
• Matérn $\nu$: usually fix at 3/2 or 5/2; rarely tune
Grid search or Bayesian optimization work well for kernel hyperparameter tuning.
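As a minimal illustration (synthetic data, an illustrative grid narrower than the ranges above, using scikit-learn's KernelRidge and GridSearchCV), a log-scale search over the RBF bandwidth and regularization strength might look like this.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

# Minimal sketch (synthetic data, illustrative grid): log-scale search over the
# RBF bandwidth gamma and the regularization strength alpha (λ).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(200)

param_grid = {
    "gamma": np.logspace(-3, 3, 7),   # RBF bandwidth γ
    "alpha": np.logspace(-6, 1, 8),   # regularization λ
}
search = GridSearchCV(KernelRidge(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best CV score (R²): {search.best_score_:.4f}")
```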
We have surveyed the major families of kernel functions, understanding their mathematical properties, geometric interpretations, and practical use cases.
The Kernel Landscape

Linear and polynomial kernels correspond to finite-dimensional feature spaces; the Gaussian RBF and the broader Matérn family provide universal, distance-based similarity with tunable smoothness; and periodic, rational quadratic, and composite kernels encode more specialized structure.
Congratulations! You have completed Module 1: The Kernel Trick. You now have a comprehensive understanding of:
• Why we need feature space mapping for nonlinear problems
• How kernels compute feature-space inner products implicitly
• The computational advantages of the kernel trick
• Mercer's theorem and the theory of valid kernels
• The major kernel families and how to choose among them
In the next module, we will apply these concepts to Kernel Ridge Regression—combining kernels with regularized least squares for powerful nonlinear regression.