With a solid understanding of the kernel trick, feature spaces, and Mercer's theorem, we now turn to the practical question: What kernels should we actually use?
Over decades of research and application, certain kernel families have emerged as particularly useful. Each kernel encodes specific assumptions about the structure of the target function—its smoothness, periodicity, locality, or correlation patterns. Choosing the right kernel is both a science and an art: the science lies in understanding kernel properties, while the art involves matching these properties to domain knowledge.
This page provides a comprehensive survey of the most important kernel functions, their mathematical properties, and practical guidance for selection.
By the end of this page, you will be able to:

• Describe the properties and use cases for major kernel families
• Understand geometric and probabilistic interpretations of each kernel
• Make informed kernel selections based on problem characteristics
• Tune kernel hyperparameters effectively
• Combine kernels to capture complex structure
The simplest kernel is the linear kernel, whose feature map is simply the identity: it corresponds to no feature transformation at all.
Definition
$$k_{\text{linear}}(\mathbf{x}, \mathbf{x}') = \mathbf{x}^\top \mathbf{x}'$$
Or with optional constant and scaling: $$k_{\text{linear}}(\mathbf{x}, \mathbf{x}') = c + \gamma \mathbf{x}^\top \mathbf{x}'$$
Feature Map: The identity $\phi(\mathbf{x}) = \mathbf{x}$ (for the basic form).
Properties
| Property | Value |
|---|---|
| Feature dimension | $d$ (input dimension) |
| Computational cost | $O(d)$ |
| Stationarity | Non-stationary |
| Universal | No |
| Characteristic | No |
The linear kernel is appropriate when:
• The relationship between features and target is genuinely linear
• Feature dimension $d$ is very high (e.g., text with bag-of-words) where explicit features are already rich
• You want to match the expressiveness of linear regression/SVM
• Computational efficiency is paramount
• Interpretability of weights is important
In high-dimensional sparse settings (NLP, genomics), linear kernels often work surprisingly well.
Geometric Interpretation
The linear kernel measures the cosine of the angle between vectors (when both have unit norm) or, more generally, the projection of one vector onto another scaled by their magnitudes.

Two vectors $\mathbf{x}$ and $\mathbf{x}'$ satisfy $k_{\text{linear}}(\mathbf{x}, \mathbf{x}') = \|\mathbf{x}\| \, \|\mathbf{x}'\| \cos\theta$, where $\theta$ is the angle between them: the kernel is large for long, aligned vectors, zero for orthogonal vectors, and negative when they point in opposing directions. A small numerical check appears below.
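To make this concrete, here is a minimal sketch (arbitrary example vectors) showing that the linear kernel of unit-normalized inputs is exactly their cosine similarity.

```python
import numpy as np

# Minimal sketch (arbitrary example vectors): the linear kernel of unit-norm
# inputs equals their cosine similarity.
x = np.array([3.0, 4.0])
x_prime = np.array([1.0, 2.0])

k = x @ x_prime                                          # linear kernel x^T x'
cos_sim = k / (np.linalg.norm(x) * np.linalg.norm(x_prime))

x_unit = x / np.linalg.norm(x)
xp_unit = x_prime / np.linalg.norm(x_prime)

print(f"k(x, x')              = {k:.4f}")
print(f"cosine similarity     = {cos_sim:.4f}")
print(f"k(x/||x||, x'/||x'||) = {x_unit @ xp_unit:.4f}  (matches cosine similarity)")
```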
Connection to Linear Models
Using the linear kernel in kernel ridge regression gives: $$\boldsymbol{\alpha} = (\mathbf{X}\mathbf{X}^\top + \lambda \mathbf{I})^{-1} \mathbf{y}$$
which is equivalent to standard ridge regression (using the Woodbury identity). The kernel formulation offers no computational advantage here—it's included for completeness and to highlight that linear models are a special case of kernel methods.
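A quick numerical check (a minimal sketch on synthetic data) confirms this equivalence: the dual coefficients obtained with the linear kernel give the same predictions as primal ridge regression.

```python
import numpy as np

# Minimal sketch with synthetic data: kernel ridge regression with the linear
# kernel reproduces primal ridge regression predictions exactly.
rng = np.random.default_rng(0)
n, d, lam = 50, 5, 0.1
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Dual (kernel) form: alpha = (X X^T + lam I)^{-1} y
K = X @ X.T
alpha = np.linalg.solve(K + lam * np.eye(n), y)

# Primal form: w = (X^T X + lam I)^{-1} X^T y
w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

X_test = rng.standard_normal((5, d))
pred_dual = (X_test @ X.T) @ alpha   # predictions via kernel evaluations
pred_primal = X_test @ w             # predictions via explicit weights

print("Max prediction difference:", np.max(np.abs(pred_dual - pred_primal)))
```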
Polynomial kernels extend linear kernels to capture nonlinear relationships while maintaining a finite-dimensional feature space.
Definition
$$k_{\text{poly}}(\mathbf{x}, \mathbf{x}') = (\gamma \mathbf{x}^\top \mathbf{x}' + c)^p$$
where:

• $p \in \mathbb{N}$ is the polynomial degree
• $\gamma > 0$ is a scale parameter
• $c \geq 0$ is a constant that controls the weight of lower-order terms

Common variants:

• Homogeneous polynomial kernel ($c = 0$): $(\gamma \mathbf{x}^\top \mathbf{x}')^p$, containing only degree-$p$ monomials
• Inhomogeneous polynomial kernel ($c > 0$): contains all monomials up to degree $p$
• Quadratic kernel ($p = 2$): the most common choice for capturing pairwise feature interactions
Feature Map
The feature map consists of all monomials up to degree $p$. For example, with $p = 2$, $c = 1$, $d = 2$:
$$\phi(x_1, x_2) = (1, \sqrt{2}x_1, \sqrt{2}x_2, x_1^2, \sqrt{2}x_1 x_2, x_2^2)$$
Feature dimension: $\binom{d + p}{p}$ for inhomogeneous; $\binom{d + p - 1}{p}$ for homogeneous.
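To verify the feature map above, here is a minimal sketch (arbitrary 2-D example points) checking that the inner product of the explicit degree-2 features matches the kernel value $(1 + \mathbf{x}^\top \mathbf{x}')^2$ computed directly.

```python
import numpy as np

# Minimal sketch: explicit degree-2 feature map vs. the polynomial kernel
# (p=2, c=1, gamma=1), using arbitrary 2-D example points.
def phi(x):
    """Explicit feature map for (1 + x.y)^2 in 2 dimensions."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, np.sqrt(2) * x1 * x2, x2**2])

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])

explicit = phi(x) @ phi(y)       # inner product in feature space
implicit = (1 + x @ y) ** 2      # kernel trick: no feature map needed

print(f"phi(x).phi(y) = {explicit:.4f}")
print(f"(1 + x.y)^2   = {implicit:.4f}")   # identical values
```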
| Property | Value |
|---|---|
| Feature dimension ($c > 0$) | $\binom{d + p}{p}$ |
| Computational cost | $O(d)$ |
| Stationarity | Non-stationary |
| Universal | No (for finite $p$) |
| Captures interactions | Yes, up to order $p$ |
Degree Selection

Low degrees ($p = 2$ or $3$) capture pairwise and three-way feature interactions and are usually sufficient; higher degrees rarely help and increase the risk of overfitting, as noted in the practical advice below.

Numerical Stability

High-degree polynomial kernels can suffer from numerical issues:

• Kernel values grow (or shrink) exponentially with $p$, so entries of the kernel matrix can span many orders of magnitude
• The resulting kernel matrix can become ill-conditioned, making training numerically unstable
• Normalizing or standardizing inputs (so that $\gamma \mathbf{x}^\top \mathbf{x}' + c$ stays close to 1) mitigates both problems
Practical Advice: Polynomial kernels work well for NLP (natural language processing) where the input is already high-dimensional and sparse. Degree 2 or 3 is usually sufficient.
```python
import numpy as np
from math import comb

def polynomial_kernel(x, y, degree=3, gamma=1.0, coef0=1.0):
    """
    Compute polynomial kernel: (γ x·y + c)^d

    Parameters:
        degree: polynomial degree p
        gamma: scale parameter γ
        coef0: constant term c
    """
    return (gamma * np.dot(x, y) + coef0) ** degree

# Feature dimension for different degrees
print("Polynomial Feature Dimensions")
print("=" * 50)
print(f"{'d (input dim)':>12} {'degree':>8} {'D (feature dim)':>15}")
print("-" * 50)

for d in [10, 50, 100, 1000]:
    for p in [2, 3, 4, 5]:
        D = comb(d + p, p)
        print(f"{d:>12} {p:>8} {D:>15,}")

# Kernel values for different choices
print("\nKernel Value Sensitivity")
print("=" * 50)
x = np.array([1.0, 2.0, 3.0])
y = np.array([0.5, 1.5, 2.5])

inner = np.dot(x, y)
print(f"x · y = {inner}")
print(f"")

for degree in [1, 2, 3, 4, 5]:
    for coef0 in [0, 1]:
        k = polynomial_kernel(x, y, degree=degree, coef0=coef0)
        label = f"d={degree}, c={coef0}"
        print(f" {label:15} k(x,y) = {k:>15.2f}")
```

The Gaussian Radial Basis Function (RBF) kernel is arguably the most important kernel in practice. It provides a universal, infinitely smooth similarity measure based on Euclidean distance.
Definition
$$k_{\text{RBF}}(\mathbf{x}, \mathbf{x}') = \exp\left( -\gamma \|\mathbf{x} - \mathbf{x}'\|^2 \right) = \exp\left( -\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\sigma^2} \right)$$
Two equivalent parameterizations:

• $\gamma$ form: larger $\gamma$ means faster decay with distance (narrower bandwidth)
• $\sigma$ (length-scale) form: related by $\gamma = \frac{1}{2\sigma^2}$; larger $\sigma$ means smoother, more global similarity
Key Properties
| Property | Value |
|---|---|
| Feature dimension | Infinite |
| Computational cost | $O(d)$ |
| Stationarity | Stationary (translation-invariant) |
| Universal | Yes |
| Characteristic | Yes |
| Self-similarity | $k(\mathbf{x}, \mathbf{x}) = 1$ always |
| Range | $(0, 1]$ |
The RBF kernel is an excellent default choice when:
• You have no strong prior knowledge about the function structure
• You expect smooth, continuous relationships
• Local patterns matter (nearby points should have similar outputs)
• Sample size is moderate (≤ 10,000 for exact kernel methods)
• You're willing to tune the bandwidth parameter
"When in doubt, try RBF" is reasonable advice for kernel methods.
The Bandwidth Parameter $\gamma$ (or $\sigma$)
The bandwidth controls the locality of the kernel—how quickly similarity decays with distance:

• Large $\gamma$ (small $\sigma$): only very close points look similar; the model can fit fine detail but risks overfitting
• Small $\gamma$ (large $\sigma$): distant points still look similar; the model is smoother but risks underfitting
• A common starting point is the median heuristic, $\gamma = 1 / \left(2 \cdot \text{median}(\|\mathbf{x}_i - \mathbf{x}_j\|^2)\right)$, demonstrated in the code below
Geometric Interpretation
The RBF kernel value decreases as a Gaussian function of distance: points at distance $\sigma$ have similarity $e^{-1/2} \approx 0.61$, points at $2\sigma$ have $e^{-2} \approx 0.14$, and points beyond $3\sigma$ have negligible influence—the kernel is effectively local.
```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    """
    Gaussian RBF kernel: exp(-γ ||x - y||²)
    """
    sq_dist = np.sum((x - y)**2)
    return np.exp(-gamma * sq_dist)

def rbf_kernel_matrix(X, gamma=1.0):
    """
    Compute RBF kernel matrix efficiently using broadcasting.
    """
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    return np.exp(-gamma * sq_dists)

# Demonstrate effect of gamma
np.random.seed(42)
n = 5
X = np.random.randn(n, 2)

print("RBF Kernel Matrices for Different γ Values")
print("=" * 60)

for gamma in [0.1, 1.0, 10.0, 100.0]:
    K = rbf_kernel_matrix(X, gamma)
    print(f"\nγ = {gamma}")
    print(f"  Kernel matrix (first 3x3):")
    print(f"  {K[:3, :3].round(4)}")
    print(f"  Off-diagonal range: [{K[K != 1].min():.4f}, {K[K != 1].max():.4f}]")

    # Effective neighborhood size
    threshold = 0.01  # Consider neighbors with k > 0.01
    avg_neighbors = np.mean(np.sum(K > threshold, axis=1))
    print(f"  Avg neighbors (k > 0.01): {avg_neighbors:.1f}")

# Heuristic for gamma selection
print("\nGamma Selection Heuristics")
print("=" * 60)
print(f"Median heuristic: γ = 1 / (2 × median(||xᵢ - xⱼ||²))")
pairwise_sq_dists = np.sum((X[:, None, :] - X[None, :, :])**2, axis=2)
median_sq_dist = np.median(pairwise_sq_dists[pairwise_sq_dists > 0])
gamma_median = 1 / (2 * median_sq_dist)
print(f"  For this data: γ = {gamma_median:.4f}")
```

The Matérn family of kernels generalizes both the Gaussian RBF and Laplacian kernels, providing control over the smoothness of the target function.
Matérn Kernel
$$k_{\text{Matérn}}(\mathbf{x}, \mathbf{x}') = \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}\, \|\mathbf{x} - \mathbf{x}'\|}{\ell} \right)^\nu K_\nu\!\left( \frac{\sqrt{2\nu}\, \|\mathbf{x} - \mathbf{x}'\|}{\ell} \right)$$

where:

• $\nu > 0$ is the smoothness parameter
• $\ell > 0$ is the length scale
• $\Gamma$ is the gamma function
• $K_\nu$ is the modified Bessel function of the second kind
Special Cases
| $\nu$ | Kernel Name | Differentiability | Formula |
|---|---|---|---|
| $\frac{1}{2}$ | Laplacian / Exponential | Not differentiable | $\exp(-r/\ell)$ |
| $\frac{3}{2}$ | Matérn 3/2 | Once differentiable | $(1 + \sqrt{3}r/\ell) \exp(-\sqrt{3}r/\ell)$ |
| $\frac{5}{2}$ | Matérn 5/2 | Twice differentiable | $(1 + \sqrt{5}r/\ell + 5r^2/(3\ell^2)) \exp(-\sqrt{5}r/\ell)$ |
| $\infty$ | Gaussian RBF | Infinitely differentiable | $\exp(-r^2/2\ell^2)$ |
where $r = \|\mathbf{x} - \mathbf{x}'\|$.
The smoothness parameter $\nu$ controls how differentiable functions in the RKHS are:
• Functions in a Matérn-$\nu$ RKHS are $\lceil \nu \rceil - 1$ times differentiable
• Gaussian RBF ($\nu = \infty$) gives infinitely smooth functions—sometimes unrealistically smooth!
• For real-world data with some roughness, Matérn-3/2 or 5/2 often works better than RBF
The Laplacian Kernel (Matérn-1/2)
$$k_{\text{Laplacian}}(\mathbf{x}, \mathbf{x}') = \exp\left( -\gamma \|\mathbf{x} - \mathbf{x}'\| \right)$$

Note: the exponent uses the Euclidean (L2) norm itself, not the squared norm as in the RBF kernel.
Properties:

• Stationary, universal, and characteristic, like the RBF kernel
• Heavier tails: similarity decays more slowly with distance than the Gaussian
• Corresponds to rough (non-differentiable) functions in the RKHS

When to Use:

• The target function has kinks, abrupt changes, or sharp local variation
• The RBF kernel appears to oversmooth the data
• Infinite smoothness is physically unrealistic for the phenomenon being modeled
```python
import numpy as np

def laplacian_kernel(x, y, gamma=1.0):
    """Laplacian kernel: exp(-γ ||x - y||)"""
    dist = np.sqrt(np.sum((x - y)**2))
    return np.exp(-gamma * dist)

def matern_12(r, length_scale=1.0):
    """Matérn ν=1/2 (Laplacian)"""
    return np.exp(-r / length_scale)

def matern_32(r, length_scale=1.0):
    """Matérn ν=3/2"""
    scaled = np.sqrt(3) * r / length_scale
    return (1 + scaled) * np.exp(-scaled)

def matern_52(r, length_scale=1.0):
    """Matérn ν=5/2"""
    scaled = np.sqrt(5) * r / length_scale
    return (1 + scaled + scaled**2 / 3) * np.exp(-scaled)

def rbf_from_r(r, length_scale=1.0):
    """Gaussian RBF (Matérn ν=∞)"""
    return np.exp(-0.5 * (r / length_scale)**2)

# Compare kernel shapes
print("Matérn Family Comparison (length_scale = 1)")
print("=" * 60)
print(f"{'Distance r':>12} {'ν=1/2':>10} {'ν=3/2':>10} {'ν=5/2':>10} {'ν=∞(RBF)':>10}")
print("-" * 60)

for r in [0, 0.5, 1.0, 1.5, 2.0, 3.0, 5.0]:
    k12 = matern_12(r)
    k32 = matern_32(r)
    k52 = matern_52(r)
    krbf = rbf_from_r(r)
    print(f"{r:>12.1f} {k12:>10.4f} {k32:>10.4f} {k52:>10.4f} {krbf:>10.4f}")

print("\nObservation: Smaller ν → heavier tails (slower decay)")
print("Smaller ν → rougher functions in RKHS")
```

Beyond the standard kernels, specialized kernels encode domain-specific structure.
Periodic Kernel
$$k_{\text{periodic}}(x, x') = \exp\left( -\frac{2 \sin^2(\pi |x - x'| / p)}{\ell^2} \right)$$
where $p$ is the period and $\ell$ is the length scale.
Properties:

• Exactly periodic: $k(x, x + p) = k(x, x)$ for every $x$
• Stationary (depends only on $x - x'$)
• Useful for seasonal or cyclical signals; in practice it is often multiplied by an RBF kernel so the periodic pattern can decay over time (the "locally periodic" kernel in the code below)
Rational Quadratic Kernel
$$k_{\text{RQ}}(\mathbf{x}, \mathbf{x}') = \left( 1 + \frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\alpha\ell^2} \right)^{-\alpha}$$

The RQ kernel is an infinite mixture of RBF kernels with different length scales. The parameter $\alpha$ controls the mixture:

• Small $\alpha$: a broad mixture of length scales, giving heavier tails (similarity persists at large distances)
• Large $\alpha$: the mixture concentrates, and as $\alpha \to \infty$ the RQ kernel converges to the Gaussian RBF with length scale $\ell$
The sigmoid kernel $\tanh(\gamma \mathbf{x}^\top \mathbf{x}' + c)$ is NOT positive semi-definite for all parameter values! It is only valid for certain ranges of $\gamma$ and $c$ depending on the data. Use with extreme caution or prefer true neural network models instead.
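To see why, here is a minimal diagnostic sketch (synthetic data, arbitrary parameter settings) that computes the smallest eigenvalue of the tanh Gram matrix for a few $(\gamma, c)$ choices; a negative value means the matrix is not positive semi-definite.

```python
import numpy as np

# Minimal diagnostic sketch (synthetic data, arbitrary parameters): check whether
# the sigmoid "kernel" tanh(γ x·y + c) yields a positive semi-definite Gram matrix.
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 3))

print(f"{'gamma':>8} {'coef0':>8} {'min eigenvalue':>16}")
for gamma, coef0 in [(0.1, 0.0), (1.0, 1.0), (1.0, -1.0), (2.0, -2.0)]:
    K = np.tanh(gamma * X @ X.T + coef0)
    min_eig = np.min(np.linalg.eigvalsh(K))  # K is symmetric, so eigvalsh applies
    verdict = "not PSD" if min_eig < -1e-10 else "PSD for this sample"
    print(f"{gamma:>8.1f} {coef0:>8.1f} {min_eig:>16.6f}   {verdict}")
```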
```python
import numpy as np

def periodic_kernel(x, y, length_scale=1.0, period=1.0):
    """Periodic kernel for cyclical patterns."""
    dist = np.abs(x - y)
    return np.exp(-2 * np.sin(np.pi * dist / period)**2 / length_scale**2)

def rational_quadratic_kernel(x, y, alpha=1.0, length_scale=1.0):
    """Rational quadratic kernel (infinite mixture of RBFs)."""
    sq_dist = np.sum((x - y)**2)
    return (1 + sq_dist / (2 * alpha * length_scale**2))**(-alpha)

def locally_periodic_kernel(x, y, length_scale=1.0, period=1.0, decay=1.0):
    """
    Locally periodic: periodic × RBF
    Captures decaying periodic patterns.
    """
    periodic = periodic_kernel(x, y, length_scale, period)
    rbf = np.exp(-np.sum((x - y)**2) / (2 * decay**2))
    return periodic * rbf

# Demonstrate periodic kernel (scalar inputs, so formatted printing works)
print("Periodic Kernel Demo (period=2π)")
print("=" * 50)

period = 2 * np.pi
for delta in [0, np.pi/4, np.pi/2, np.pi, 3*np.pi/2, 2*np.pi, 5*np.pi/2]:
    x, y = 0.0, delta
    k = periodic_kernel(x, y, length_scale=1.0, period=period)
    print(f"  |x - y| = {delta/np.pi:.2f}π → k(x, y) = {k:.4f}")

print("\nNote: k(0, 2π) = k(0, 0) = 1 (exact period)")

# Compare RQ kernel for different alpha
print("\nRational Quadratic vs RBF Comparison")
print("=" * 50)
print(f"{'Distance':>10} {'RBF':>10} {'RQ α=1':>10} {'RQ α=10':>10} {'RQ α=0.1':>10}")
print("-" * 50)

for r in [0, 0.5, 1.0, 2.0, 5.0]:
    x, y = 0.0, float(r)
    rbf = np.exp(-r**2 / 2)
    rq1 = rational_quadratic_kernel(x, y, alpha=1.0)
    rq10 = rational_quadratic_kernel(x, y, alpha=10.0)
    rq01 = rational_quadratic_kernel(x, y, alpha=0.1)
    print(f"{r:>10.1f} {rbf:>10.4f} {rq1:>10.4f} {rq10:>10.4f} {rq01:>10.4f}")

print("\nSmall α: heavier tails (retains similarity at large distances)")
```

Complex real-world phenomena often exhibit multiple types of structure: smooth trends, periodic oscillations, and local variations. Kernel composition allows us to build kernels that capture these composite patterns.
Kernel Algebra Recap
Valid kernels can be combined to form new valid kernels:
Sum: $k(\mathbf{x}, \mathbf{x}') = k_1(\mathbf{x}, \mathbf{x}') + k_2(\mathbf{x}, \mathbf{x}')$
Product: $k(\mathbf{x}, \mathbf{x}') = k_1(\mathbf{x}, \mathbf{x}') \cdot k_2(\mathbf{x}, \mathbf{x}')$
Tensor/Direct Sum (for multi-dimensional inputs): split the input into blocks $\mathbf{x} = (\mathbf{x}_a, \mathbf{x}_b)$ and combine a kernel on each block, either additively, $k(\mathbf{x}, \mathbf{x}') = k_a(\mathbf{x}_a, \mathbf{x}_a') + k_b(\mathbf{x}_b, \mathbf{x}_b')$ (direct sum), or multiplicatively, $k(\mathbf{x}, \mathbf{x}') = k_a(\mathbf{x}_a, \mathbf{x}_a') \, k_b(\mathbf{x}_b, \mathbf{x}_b')$ (tensor product); a minimal sketch follows.
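Here is that sketch (hypothetical feature split, illustrative parameters): an RBF kernel on one block of features, a linear kernel on another, combined by direct sum and tensor product.

```python
import numpy as np

# Minimal sketch (hypothetical feature split and parameters): combine kernels
# over different feature blocks via direct sum and tensor product.
def rbf(a, b, length_scale=1.0):
    return np.exp(-np.sum((a - b)**2) / (2 * length_scale**2))

def linear(a, b):
    return float(np.dot(a, b))

def block_kernels(x, x_prime, split=2):
    """Split the input into two blocks and kernelize each block separately."""
    xa, xb = x[:split], x[split:]
    xa_p, xb_p = x_prime[:split], x_prime[split:]
    k_a = rbf(xa, xa_p, length_scale=1.0)   # smooth similarity on block a
    k_b = linear(xb, xb_p)                  # linear similarity on block b
    return k_a, k_b

x = np.array([0.2, -0.5, 1.0, 2.0])
x_prime = np.array([0.3, -0.4, 0.5, 1.5])

k_a, k_b = block_kernels(x, x_prime)
print(f"Direct sum:     k_a + k_b = {k_a + k_b:.4f}")
print(f"Tensor product: k_a * k_b = {k_a * k_b:.4f}")
```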
A common pattern for time series:
$k = k_{\text{trend}} + k_{\text{seasonal}} + k_{\text{noise}}$
$= \text{RBF}(\text{long length-scale}) + \text{Periodic} \times \text{RBF}(\text{decay}) + \text{White noise}$
This automatically decomposes the signal into interpretable components!
```python
import numpy as np

class CompositeKernel:
    """
    Build composite kernels from basic kernels.
    """
    def __init__(self, kernel_func):
        self.kernel_func = kernel_func

    def __call__(self, x, y):
        return self.kernel_func(x, y)

    def __add__(self, other):
        """Sum of kernels: k1 + k2"""
        return CompositeKernel(lambda x, y: self(x, y) + other(x, y))

    def __mul__(self, other):
        """Product of kernels: k1 * k2"""
        if isinstance(other, (int, float)):
            return CompositeKernel(lambda x, y: other * self(x, y))
        return CompositeKernel(lambda x, y: self(x, y) * other(x, y))

    def __rmul__(self, other):
        return self.__mul__(other)

# Define base kernels
def make_rbf(length_scale):
    return CompositeKernel(
        lambda x, y: np.exp(-np.sum((x - y)**2) / (2 * length_scale**2))
    )

def make_periodic(period, length_scale):
    return CompositeKernel(
        lambda x, y: np.exp(-2 * np.sin(np.pi * np.abs(x - y) / period)**2 / length_scale**2)
    )

def make_linear():
    return CompositeKernel(lambda x, y: np.dot(x, y))

def make_white_noise(variance):
    return CompositeKernel(
        lambda x, y: variance if np.allclose(x, y) else 0
    )

# Build a composite kernel for time series
# Pattern: long-term trend + decaying seasonality + noise
trend = make_rbf(length_scale=10.0)   # Long-range smooth trend
seasonal = make_periodic(period=1.0, length_scale=0.5) * make_rbf(length_scale=5.0)
noise = make_white_noise(variance=0.1)

time_series_kernel = 1.0 * trend + 0.5 * seasonal + noise

# Evaluate the composite kernel
print("Composite Time Series Kernel")
print("=" * 60)
print("k = 1.0 × RBF(ℓ=10) + 0.5 × (Periodic(p=1) × RBF(ℓ=5)) + WhiteNoise(σ²=0.1)")
print("")

# Kernel matrix for scalar sample time points
t = np.linspace(0, 5, 10)
K = np.zeros((10, 10))
for i in range(10):
    for j in range(10):
        K[i, j] = time_series_kernel(t[i], t[j])

print("Sample kernel matrix (first 5×5):")
print(K[:5, :5].round(3))
```

Choosing the right kernel is a crucial modeling decision. Here we distill practical guidelines based on problem characteristics.
Decision Framework
| Problem Characteristic | Recommended Kernel(s) | Reasoning |
|---|---|---|
| Genuinely linear relationship | Linear | Simplest, most interpretable |
| Unknown smooth function | Gaussian RBF | Universal, flexible, good default |
| Function with abrupt changes | Laplacian or Matérn-1/2 | Non-differentiable functions |
| Moderately rough function | Matérn-3/2 or 5/2 | Finite differentiability |
| Periodic patterns | Periodic × RBF | Captures cyclical behavior with decay |
| Multi-scale patterns | Rational Quadratic | Infinite mixture of length scales |
| High-dimensional sparse data | Linear or low-degree Polynomial | Avoids curse of dimensionality |
| Feature interactions matter | Polynomial (degree 2-3) | Explicit interaction terms |
| Text/sequence data | String kernels | Structure-aware comparison |
| Unknown structure | Multiple kernel learning | Learn the right combination |
Common hyperparameter defaults and tuning ranges:
• RBF $\gamma$: start with $1/(d \cdot \text{Var}(X))$, search $[10^{-4}, 10^4]$ on a log scale
• Polynomial degree: try 2 or 3, rarely higher
• Regularization $\lambda$: search $[10^{-8}, 10^2]$ on a log scale
• Matérn $\nu$: usually fix at 3/2 or 5/2; rarely tune
Grid search or Bayesian optimization work well for kernel hyperparameter tuning.
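As a minimal illustration (synthetic data, an illustrative grid narrower than the ranges above, using scikit-learn's KernelRidge and GridSearchCV), a log-scale search over the RBF bandwidth and regularization strength might look like this.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

# Minimal sketch (synthetic data, illustrative grid): log-scale search over the
# RBF bandwidth gamma and the regularization strength alpha (λ).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(200)

param_grid = {
    "gamma": np.logspace(-3, 3, 7),   # RBF bandwidth γ
    "alpha": np.logspace(-6, 1, 8),   # regularization λ
}
search = GridSearchCV(KernelRidge(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best CV score (R²): {search.best_score_:.4f}")
```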
We have surveyed the major families of kernel functions, understanding their mathematical properties, geometric interpretations, and practical use cases.
The Kernel Landscape

Linear and polynomial kernels correspond to finite-dimensional feature spaces; the Gaussian RBF and the broader Matérn family provide universal, distance-based similarity with tunable smoothness; and periodic, rational quadratic, and composite kernels encode more specialized structure.
Congratulations! You have completed Module 1: The Kernel Trick. You now have a comprehensive understanding of:
• Why we need feature space mapping for nonlinear problems
• How kernels compute feature-space inner products implicitly
• The computational advantages of the kernel trick
• Mercer's theorem and the theory of valid kernels
• The major kernel families and how to choose among them
In the next module, we will apply these concepts to Kernel Ridge Regression—combining kernels with regularized least squares for powerful nonlinear regression.