Standard kernels like RBF and Matérn operate on Euclidean vector spaces, where each input is a fixed-length vector of real numbers. However, many real-world data types don't naturally fit this representation: variable-length strings, graphs, sets and distributions, images, and time series.
Domain-specific kernels enable kernel methods to operate directly on these structured data types, without requiring hand-crafted feature engineering.
This page covers kernel functions for strings (spectrum, subsequence), graphs (random walk, Weisfeiler-Lehman), sets and distributions (set kernels, MMD), and images (pyramid match, spatial). You'll understand the mathematical principles that enable kernels over structured data.
String kernels measure similarity between sequences of symbols. They are fundamental in computational biology (DNA/protein analysis), natural language processing (text classification), and cybersecurity (malware detection from code patterns).
The Key Insight:
Strings can be mapped to feature vectors by counting occurrences of substrings, characters, or patterns. The kernel computes the inner product in this (potentially infinite-dimensional) feature space without explicit construction.
Spectrum Kernel (Leslie et al., 2002):
The k-spectrum kernel represents a string by the frequency of all contiguous substrings of length $k$:
$$k(s, t) = \langle \Phi_k(s), \Phi_k(t) \rangle = \sum_{u \in \Sigma^k} \#_u(s) \cdot \#_u(t)$$
where $\Sigma$ is the alphabet, $\Sigma^k$ is the set of all $k$-mers, and $\#_u(s)$ counts occurrences of substring $u$ in string $s$.
Example: For $k=3$ and strings $s=\text{"GATTACA"}$, $t=\text{"ATTGAT"}$: the 3-mers of $s$ are GAT, ATT, TTA, TAC, ACA, and those of $t$ are ATT, TTG, TGA, GAT. The shared 3-mers GAT and ATT each occur once in both strings, so $k(s, t) = 1 \cdot 1 + 1 \cdot 1 = 2$.
Naively computing the spectrum kernel requires O(|Σ|^k) space. Suffix trees enable O(n + m) computation where n, m are string lengths. This makes spectrum kernels practical for long sequences like genomes.
Subsequence Kernel (Lodhi et al., 2002):
Unlike the spectrum kernel which requires contiguous matches, the subsequence kernel allows gaps, weighted by a decay factor:
$$k(s, t) = \sum_{u \in \Sigma^k} \sum_{\mathbf{i}: s[\mathbf{i}]=u} \sum_{\mathbf{j}: t[\mathbf{j}]=u} \lambda^{l(\mathbf{i}) + l(\mathbf{j})}$$
where $\mathbf{i}$ is a tuple of indices, $s[\mathbf{i}]$ extracts the subsequence at those positions, and $l(\mathbf{i})$ is the "spread" of the indices (last - first + 1). The decay $\lambda \in (0,1)$ penalizes non-contiguous matches.
Intuition: Two strings are similar if they share common subsequences, with contiguous matches worth more than scattered ones.
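The recursion of Lodhi et al. can be illustrated with a naive memoized sketch (the optimized dynamic program runs in $O(k|s||t|)$; this direct transcription is slower but easier to follow, and the helper names `Kprime`/`K` are mine):

```python
from functools import lru_cache

def subsequence_kernel(s: str, t: str, k: int = 2, lam: float = 0.5) -> float:
    """Gap-weighted subsequence kernel (naive memoized recursion)."""

    @lru_cache(maxsize=None)
    def Kprime(i, m, n):
        # Auxiliary term over prefixes s[:m], t[:n]: counts length-i
        # subsequences, penalizing distance to the ends of both prefixes.
        if i == 0:
            return 1.0
        if m < i or n < i:
            return 0.0
        a = s[m - 1]
        total = lam * Kprime(i, m - 1, n)
        for j in range(n):
            if t[j] == a:
                total += Kprime(i - 1, m - 1, j) * lam ** (n - j + 1)
        return total

    @lru_cache(maxsize=None)
    def K(m, n):
        # The kernel itself over prefixes s[:m], t[:n]
        if m < k or n < k:
            return 0.0
        a = s[m - 1]
        total = K(m - 1, n)
        for j in range(n):
            if t[j] == a:
                total += Kprime(k - 1, m - 1, j) * lam ** 2
        return total

    return K(len(s), len(t))
```

For `s="cat"`, `t="car"`, `k=2`, the only shared length-2 subsequence is `"ca"`, contributing $\lambda^{2+2}$; with $\lambda = 0.5$ the kernel value is $0.0625$.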
Mismatch Kernel:
The mismatch kernel extends spectrum kernels to allow up to $m$ mismatches:
$$k(s, t) = \sum_{u \in \Sigma^k} \sum_{v \in N_m(u)} \mathbf{1}[u \sqsubset s] \cdot \mathbf{1}[v \sqsubset t]$$
where $N_m(u)$ is the neighborhood of $u$ (all strings within Hamming distance $m$).
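A brute-force sketch of this feature map (the function names are illustrative; practical implementations use mismatch-tree data structures instead of enumerating $\Sigma^k$):

```python
from collections import Counter
from itertools import product

def neighborhood(v: str, m: int, alphabet: str):
    """All strings over `alphabet` within Hamming distance m of v (brute force)."""
    return [''.join(u) for u in product(alphabet, repeat=len(v))
            if sum(a != b for a, b in zip(u, v)) <= m]

def mismatch_features(s: str, k: int, m: int, alphabet: str) -> Counter:
    """phi_u(s) = number of k-mers of s whose mismatch neighborhood contains u."""
    feats = Counter()
    for i in range(len(s) - k + 1):
        feats.update(neighborhood(s[i:i+k], m, alphabet))
    return feats

def mismatch_kernel(s: str, t: str, k: int = 3, m: int = 1,
                    alphabet: str = "ACGT") -> float:
    fs = mismatch_features(s, k, m, alphabet)
    ft = mismatch_features(t, k, m, alphabet)
    return sum(fs[u] * ft[u] for u in fs if u in ft)
```

With `m=0` the neighborhood contains only the k-mer itself, and the kernel reduces to the spectrum kernel.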
| Kernel | Matches | Gap Penalty | Complexity | Use Case |
|---|---|---|---|---|
| Spectrum | Exact contiguous k-mers | None (exact only) | O(n + m) | Protein classification, spam detection |
| Mismatch | Approximate (≤m mismatches) | None | O(n·k·m·|Σ|^m) | Mutation-tolerant bio-sequences |
| Subsequence | Non-contiguous | λ decay | O(n·m·k) | Text classification, semantic similarity |
| Gap-weighted | Non-contiguous with specific gaps | Per-gap λ | O(n·m·k·g) | Structured sequence patterns |
```python
from collections import Counter
import numpy as np

def spectrum_kernel(s: str, t: str, k: int = 3) -> float:
    """
    Compute the k-spectrum kernel between two strings.
    K(s, t) = <phi(s), phi(t)> where phi maps to k-mer counts.
    """
    def get_kmers(string, k):
        return [string[i:i+k] for i in range(len(string) - k + 1)]

    kmers_s = Counter(get_kmers(s, k))
    kmers_t = Counter(get_kmers(t, k))

    # Inner product of count vectors
    return sum(kmers_s[kmer] * kmers_t[kmer]
               for kmer in kmers_s if kmer in kmers_t)

def normalized_spectrum_kernel(s: str, t: str, k: int = 3) -> float:
    """
    Normalized spectrum kernel: K(s,t) / sqrt(K(s,s) * K(t,t))
    Produces values in [0, 1].
    """
    k_st = spectrum_kernel(s, t, k)
    k_ss = spectrum_kernel(s, s, k)
    k_tt = spectrum_kernel(t, t, k)
    if k_ss == 0 or k_tt == 0:
        return 0.0
    return k_st / np.sqrt(k_ss * k_tt)

# Example usage
s1 = "GATTACACGATGCATGCAT"
s2 = "GATCATGCATGCATTACA"
s3 = "AAAAAAAAAAAAAAAAAAA"

print(f"K(s1, s2) = {spectrum_kernel(s1, s2, k=4)}")  # High similarity
print(f"K(s1, s3) = {spectrum_kernel(s1, s3, k=4)}")  # Low similarity
print(f"Normalized K(s1, s2) = {normalized_spectrum_kernel(s1, s2, k=4):.4f}")
```

Graph kernels measure similarity between graphs, enabling kernel methods for molecular property prediction, social network analysis, and program analysis. The key challenge is that graphs have variable size and no canonical node ordering.
The Graph Isomorphism Challenge:
Determining whether two graphs are identical (isomorphic) is computationally hard (GI-complete). Practical graph kernels compute approximate similarity based on structural features that are efficient to compute.
Random Walk Kernel (Gärtner et al., 2003):
The random walk kernel counts matching walks between two graphs:
$$k(G, G') = \sum_{l=0}^{\infty} \mu_l \sum_{p \in \text{walks}_l(G)} \sum_{p' \in \text{walks}_l(G')} k_{\text{walk}}(p, p')$$
where $\text{walks}_l(G)$ is the set of walks of length $l$ in $G$, and $k_{\text{walk}}$ compares node/edge labels along walks.
Efficient Computation: Using the direct product graph $G \times G'$ (vertices are pairs of vertices, edges connect pairs if both components are connected), the kernel reduces to counting walks in $G \times G'$:
$$k(G, G') = \mathbf{1}^\top (I - \lambda A_{\times})^{-1} \mathbf{1}$$
where $A_{\times}$ is the adjacency matrix of the product graph and $\lambda < 1/\rho(A_{\times})$ for convergence.
Random walk kernels can suffer from 'tottering'—walks that immediately backtrack. This inflates similarity between graphs with high-degree hubs. Solutions include forbidding backtracking or using shortest-path kernels instead.
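For unlabeled, undirected graphs the product-graph formula reduces to a few lines of linear algebra. This is only a sketch: labeled graphs need a label-aware product graph, and scalable implementations avoid materializing the Kronecker product.

```python
import numpy as np

def random_walk_kernel(A1, A2, lam=0.05):
    """
    Geometric random walk kernel k(G, G') = 1^T (I - lam * A_x)^{-1} 1,
    where A_x = A1 (x) A2 is the adjacency matrix of the direct product
    graph (a Kronecker product, for unlabeled graphs).
    """
    Ax = np.kron(A1, A2)
    N = Ax.shape[0]
    # Spectral radius check (eigvalsh assumes symmetric/undirected adjacency)
    rho = np.max(np.abs(np.linalg.eigvalsh(Ax)))
    if lam * rho >= 1:
        raise ValueError("lam must be < 1/rho(A_x) for the series to converge")
    ones = np.ones(N)
    # Solve (I - lam * A_x) v = 1 instead of inverting explicitly
    return ones @ np.linalg.solve(np.eye(N) - lam * Ax, ones)
```

At `lam=0` the value is simply $|V| \cdot |V'|$; positive `lam` adds mass for every matching walk, so connected graphs score strictly higher.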
Weisfeiler-Lehman (WL) Subtree Kernel (Shervashidze et al., 2011):
The WL kernel uses iterative neighborhood aggregation to create node labels:
$$k(G, G') = \sum_{h=0}^{H} k_{\text{base}}(G^{(h)}, G'^{(h)})$$
where $G^{(h)}$ is graph $G$ with labels from iteration $h$, and $k_{\text{base}}$ is typically a label histogram comparison.
Complexity: $O(Hm)$ per graph where $m$ is the number of edges. Linear in graph size—highly scalable.
| Kernel | Feature Type | Complexity | Distinguishing Power |
|---|---|---|---|
| Random Walk | Matching walks | O(n³) naively, O(n²m) optimized | Weak (tottering issues) |
| Shortest Path | Shortest path lengths + labels | O(n² log n + n²d) | Moderate |
| Weisfeiler-Lehman | Subtree patterns | O(Hm) | Strong (1-WL equivalent) |
| Graphlet | Small subgraph counts | O(n^k) for k-graphlets | Strong but expensive |
| Message Passing (neural) | Learned representations | O(Hm) | Flexible, data-dependent |
```python
import numpy as np
from collections import Counter
import hashlib

def wl_subtree_kernel(graphs, h=3):
    """
    Weisfeiler-Lehman subtree kernel.

    Parameters:
    -----------
    graphs : list of (adjacency_list, node_labels)
        node_labels maps each node to a string label.
    h : number of WL iterations

    Returns:
    --------
    K : kernel matrix between all input graphs
    """
    n = len(graphs)
    all_features = []

    for adj_list, labels in graphs:
        features = Counter()
        current_labels = labels.copy()

        for iteration in range(h + 1):
            # Count current labels
            features.update(current_labels.values())

            if iteration < h:
                # WL relabeling
                new_labels = {}
                for node in adj_list:
                    # Aggregate neighbor labels
                    neighbor_labels = sorted(current_labels[nb]
                                             for nb in adj_list[node])
                    # Hash to create new compressed label
                    signature = f"{current_labels[node]}_{'_'.join(neighbor_labels)}"
                    new_labels[node] = hashlib.md5(
                        signature.encode()).hexdigest()[:8]
                current_labels = new_labels

        all_features.append(features)

    # Compute kernel matrix (histogram intersection or linear)
    all_labels = set()
    for f in all_features:
        all_labels.update(f.keys())

    # Vectorize label histograms
    X = np.zeros((n, len(all_labels)))
    label_to_idx = {l: i for i, l in enumerate(all_labels)}
    for i, f in enumerate(all_features):
        for label, count in f.items():
            X[i, label_to_idx[label]] = count

    # Linear kernel
    return X @ X.T
```

Many real-world objects are naturally represented as sets (unordered collections) or probability distributions. Examples include point clouds, bags of local image features, documents treated as sets of words, and samples drawn from an underlying distribution.
Set kernels must be permutation-invariant: the kernel value should not depend on the order of elements.
Set Kernels:
Sum Kernel (naive): $$k(X, Y) = \sum_{x \in X} \sum_{y \in Y} k_{\text{base}}(x, y)$$
This counts total similarity across all pairs but ignores set structure.
Mean Kernel: $$k(X, Y) = \frac{1}{|X||Y|} \sum_{x \in X} \sum_{y \in Y} k_{\text{base}}(x, y)$$
Equivalent to kernel between mean embeddings in the RKHS.
Match Kernel: $$k(X, Y) = \sum_{x \in X} \max_{y \in Y} k_{\text{base}}(x, y)$$
Each element in $X$ matches to its best counterpart in $Y$. Note: not symmetric as written.
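A common fix is to average the two matching directions. This sketch does exactly that; note that even the symmetrized version is not guaranteed positive definite.

```python
import numpy as np

def match_kernel(X, Y, base_kernel):
    """
    Symmetrized best-match set kernel:
    0.5 * [ sum_x max_y k(x, y) + sum_y max_x k(x, y) ].
    """
    K = base_kernel(X, Y)  # (|X|, |Y|) Gram matrix between set elements
    return 0.5 * (K.max(axis=1).sum() + K.max(axis=0).sum())
```

Swapping the arguments transposes the Gram matrix, so the averaged score is the same in both directions.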
The mean embedding $\mu_P = \mathbb{E}_{x \sim P}[\phi(x)]$ maps a distribution $P$ to a point in the RKHS. For characteristic kernels (such as the RBF kernel), this embedding is injective: different distributions map to different points. This enables powerful distribution comparison.
Maximum Mean Discrepancy (MMD):
The MMD measures the distance between two distributions in RKHS:
$$\text{MMD}^2(P, Q) = \|\mu_P - \mu_Q\|_{\mathcal{H}}^2$$
Expanding: $$\text{MMD}^2 = \mathbb{E}_{x,x' \sim P}[k(x,x')] - 2\,\mathbb{E}_{x \sim P,\, y \sim Q}[k(x,y)] + \mathbb{E}_{y,y' \sim Q}[k(y,y')]$$
Given samples $\{x_i\}_{i=1}^n \sim P$ and $\{y_j\}_{j=1}^m \sim Q$, the unbiased estimator is:
$$\widehat{\text{MMD}}^2 = \frac{1}{n(n-1)} \sum_{i \neq j} k(x_i, x_j) - \frac{2}{nm} \sum_{i,j} k(x_i, y_j) + \frac{1}{m(m-1)} \sum_{i \neq j} k(y_i, y_j)$$
Applications: two-sample testing (deciding whether two samples come from the same distribution), training implicit generative models (MMD-GANs), domain adaptation, and detecting distribution shift.
```python
import numpy as np
from scipy.spatial.distance import cdist

def rbf_kernel(X, Y, gamma=1.0):
    """RBF kernel between points."""
    sq_dists = cdist(X, Y, metric='sqeuclidean')
    return np.exp(-gamma * sq_dists)

def mmd_squared(X, Y, kernel_fn):
    """
    Compute squared MMD between two sets of samples.

    Parameters:
    -----------
    X : array of shape (n, d), samples from P
    Y : array of shape (m, d), samples from Q
    kernel_fn : callable, kernel function

    Returns:
    --------
    mmd2 : unbiased estimate of MMD²
    """
    n, m = len(X), len(Y)
    Kxx = kernel_fn(X, X)
    Kyy = kernel_fn(Y, Y)
    Kxy = kernel_fn(X, Y)

    # Unbiased estimator (exclude diagonals)
    np.fill_diagonal(Kxx, 0)
    np.fill_diagonal(Kyy, 0)

    term1 = Kxx.sum() / (n * (n - 1))
    term2 = Kyy.sum() / (m * (m - 1))
    term3 = 2 * Kxy.sum() / (n * m)

    return term1 + term2 - term3

def set_mean_kernel(X, Y, base_kernel):
    """
    Mean kernel between two sets.
    K(X, Y) = (1/|X||Y|) sum_{x,y} k(x, y)
    """
    K = base_kernel(X, Y)
    return K.mean()

# Example: Compare two distributions
np.random.seed(42)
P_samples = np.random.randn(100, 2) * 1.0 + np.array([0, 0])
Q_samples = np.random.randn(100, 2) * 1.0 + np.array([0.5, 0.5])
R_samples = np.random.randn(100, 2) * 1.0 + np.array([3, 3])

kernel = lambda X, Y: rbf_kernel(X, Y, gamma=0.5)
print(f"MMD²(P, Q) = {mmd_squared(P_samples, Q_samples, kernel):.4f}")  # Small
print(f"MMD²(P, R) = {mmd_squared(P_samples, R_samples, kernel):.4f}")  # Large
```

Before deep learning dominated computer vision, kernel-based methods with specialized image kernels achieved state-of-the-art results. Understanding these kernels remains valuable for interpretability, small-data regimes, and hybrid approaches.
Challenges in Image Comparison: pixel-level comparisons are highly sensitive to translation and viewpoint, images contain variable numbers of local features, and global histogram representations discard spatial layout.
Spatial Pyramid Match Kernel (Lazebnik et al., 2006):
The spatial pyramid match (SPM) kernel extends bag-of-words to capture spatial layout:
$$k(I, I') = \mathcal{I}^L(I, I') + \sum_{l=0}^{L-1} \frac{1}{2^{L-l}} \left[ \mathcal{I}^l(I, I') - \mathcal{I}^{l+1}(I, I') \right]$$
where $\mathcal{I}^l$ is the histogram intersection at level $l$. Finer matches (higher $l$) are weighted more than coarse matches.
Bag-of-words ignores spatial structure ('sky' features could be at bottom of image). Spatial pyramids enforce that matching features should occur in similar regions. Level 0 is pure BoW; higher levels require tighter spatial correspondence.
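The matching scheme above can be sketched in a few lines, assuming per-cell codeword histograms have already been computed from quantized local features (the `cells` data layout and function names here are illustrative, not from the original paper's code):

```python
import numpy as np

def histogram_intersection(h1, h2):
    """Histogram intersection: sum of elementwise minima."""
    return np.minimum(h1, h2).sum()

def spatial_pyramid_kernel(cells1, cells2, L):
    """
    cells{1,2}: dict mapping level l -> array of per-cell histograms,
    shape (4**l, n_bins). Level 0 is the whole image (plain bag of words).
    """
    # Matches found at each pyramid level (summed over corresponding cells)
    I = [sum(histogram_intersection(a, b)
             for a, b in zip(cells1[l], cells2[l]))
         for l in range(L + 1)]
    # Finest level counts in full; new matches at coarser levels are
    # down-weighted by 1/2^(L-l)
    k = I[L]
    for l in range(L):
        k += (I[l] - I[l + 1]) / 2 ** (L - l)
    return k
```

For identical inputs, every level matches perfectly and the kernel equals the total feature count.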
Pyramid Match Kernel (Grauman & Darrell, 2005):
The pyramid match kernel (PMK) handles sets of features by multi-resolution histogram matching:
$$k(I, I') = \sum_{i=0}^{L} w_i \left( \mathcal{I}^i - \mathcal{I}^{i+1} \right)$$
where $\mathcal{I}^i$ is the histogram intersection at resolution level $i$ and the weights $w_i$ favor matches found at finer resolutions.
Convolution Kernels for Images:
Direct pixel-based kernels rarely work due to sensitivity to translation. The convolution kernel addresses this:
$$k(I, I') = \sum_{\tau} k_{\text{base}}(I_{\text{patch}}(\tau), I'_{\text{patch}}(\tau))$$
summing base kernel evaluations over all spatial positions $\tau$.
| Kernel | Key Idea | Invariances | Era/Status |
|---|---|---|---|
| Histogram Intersection | Min of histogram bins | None (global color) | Classic, simple baseline |
| Pyramid Match | Multi-resolution matching | Order-invariant | Pre-deep learning SOTA |
| Spatial Pyramid | Localized BoW matching | Partial translation | Pre-deep learning SOTA |
| Chi-squared kernel | χ² distance on histograms | Histogram-based | Still used with handcrafted features |
| Deep kernel | Neural network embeddings | Learned | Current state-of-art |
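The χ² kernel from the table is straightforward for a pair of nonnegative histograms; this sketch adds a small `eps` (my addition) to guard against empty bins:

```python
import numpy as np

def chi2_kernel(h1, h2, gamma=1.0, eps=1e-10):
    """
    Exponentiated chi-squared kernel: exp(-gamma * d_chi2), where
    d_chi2(h, h') = sum_i (h_i - h'_i)^2 / (h_i + h'_i).
    """
    d = np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))
    return np.exp(-gamma * d)
```

Identical histograms give distance 0 and hence kernel value 1; increasingly different histograms decay toward 0.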
Time series data presents unique challenges for kernel design: variable length, temporal dependencies, and phase shifts (similar patterns occurring at different times).
Dynamic Time Warping (DTW) Kernel:
DTW finds the optimal alignment between two time series by warping the time axis. While DTW distance is not a valid kernel (violates positive definiteness), several kernelized versions exist:
$$k_{\text{DTW}}(\mathbf{x}, \mathbf{y}) = \exp\left(-\gamma \cdot \text{DTW}(\mathbf{x}, \mathbf{y})\right)$$
Warning: This is NOT guaranteed positive definite. In practice, it often works, but theoretically it's a pseudo-kernel.
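For reference, the pseudo-kernel can be sketched on top of a textbook DTW recursion (squared local cost, no band constraint; production code would use a constrained, vectorized variant):

```python
import numpy as np

def dtw_distance(x, y):
    """Classic DTW distance via dynamic programming."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            # Extend the cheapest of the three predecessor alignments
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def dtw_pseudo_kernel(x, y, gamma=0.1):
    """exp(-gamma * DTW): NOT guaranteed positive definite."""
    return np.exp(-gamma * dtw_distance(x, y))
```

A series always aligns perfectly with itself, so the pseudo-kernel's diagonal is exactly 1.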
Global Alignment Kernel (Cuturi, 2007):
A proper positive-definite DTW-like kernel:
$$k_{\text{GA}}(\mathbf{x}, \mathbf{y}) = \sum_{\pi \in \mathcal{A}(|\mathbf{x}|, |\mathbf{y}|)} \prod_{(i,j) \in \pi} k_{\text{local}}(x_i, y_j)$$
summing over all valid alignments $\pi$ in the alignment space $\mathcal{A}$.
Signature Kernels (Kiraly & Oberhauser, 2019):
The path signature is a mathematical object that uniquely characterizes continuous paths (modulo tree-like equivalence). The signature kernel measures similarity between paths via their signatures:
$$k(\mathbf{x}, \mathbf{y}) = \langle S(\mathbf{x}), S(\mathbf{y}) \rangle$$
where $S(\mathbf{x})$ is the truncated signature (a collection of iterated integrals).
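A minimal depth-2 truncation can be computed with plain NumPy (a sketch; libraries such as `iisignature` or `signatory` compute higher depths efficiently). The level-2 term below is the exact iterated integral of the piecewise-linear interpolation of the sampled path:

```python
import numpy as np

def signature_level2(path):
    """
    Truncated signature (depth 2) of a path sampled as an (n, d) array.
    Level 1: total increment x_T - x_0.
    Level 2: S^{ij} = integral of (x_i - x_i(0)) dx_j, exact for the
    piecewise-linear path through the samples.
    """
    path = np.asarray(path, dtype=float)
    dx = np.diff(path, axis=0)                 # increments, shape (n-1, d)
    s1 = dx.sum(axis=0)                        # level-1 terms
    left = path[:-1] - path[0]                 # x(t_k) - x(0) at left endpoints
    s2 = left.T @ dx + 0.5 * (dx.T @ dx)       # (d, d) iterated integrals
    return np.concatenate([s1, s2.ravel()])

def signature_kernel(p1, p2):
    """Inner product of truncated signatures."""
    return signature_level2(p1) @ signature_level2(p2)
```

Because the signature of a straight line depends only on its endpoints, inserting extra samples along the same line leaves the depth-2 signature unchanged.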
Properties: the signature is invariant to time reparameterization, its terms capture the order in which the path's coordinates move, and truncating at a fixed depth yields a finite-dimensional feature vector whose fidelity improves with depth.
Fourier/Spectral Kernels for Time Series:
Compare time series via their frequency content:
$$k(\mathbf{x}, \mathbf{y}) = k_{\text{base}}\left(|\mathcal{F}(\mathbf{x})|, |\mathcal{F}(\mathbf{y})|\right)$$
where $\mathcal{F}$ is the discrete Fourier transform. This is phase-invariant but loses temporal localization.
```python
import numpy as np
from scipy.spatial.distance import cdist

def global_alignment_kernel(x, y, sigma=1.0, triangular=False):
    """
    Global Alignment Kernel for time series.
    Uses dynamic programming to sum over all alignments.
    """
    n, m = len(x), len(y)

    # Local kernel matrix
    K_local = np.exp(-cdist(x.reshape(-1, 1), y.reshape(-1, 1),
                            'sqeuclidean') / (2 * sigma**2))

    # DP table for alignment sums
    GA = np.zeros((n + 1, m + 1))
    GA[0, 0] = 1  # Base case

    # Fill DP table
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Sum of paths coming from (i-1,j), (i,j-1), (i-1,j-1)
            GA[i, j] = (GA[i-1, j-1] + GA[i-1, j] + GA[i, j-1]) * K_local[i-1, j-1]
            if triangular:
                # Sakoe-Chiba band constraint for efficiency
                if abs(i - j) > max(n, m) // 4:
                    GA[i, j] = 0

    return GA[n, m]

def spectral_kernel(x, y, gamma=1.0):
    """
    Compare time series via magnitude spectra.
    Phase-invariant comparison.
    """
    # Compute magnitude of DFT
    X_mag = np.abs(np.fft.fft(x))
    Y_mag = np.abs(np.fft.fft(y))

    # RBF on magnitude spectra
    dist = np.sum((X_mag - Y_mag)**2)
    return np.exp(-gamma * dist)

# Example
x = np.sin(np.linspace(0, 4*np.pi, 100))
y = np.sin(np.linspace(0.5, 4.5*np.pi, 100))  # Phase shifted
z = np.cos(np.linspace(0, 4*np.pi, 100))      # Different frequency

print(f"GA Kernel (x, y): {global_alignment_kernel(x, y):.4e}")
print(f"GA Kernel (x, z): {global_alignment_kernel(x, z):.4e}")
```

Designing domain-specific kernels follows a principled approach: encode the invariances and similarities that matter for your domain while maintaining positive definiteness.
You now understand how to design kernels for structured data types beyond vectors. Next, we'll explore Multiple Kernel Learning—combining multiple kernels to leverage complementary information sources.