Standard kernels like RBF and Matérn operate on Euclidean vector spaces, where each input is a fixed-length vector of real numbers. However, many real-world data types don't naturally fit this representation: variable-length strings, graphs, sets and distributions, images, and time series.
Domain-specific kernels enable kernel methods to operate directly on these structured data types, without requiring hand-crafted feature engineering.
This page covers kernel functions for strings (spectrum, subsequence), graphs (random walk, Weisfeiler-Lehman), sets and distributions (set kernels, MMD), and images (pyramid match, spatial). You'll understand the mathematical principles that enable kernels over structured data.
String kernels measure similarity between sequences of symbols. They are fundamental in computational biology (DNA/protein analysis), natural language processing (text classification), and cybersecurity (malware detection from code patterns).
The Key Insight:
Strings can be mapped to feature vectors by counting occurrences of substrings, characters, or patterns. The kernel computes the inner product in this (potentially infinite-dimensional) feature space without explicit construction.
Spectrum Kernel (Leslie et al., 2002):
The k-spectrum kernel represents a string by the frequency of all contiguous substrings of length $k$:
$$k(s, t) = \langle \Phi_k(s), \Phi_k(t) \rangle = \sum_{u \in \Sigma^k} \#_u(s) \cdot \#_u(t)$$
where $\Sigma$ is the alphabet, $\Sigma^k$ is the set of all $k$-mers, and $\#_u(s)$ counts occurrences of substring $u$ in string $s$.
Example: For $k=3$ and strings $s=\text{"GATTACA"}$, $t=\text{"ATTGAT"}$: the 3-mers of $s$ are GAT, ATT, TTA, TAC, ACA, and those of $t$ are ATT, TTG, TGA, GAT. The shared 3-mers GAT and ATT each occur once in both strings, so $k(s, t) = 1 \cdot 1 + 1 \cdot 1 = 2$.
Naively computing the spectrum kernel requires O(|Σ|^k) space. Suffix trees enable O(n + m) computation where n, m are string lengths. This makes spectrum kernels practical for long sequences like genomes.
Subsequence Kernel (Lodhi et al., 2002):
Unlike the spectrum kernel which requires contiguous matches, the subsequence kernel allows gaps, weighted by a decay factor:
$$k(s, t) = \sum_{u \in \Sigma^k} \sum_{\mathbf{i}: s[\mathbf{i}]=u} \sum_{\mathbf{j}: t[\mathbf{j}]=u} \lambda^{l(\mathbf{i}) + l(\mathbf{j})}$$
where $\mathbf{i}$ is a tuple of indices, $s[\mathbf{i}]$ extracts the subsequence at those positions, and $l(\mathbf{i})$ is the "spread" of the indices (last - first + 1). The decay $\lambda \in (0,1)$ penalizes non-contiguous matches.
Intuition: Two strings are similar if they share common subsequences, with contiguous matches worth more than scattered ones.
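The recursion of Lodhi et al. can be illustrated with a naive memoized sketch (the optimized dynamic program runs in $O(k|s||t|)$; this direct transcription is slower but easier to follow, and the helper names `Kprime`/`K` are mine):

```python
from functools import lru_cache

def subsequence_kernel(s: str, t: str, k: int = 2, lam: float = 0.5) -> float:
    """Gap-weighted subsequence kernel (naive memoized recursion)."""

    @lru_cache(maxsize=None)
    def Kprime(i, m, n):
        # Auxiliary term over prefixes s[:m], t[:n]: counts length-i
        # subsequences, penalizing distance to the ends of both prefixes.
        if i == 0:
            return 1.0
        if m < i or n < i:
            return 0.0
        a = s[m - 1]
        total = lam * Kprime(i, m - 1, n)
        for j in range(n):
            if t[j] == a:
                total += Kprime(i - 1, m - 1, j) * lam ** (n - j + 1)
        return total

    @lru_cache(maxsize=None)
    def K(m, n):
        # The kernel itself over prefixes s[:m], t[:n]
        if m < k or n < k:
            return 0.0
        a = s[m - 1]
        total = K(m - 1, n)
        for j in range(n):
            if t[j] == a:
                total += Kprime(k - 1, m - 1, j) * lam ** 2
        return total

    return K(len(s), len(t))
```

For `s="cat"`, `t="car"`, `k=2`, the only shared length-2 subsequence is `"ca"`, contributing $\lambda^{2+2}$; with $\lambda = 0.5$ the kernel value is $0.0625$.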
Mismatch Kernel:
The mismatch kernel extends spectrum kernels to allow up to $m$ mismatches:
$$k(s, t) = \sum_{u \in \Sigma^k} \sum_{v \in N_m(u)} \mathbf{1}[u \sqsubset s] \cdot \mathbf{1}[v \sqsubset t]$$
where $N_m(u)$ is the neighborhood of $u$ (all strings within Hamming distance $m$).
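A brute-force sketch of this feature map (the function names are illustrative; practical implementations use mismatch-tree data structures instead of enumerating $\Sigma^k$):

```python
from collections import Counter
from itertools import product

def neighborhood(v: str, m: int, alphabet: str):
    """All strings over `alphabet` within Hamming distance m of v (brute force)."""
    return [''.join(u) for u in product(alphabet, repeat=len(v))
            if sum(a != b for a, b in zip(u, v)) <= m]

def mismatch_features(s: str, k: int, m: int, alphabet: str) -> Counter:
    """phi_u(s) = number of k-mers of s whose mismatch neighborhood contains u."""
    feats = Counter()
    for i in range(len(s) - k + 1):
        feats.update(neighborhood(s[i:i+k], m, alphabet))
    return feats

def mismatch_kernel(s: str, t: str, k: int = 3, m: int = 1,
                    alphabet: str = "ACGT") -> float:
    fs = mismatch_features(s, k, m, alphabet)
    ft = mismatch_features(t, k, m, alphabet)
    return sum(fs[u] * ft[u] for u in fs if u in ft)
```

With `m=0` the neighborhood contains only the k-mer itself, and the kernel reduces to the spectrum kernel.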
| Kernel | Matches | Gap Penalty | Complexity | Use Case |
|---|---|---|---|---|
| Spectrum | Exact contiguous k-mers | None (exact only) | O(n + m) | Protein classification, spam detection |
| Mismatch | Approximate (≤m mismatches) | None | O(n·k·m·|Σ|^m) | Mutation-tolerant bio-sequences |
| Subsequence | Non-contiguous | λ decay | O(n·m·k) | Text classification, semantic similarity |
| Gap-weighted | Non-contiguous with specific gaps | Per-gap λ | O(n·m·k·g) | Structured sequence patterns |
```python
from collections import Counter
import numpy as np

def spectrum_kernel(s: str, t: str, k: int = 3) -> float:
    """
    Compute the k-spectrum kernel between two strings.
    K(s, t) = <phi(s), phi(t)> where phi maps to k-mer counts.
    """
    def get_kmers(string, k):
        return [string[i:i+k] for i in range(len(string) - k + 1)]

    kmers_s = Counter(get_kmers(s, k))
    kmers_t = Counter(get_kmers(t, k))

    # Inner product of count vectors
    return sum(kmers_s[kmer] * kmers_t[kmer]
               for kmer in kmers_s if kmer in kmers_t)

def normalized_spectrum_kernel(s: str, t: str, k: int = 3) -> float:
    """
    Normalized spectrum kernel: K(s,t) / sqrt(K(s,s) * K(t,t))
    Produces values in [0, 1].
    """
    k_st = spectrum_kernel(s, t, k)
    k_ss = spectrum_kernel(s, s, k)
    k_tt = spectrum_kernel(t, t, k)
    if k_ss == 0 or k_tt == 0:
        return 0.0
    return k_st / np.sqrt(k_ss * k_tt)

# Example usage
s1 = "GATTACACGATGCATGCAT"
s2 = "GATCATGCATGCATTACA"
s3 = "AAAAAAAAAAAAAAAAAAA"

print(f"K(s1, s2) = {spectrum_kernel(s1, s2, k=4)}")  # High similarity
print(f"K(s1, s3) = {spectrum_kernel(s1, s3, k=4)}")  # Low similarity
print(f"Normalized K(s1, s2) = {normalized_spectrum_kernel(s1, s2, k=4):.4f}")
```

Graph kernels measure similarity between graphs, enabling kernel methods for molecular property prediction, social network analysis, and program analysis. The key challenge is that graphs have variable size and no canonical node ordering.
The Graph Isomorphism Challenge:
Determining whether two graphs are identical (isomorphic) is computationally hard (GI-complete). Practical graph kernels compute approximate similarity based on structural features that are efficient to compute.
Random Walk Kernel (Gärtner et al., 2003):
The random walk kernel counts matching walks between two graphs:
$$k(G, G') = \sum_{l=0}^{\infty} \mu_l \sum_{p \in \text{walks}_l(G)} \sum_{p' \in \text{walks}_l(G')} k_{\text{walk}}(p, p')$$
where $\text{walks}_l(G)$ is the set of walks of length $l$ in $G$, and $k_{\text{walk}}$ compares node/edge labels along walks.
Efficient Computation: Using the direct product graph $G \times G'$ (vertices are pairs of vertices, edges connect pairs if both components are connected), the kernel reduces to counting walks in $G \times G'$:
$$k(G, G') = \mathbf{1}^\top (I - \lambda A_{\times})^{-1} \mathbf{1}$$
where $A_{\times}$ is the adjacency matrix of the product graph and $\lambda < 1/\rho(A_{\times})$ for convergence.
Random walk kernels can suffer from 'tottering'—walks that immediately backtrack. This inflates similarity between graphs with high-degree hubs. Solutions include forbidding backtracking or using shortest-path kernels instead.
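For unlabeled, undirected graphs the product-graph formula reduces to a few lines of linear algebra. This is only a sketch: labeled graphs need a label-aware product graph, and scalable implementations avoid materializing the Kronecker product.

```python
import numpy as np

def random_walk_kernel(A1, A2, lam=0.05):
    """
    Geometric random walk kernel k(G, G') = 1^T (I - lam * A_x)^{-1} 1,
    where A_x = A1 (x) A2 is the adjacency matrix of the direct product
    graph (a Kronecker product, for unlabeled graphs).
    """
    Ax = np.kron(A1, A2)
    N = Ax.shape[0]
    # Spectral radius check (eigvalsh assumes symmetric/undirected adjacency)
    rho = np.max(np.abs(np.linalg.eigvalsh(Ax)))
    if lam * rho >= 1:
        raise ValueError("lam must be < 1/rho(A_x) for the series to converge")
    ones = np.ones(N)
    # Solve (I - lam * A_x) v = 1 instead of inverting explicitly
    return ones @ np.linalg.solve(np.eye(N) - lam * Ax, ones)
```

At `lam=0` the value is simply $|V| \cdot |V'|$; positive `lam` adds mass for every matching walk, so connected graphs score strictly higher.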
Weisfeiler-Lehman (WL) Subtree Kernel (Shervashidze et al., 2011):
The WL kernel uses iterative neighborhood aggregation to create node labels:
$$k(G, G') = \sum_{h=0}^{H} k_{\text{base}}(G^{(h)}, G'^{(h)})$$
where $G^{(h)}$ is graph $G$ with labels from iteration $h$, and $k_{\text{base}}$ is typically a label histogram comparison.
Complexity: $O(Hm)$ per graph where $m$ is the number of edges. Linear in graph size—highly scalable.
| Kernel | Feature Type | Complexity | Distinguishing Power |
|---|---|---|---|
| Random Walk | Matching walks | O(n³) naively, O(n²m) optimized | Weak (tottering issues) |
| Shortest Path | Shortest path lengths + labels | O(n² log n + n²d) | Moderate |
| Weisfeiler-Lehman | Subtree patterns | O(Hm) | Strong (1-WL equivalent) |
| Graphlet | Small subgraph counts | O(n^k) for k-graphlets | Strong but expensive |
| Message Passing (neural) | Learned representations | O(Hm) | Flexible, data-dependent |
```python
import numpy as np
from collections import Counter
import hashlib

def wl_subtree_kernel(graphs, h=3):
    """
    Weisfeiler-Lehman subtree kernel.

    Parameters:
    -----------
    graphs : list of (adjacency_list, node_labels)
        node_labels maps each node to a string label.
    h : number of WL iterations

    Returns:
    --------
    K : kernel matrix between all input graphs
    """
    n = len(graphs)
    all_features = []

    for adj_list, labels in graphs:
        features = Counter()
        current_labels = labels.copy()

        for iteration in range(h + 1):
            # Count current labels
            features.update(current_labels.values())

            if iteration < h:
                # WL relabeling
                new_labels = {}
                for node in adj_list:
                    # Aggregate neighbor labels
                    neighbor_labels = sorted(current_labels[nb]
                                             for nb in adj_list[node])
                    # Hash to create new compressed label
                    signature = f"{current_labels[node]}_{'_'.join(neighbor_labels)}"
                    new_labels[node] = hashlib.md5(
                        signature.encode()).hexdigest()[:8]
                current_labels = new_labels

        all_features.append(features)

    # Compute kernel matrix (histogram intersection or linear)
    all_labels = set()
    for f in all_features:
        all_labels.update(f.keys())

    # Vectorize label histograms
    X = np.zeros((n, len(all_labels)))
    label_to_idx = {l: i for i, l in enumerate(all_labels)}
    for i, f in enumerate(all_features):
        for label, count in f.items():
            X[i, label_to_idx[label]] = count

    # Linear kernel
    return X @ X.T
```

Many real-world objects are naturally represented as sets (unordered collections) or probability distributions. Examples include point clouds, bags of local image features, documents treated as sets of words, and samples drawn from an underlying distribution.
Set kernels must be permutation-invariant: the kernel value should not depend on the order of elements.
Set Kernels:
Sum Kernel (naive): $$k(X, Y) = \sum_{x \in X} \sum_{y \in Y} k_{\text{base}}(x, y)$$
This counts total similarity across all pairs but ignores set structure.
Mean Kernel: $$k(X, Y) = \frac{1}{|X||Y|} \sum_{x \in X} \sum_{y \in Y} k_{\text{base}}(x, y)$$
Equivalent to kernel between mean embeddings in the RKHS.
Match Kernel: $$k(X, Y) = \sum_{x \in X} \max_{y \in Y} k_{\text{base}}(x, y)$$
Each element in $X$ matches to its best counterpart in $Y$. Note: not symmetric as written.
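A common fix is to average the two matching directions. This sketch does exactly that; note that even the symmetrized version is not guaranteed positive definite.

```python
import numpy as np

def match_kernel(X, Y, base_kernel):
    """
    Symmetrized best-match set kernel:
    0.5 * [ sum_x max_y k(x, y) + sum_y max_x k(x, y) ].
    """
    K = base_kernel(X, Y)  # (|X|, |Y|) Gram matrix between set elements
    return 0.5 * (K.max(axis=1).sum() + K.max(axis=0).sum())
```

Swapping the arguments transposes the Gram matrix, so the averaged score is the same in both directions.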
The mean embedding $\mu_P = \mathbb{E}_{x \sim P}[\phi(x)]$ maps a distribution $P$ to a point in the RKHS. For characteristic kernels (such as the RBF kernel), this embedding is injective: different distributions map to different points. This enables powerful distribution comparison.
Maximum Mean Discrepancy (MMD):
The MMD measures the distance between two distributions in RKHS:
$$\text{MMD}^2(P, Q) = \|\mu_P - \mu_Q\|_{\mathcal{H}}^2$$
Expanding: $$\text{MMD}^2 = \mathbb{E}_{x,x' \sim P}[k(x,x')] - 2\,\mathbb{E}_{x \sim P,\, y \sim Q}[k(x,y)] + \mathbb{E}_{y,y' \sim Q}[k(y,y')]$$
Given samples $\{x_i\}_{i=1}^n \sim P$ and $\{y_j\}_{j=1}^m \sim Q$, the unbiased estimator is:
$$\widehat{\text{MMD}}^2 = \frac{1}{n(n-1)} \sum_{i \neq j} k(x_i, x_j) - \frac{2}{nm} \sum_{i,j} k(x_i, y_j) + \frac{1}{m(m-1)} \sum_{i \neq j} k(y_i, y_j)$$
Applications: two-sample testing (deciding whether two samples come from the same distribution), training implicit generative models (MMD-GANs), domain adaptation, and detecting distribution shift.
```python
import numpy as np
from scipy.spatial.distance import cdist

def rbf_kernel(X, Y, gamma=1.0):
    """RBF kernel between points."""
    sq_dists = cdist(X, Y, metric='sqeuclidean')
    return np.exp(-gamma * sq_dists)

def mmd_squared(X, Y, kernel_fn):
    """
    Compute squared MMD between two sets of samples.

    Parameters:
    -----------
    X : array of shape (n, d), samples from P
    Y : array of shape (m, d), samples from Q
    kernel_fn : callable, kernel function

    Returns:
    --------
    mmd2 : unbiased estimate of MMD²
    """
    n, m = len(X), len(Y)
    Kxx = kernel_fn(X, X)
    Kyy = kernel_fn(Y, Y)
    Kxy = kernel_fn(X, Y)

    # Unbiased estimator (exclude diagonals)
    np.fill_diagonal(Kxx, 0)
    np.fill_diagonal(Kyy, 0)

    term1 = Kxx.sum() / (n * (n - 1))
    term2 = Kyy.sum() / (m * (m - 1))
    term3 = 2 * Kxy.sum() / (n * m)

    return term1 + term2 - term3

def set_mean_kernel(X, Y, base_kernel):
    """
    Mean kernel between two sets.
    K(X, Y) = (1/|X||Y|) sum_{x,y} k(x, y)
    """
    K = base_kernel(X, Y)
    return K.mean()

# Example: Compare two distributions
np.random.seed(42)
P_samples = np.random.randn(100, 2) * 1.0 + np.array([0, 0])
Q_samples = np.random.randn(100, 2) * 1.0 + np.array([0.5, 0.5])
R_samples = np.random.randn(100, 2) * 1.0 + np.array([3, 3])

kernel = lambda X, Y: rbf_kernel(X, Y, gamma=0.5)
print(f"MMD²(P, Q) = {mmd_squared(P_samples, Q_samples, kernel):.4f}")  # Small
print(f"MMD²(P, R) = {mmd_squared(P_samples, R_samples, kernel):.4f}")  # Large
```

Before deep learning dominated computer vision, kernel-based methods with specialized image kernels achieved state-of-the-art results. Understanding these kernels remains valuable for interpretability, small-data regimes, and hybrid approaches.
Challenges in Image Comparison: pixel-level comparisons are highly sensitive to translation and viewpoint, images contain variable numbers of local features, and global histogram representations discard spatial layout.
Spatial Pyramid Match Kernel (Lazebnik et al., 2006):
The spatial pyramid match (SPM) kernel extends bag-of-words to capture spatial layout:
$$k(I, I') = \mathcal{I}^L(I, I') + \sum_{l=0}^{L-1} \frac{1}{2^{L-l}} \left[ \mathcal{I}^l(I, I') - \mathcal{I}^{l+1}(I, I') \right]$$
where $\mathcal{I}^l$ is the histogram intersection at level $l$. Finer matches (higher $l$) are weighted more than coarse matches.
Bag-of-words ignores spatial structure ('sky' features could be at bottom of image). Spatial pyramids enforce that matching features should occur in similar regions. Level 0 is pure BoW; higher levels require tighter spatial correspondence.
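The matching scheme above can be sketched in a few lines, assuming per-cell codeword histograms have already been computed from quantized local features (the `cells` data layout and function names here are illustrative, not from the original paper's code):

```python
import numpy as np

def histogram_intersection(h1, h2):
    """Histogram intersection: sum of elementwise minima."""
    return np.minimum(h1, h2).sum()

def spatial_pyramid_kernel(cells1, cells2, L):
    """
    cells{1,2}: dict mapping level l -> array of per-cell histograms,
    shape (4**l, n_bins). Level 0 is the whole image (plain bag of words).
    """
    # Matches found at each pyramid level (summed over corresponding cells)
    I = [sum(histogram_intersection(a, b)
             for a, b in zip(cells1[l], cells2[l]))
         for l in range(L + 1)]
    # Finest level counts in full; new matches at coarser levels are
    # down-weighted by 1/2^(L-l)
    k = I[L]
    for l in range(L):
        k += (I[l] - I[l + 1]) / 2 ** (L - l)
    return k
```

For identical inputs, every level matches perfectly and the kernel equals the total feature count.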
Pyramid Match Kernel (Grauman & Darrell, 2005):
The pyramid match kernel (PMK) handles sets of features by multi-resolution histogram matching:
$$k(I, I') = \sum_{i=0}^{L} w_i \left( \mathcal{I}^i - \mathcal{I}^{i+1} \right)$$
where $\mathcal{I}^i$ is the histogram intersection at resolution level $i$ and the weights $w_i$ favor matches found at finer resolutions.
Convolution Kernels for Images:
Direct pixel-based kernels rarely work due to sensitivity to translation. The convolution kernel addresses this:
$$k(I, I') = \sum_{\tau} k_{\text{base}}(I_{\text{patch}}(\tau), I'_{\text{patch}}(\tau))$$
summing base kernel evaluations over all spatial positions $\tau$.
| Kernel | Key Idea | Invariances | Era/Status |
|---|---|---|---|
| Histogram Intersection | Min of histogram bins | None (global color) | Classic, simple baseline |
| Pyramid Match | Multi-resolution matching | Order-invariant | Pre-deep learning SOTA |
| Spatial Pyramid | Localized BoW matching | Partial translation | Pre-deep learning SOTA |
| Chi-squared kernel | χ² distance on histograms | Histogram-based | Still used with handcrafted features |
| Deep kernel | Neural network embeddings | Learned | Current state-of-art |
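The χ² kernel from the table is straightforward for a pair of nonnegative histograms; this sketch adds a small `eps` (my addition) to guard against empty bins:

```python
import numpy as np

def chi2_kernel(h1, h2, gamma=1.0, eps=1e-10):
    """
    Exponentiated chi-squared kernel: exp(-gamma * d_chi2), where
    d_chi2(h, h') = sum_i (h_i - h'_i)^2 / (h_i + h'_i).
    """
    d = np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))
    return np.exp(-gamma * d)
```

Identical histograms give distance 0 and hence kernel value 1; increasingly different histograms decay toward 0.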
Time series data presents unique challenges for kernel design: variable length, temporal dependencies, and phase shifts (similar patterns occurring at different times).
Dynamic Time Warping (DTW) Kernel:
DTW finds the optimal alignment between two time series by warping the time axis. While DTW distance is not a valid kernel (violates positive definiteness), several kernelized versions exist:
$$k_{\text{DTW}}(\mathbf{x}, \mathbf{y}) = \exp\left(-\gamma \cdot \text{DTW}(\mathbf{x}, \mathbf{y})\right)$$
Warning: This is NOT guaranteed positive definite. In practice, it often works, but theoretically it's a pseudo-kernel.
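For reference, the pseudo-kernel can be sketched on top of a textbook DTW recursion (squared local cost, no band constraint; production code would use a constrained, vectorized variant):

```python
import numpy as np

def dtw_distance(x, y):
    """Classic DTW distance via dynamic programming."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            # Extend the cheapest of the three predecessor alignments
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def dtw_pseudo_kernel(x, y, gamma=0.1):
    """exp(-gamma * DTW): NOT guaranteed positive definite."""
    return np.exp(-gamma * dtw_distance(x, y))
```

A series always aligns perfectly with itself, so the pseudo-kernel's diagonal is exactly 1.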
Global Alignment Kernel (Cuturi, 2007):
A proper positive-definite DTW-like kernel:
$$k_{\text{GA}}(\mathbf{x}, \mathbf{y}) = \sum_{\pi \in \mathcal{A}(|\mathbf{x}|, |\mathbf{y}|)} \prod_{(i,j) \in \pi} k_{\text{local}}(x_i, y_j)$$
summing over all valid alignments $\pi$ in the alignment space $\mathcal{A}$.
Signature Kernels (Kiraly & Oberhauser, 2019):
The path signature is a mathematical object that uniquely characterizes continuous paths (modulo tree-like equivalence). The signature kernel measures similarity between paths via their signatures:
$$k(\mathbf{x}, \mathbf{y}) = \langle S(\mathbf{x}), S(\mathbf{y}) \rangle$$
where $S(\mathbf{x})$ is the truncated signature (a collection of iterated integrals).
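A minimal depth-2 truncation can be computed with plain NumPy (a sketch; libraries such as `iisignature` or `signatory` compute higher depths efficiently). The level-2 term below is the exact iterated integral of the piecewise-linear interpolation of the sampled path:

```python
import numpy as np

def signature_level2(path):
    """
    Truncated signature (depth 2) of a path sampled as an (n, d) array.
    Level 1: total increment x_T - x_0.
    Level 2: S^{ij} = integral of (x_i - x_i(0)) dx_j, exact for the
    piecewise-linear path through the samples.
    """
    path = np.asarray(path, dtype=float)
    dx = np.diff(path, axis=0)                 # increments, shape (n-1, d)
    s1 = dx.sum(axis=0)                        # level-1 terms
    left = path[:-1] - path[0]                 # x(t_k) - x(0) at left endpoints
    s2 = left.T @ dx + 0.5 * (dx.T @ dx)       # (d, d) iterated integrals
    return np.concatenate([s1, s2.ravel()])

def signature_kernel(p1, p2):
    """Inner product of truncated signatures."""
    return signature_level2(p1) @ signature_level2(p2)
```

Because the signature of a straight line depends only on its endpoints, inserting extra samples along the same line leaves the depth-2 signature unchanged.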
Properties: the signature is invariant to time reparameterization, its terms capture the order in which the path's coordinates move, and truncating at a fixed depth yields a finite-dimensional feature vector whose fidelity improves with depth.
Fourier/Spectral Kernels for Time Series:
Compare time series via their frequency content:
$$k(\mathbf{x}, \mathbf{y}) = k_{\text{base}}\left(|\mathcal{F}(\mathbf{x})|, |\mathcal{F}(\mathbf{y})|\right)$$
where $\mathcal{F}$ is the discrete Fourier transform. This is phase-invariant but loses temporal localization.
```python
import numpy as np
from scipy.spatial.distance import cdist

def global_alignment_kernel(x, y, sigma=1.0, triangular=False):
    """
    Global Alignment Kernel for time series.
    Uses dynamic programming to sum over all alignments.
    """
    n, m = len(x), len(y)

    # Local kernel matrix
    K_local = np.exp(-cdist(x.reshape(-1, 1), y.reshape(-1, 1),
                            'sqeuclidean') / (2 * sigma**2))

    # DP table for alignment sums
    GA = np.zeros((n + 1, m + 1))
    GA[0, 0] = 1  # Base case

    # Fill DP table
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Sum of paths coming from (i-1,j), (i,j-1), (i-1,j-1)
            GA[i, j] = (GA[i-1, j-1] + GA[i-1, j] + GA[i, j-1]) * K_local[i-1, j-1]
            if triangular:
                # Sakoe-Chiba band constraint for efficiency
                if abs(i - j) > max(n, m) // 4:
                    GA[i, j] = 0

    return GA[n, m]

def spectral_kernel(x, y, gamma=1.0):
    """
    Compare time series via magnitude spectra.
    Phase-invariant comparison.
    """
    # Compute magnitude of DFT
    X_mag = np.abs(np.fft.fft(x))
    Y_mag = np.abs(np.fft.fft(y))

    # RBF on magnitude spectra
    dist = np.sum((X_mag - Y_mag)**2)
    return np.exp(-gamma * dist)

# Example
x = np.sin(np.linspace(0, 4*np.pi, 100))
y = np.sin(np.linspace(0.5, 4.5*np.pi, 100))  # Phase shifted
z = np.cos(np.linspace(0, 4*np.pi, 100))      # Different frequency

print(f"GA Kernel (x, y): {global_alignment_kernel(x, y):.4e}")
print(f"GA Kernel (x, z): {global_alignment_kernel(x, z):.4e}")
```

Designing domain-specific kernels follows a principled approach: encode the invariances and similarities that matter for your domain while maintaining positive definiteness.
You now understand how to design kernels for structured data types beyond vectors. Next, we'll explore Multiple Kernel Learning—combining multiple kernels to leverage complementary information sources.