The kernel function is the core design choice in kernel SVMs—it determines the implicit feature space and thus the expressiveness of the learned classifier. While infinitely many valid kernels exist (and domain-specific kernels can be constructed for specialized applications), a small set of kernels handles the vast majority of practical problems.
In this page, we take a deep dive into the four most influential kernel functions: the linear, polynomial, RBF (Gaussian), and sigmoid kernels.
For each kernel, we'll examine its mathematical form, implicit feature space, hyperparameters, decision boundary characteristics, and practical considerations.
By completing this page, you will: (1) Know the mathematical form and feature space of each major kernel, (2) Understand how hyperparameters affect decision boundaries, (3) Recognize when to use each kernel type, and (4) Appreciate the tradeoffs between expressiveness and overfitting risk.
The simplest kernel is no kernel at all—or more precisely, the kernel that corresponds to the identity feature map.
$$k(\mathbf{x}, \mathbf{z}) = \mathbf{x}^T\mathbf{z} = \sum_{i=1}^d x_i z_i$$
The identity: $\phi(\mathbf{x}) = \mathbf{x}$
The feature space is simply the original input space $\mathbb{R}^d$. There's no transformation; the linear kernel is what we get if we run standard SVM without any kernel at all.
None. The linear kernel has no tunable parameters.
The decision boundary is a hyperplane in the original feature space: $$\mathbf{w}^T\mathbf{x} + b = 0$$
In 2D, this is a straight line. In 3D, a plane. In higher dimensions, a hyperplane.
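Because the feature map is the identity, the fitted model's weight vector lives directly in input space and can be inspected. Below is a minimal sketch (assuming scikit-learn's SVC and a synthetic dataset) that recovers the decision function by hand from $\mathbf{w}$ and $b$:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification

# Synthetic, nearly separable data (illustrative only)
X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           random_state=0)

svm = SVC(kernel='linear', C=1.0).fit(X, y)

# With a linear kernel the boundary is w^T x + b = 0 in the original space,
# so the decision function is just an affine function of the inputs.
w, b = svm.coef_.ravel(), svm.intercept_[0]
manual = X @ w + b

print(np.allclose(manual, svm.decision_function(X)))  # expected: True
print("w =", np.round(w, 3), " b =", round(b, 3))
```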
Linear kernels are surprisingly effective when: (1) The data is already high-dimensional (e.g., text with bag-of-words features), (2) The number of features exceeds the number of samples ($d > n$), (3) The problem is actually linearly separable or nearly so, (4) You need fast training and prediction.
| Advantages | Disadvantages |
|---|---|
| Fast training (primal solvers available) | Cannot capture nonlinear relationships |
| Fast prediction: $O(d)$ per point | May underfit complex data |
| No hyperparameter tuning needed | Decision boundary limited to hyperplane |
| Scales well to large datasets | Poor for problems with feature interactions |
| Interpretable weights | Assumes linearity in input space |
For linear kernels, specialized algorithms exist that are much faster than general kernel SVM: primal solvers such as coordinate descent (as in LIBLINEAR) and stochastic gradient methods (such as Pegasos) work directly with the weight vector $\mathbf{w}$ rather than the dual kernel expansion.
These methods avoid forming the $n \times n$ kernel matrix entirely, enabling training on millions of samples.
For very high-dimensional sparse data (e.g., text classification), linear SVM is often the first algorithm to try—it's fast, scales well, and frequently performs competitively with nonlinear methods.
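As a rough illustration of why these specialized solvers matter, the sketch below (assuming scikit-learn, where LinearSVC wraps the liblinear solver while SVC with kernel='linear' goes through libsvm and the kernel matrix) times both on the same synthetic problem. Exact timings depend on your machine and data.

```python
import time
from sklearn.svm import SVC, LinearSVC
from sklearn.datasets import make_classification

# Moderately sized synthetic problem (sizes are illustrative)
X, y = make_classification(n_samples=5000, n_features=300, random_state=0)

models = [
    ("LinearSVC (liblinear)", LinearSVC(C=1.0, dual=False, max_iter=5000)),
    ("SVC(kernel='linear')", SVC(kernel="linear", C=1.0)),
]

for name, model in models:
    t0 = time.perf_counter()
    model.fit(X, y)
    elapsed = time.perf_counter() - t0
    print(f"{name:25s} fit time: {elapsed:.2f}s, "
          f"train accuracy: {model.score(X, y):.3f}")
```

Note that LinearSVC optimizes a slightly different objective (squared hinge loss, regularized intercept), so the two models are close but not identical.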
The polynomial kernel explicitly models feature interactions up to a specified degree, enabling polynomial decision boundaries.
$$k(\mathbf{x}, \mathbf{z}) = (\gamma\, \mathbf{x}^T\mathbf{z} + c)^p$$
Alternative parameterization (common in sklearn): $$k(\mathbf{x}, \mathbf{z}) = (\gamma\, \mathbf{x}^T\mathbf{z} + r)^d$$
where:

- $\gamma > 0$ is a scaling factor applied to the inner product (gamma in sklearn)
- $c \geq 0$ is an offset that mixes in lower-order terms (coef0 in sklearn)
- $p$ is the polynomial degree (degree in sklearn)

The polynomial kernel corresponds to a feature space containing all monomials up to degree $p$.
Example: For $\mathbf{x} = (x_1, x_2)$ and degree $p = 2$ with $c = 1$: $$k(\mathbf{x}, \mathbf{z}) = (x_1 z_1 + x_2 z_2 + 1)^2$$
Expanding: $$= x_1^2 z_1^2 + x_2^2 z_2^2 + 2x_1 x_2 z_1 z_2 + 2x_1 z_1 + 2x_2 z_2 + 1$$
This corresponds to the feature map: $$\phi(\mathbf{x}) = (x_1^2, x_2^2, \sqrt{2}x_1 x_2, \sqrt{2}x_1, \sqrt{2}x_2, 1)$$
Six features from two original features!
For degree $p$ and original dimension $d$, the feature space dimension is $\binom{d+p}{p}$. For $d=100$ and $p=3$: 176,851 features. For $p=5$: 96,560,646 features! The kernel trick makes this tractable by computing the inner product directly without forming the features.
Degree $p$: sets the highest order of feature interaction the kernel can represent. $p = 1$ reduces to a (scaled, shifted) linear kernel; $p = 2$ or $3$ is most common; higher degrees give more flexible boundaries but increase overfitting risk and numerical instability.

Offset $c$: balances higher-order against lower-order terms. With $c = 0$ (the homogeneous kernel) only degree-$p$ monomials appear; with $c > 0$ all monomials up to degree $p$ are included.

Scaling $\gamma$: rescales the inner product before the power is taken. Its main practical role is keeping kernel values in a reasonable numerical range (large inner products raised to a high power can overflow); a common default is $\gamma = 1/d$.
Polynomial kernels produce polynomial decision boundaries: the boundary is a degree-$p$ algebraic surface in input space, so a degree-2 kernel can carve out conic sections (ellipses, parabolas, hyperbolas), while higher degrees allow increasingly curved shapes.
```python
import numpy as np
from math import comb

def polynomial_kernel(x: np.ndarray, z: np.ndarray, degree: int = 3,
                      gamma: float = None, coef0: float = 1.0) -> float:
    """
    Polynomial kernel: k(x,z) = (gamma * x^T z + coef0)^degree

    Parameters:
        x, z: Input vectors
        degree: Polynomial degree (default 3)
        gamma: Scaling factor (default 1/n_features)
        coef0: Offset parameter (default 1.0)
    """
    if gamma is None:
        gamma = 1.0 / len(x)
    return (gamma * np.dot(x, z) + coef0) ** degree

def explicit_degree2_features(x: np.ndarray, c: float = 1.0) -> np.ndarray:
    """
    Explicitly compute degree-2 polynomial features for 2D input.
    Demonstrates what the polynomial kernel implicitly computes.
    """
    x1, x2 = x
    sqrt_2 = np.sqrt(2)
    sqrt_2c = np.sqrt(2 * c)
    # Features corresponding to (x1*z1 + x2*z2 + c)^2:
    # phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2, sqrt(2c)*x1, sqrt(2c)*x2, c)
    return np.array([
        x1**2,             # produces the x1^2 * z1^2 term
        x2**2,             # produces the x2^2 * z2^2 term
        sqrt_2 * x1 * x2,  # produces the 2*x1*x2*z1*z2 cross term
        sqrt_2c * x1,      # produces the 2c*x1*z1 linear term
        sqrt_2c * x2,      # produces the 2c*x2*z2 linear term
        c,                 # produces the c^2 constant term
    ])

# Verify the kernel trick for the polynomial kernel
x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])
c = 1.0

# Direct kernel computation
kernel_value = polynomial_kernel(x, z, degree=2, gamma=1.0, coef0=c)

# Inner product of the explicit degree-2 feature vectors
feature_value = np.dot(explicit_degree2_features(x, c),
                       explicit_degree2_features(z, c))

# Algebraic expansion:
# (x1*z1 + x2*z2 + c)^2 =
#   x1^2*z1^2 + x2^2*z2^2 + c^2 + 2*x1*z1*x2*z2 + 2*c*x1*z1 + 2*c*x2*z2
expanded = (x[0]**2 * z[0]**2 + x[1]**2 * z[1]**2 + c**2
            + 2*x[0]*z[0]*x[1]*z[1] + 2*c*x[0]*z[0] + 2*c*x[1]*z[1])

print("Polynomial kernel verification:")
print(f"  Direct:   (x·z + c)^2 = ({np.dot(x, z)} + {c})^2 = {kernel_value}")
print(f"  Features: phi(x)·phi(z) = {feature_value}")
print(f"  Expanded: {expanded}")
print(f"  Match: {np.isclose(kernel_value, expanded) and np.isclose(kernel_value, feature_value)}")

# Feature space dimension growth
print("\nFeature space dimensions for polynomial kernels:")
print("-" * 50)
for d in [2, 10, 50, 100]:
    for p in [2, 3, 5]:
        dim = comb(d + p, p)
        print(f"  Original d={d:3d}, degree p={p}: {dim:>15,} features")

# Computational comparison
print("\nComputational comparison (d=100, degree 5):")
d, p = 100, 5
kernel_ops = d + 1  # dot product + power
explicit_dim = comb(d + p, p)
print(f"  Kernel computation: O({kernel_ops}) operations")
print(f"  Explicit features:  O({explicit_dim:,}) operations to compute")
print(f"  Speedup factor:     {explicit_dim / kernel_ops:,.0f}x")
```

| Advantages | Disadvantages |
|---|---|
| Captures feature interactions naturally | Degree selection can be difficult |
| Smooth, interpretable decision boundaries | Higher degrees risk overfitting |
| Works well for normalized features | Computationally slower than linear |
| Good for image recognition tasks | Numerical issues with high degrees |
| Explicit feature interpretation possible | May struggle with very complex patterns |
The Radial Basis Function (RBF) kernel, also known as the Gaussian kernel, is the most widely used kernel in practice. It's the default choice for kernel SVM and offers extraordinary flexibility.
$$k(\mathbf{x}, \mathbf{z}) = \exp\left(-\gamma \|\mathbf{x} - \mathbf{z}\|^2\right) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{z}\|^2}{2\sigma^2}\right)$$
where:

- $\gamma > 0$ is the width parameter (gamma in sklearn)
- $\sigma > 0$ is the equivalent Gaussian bandwidth, related to $\gamma$ by $\gamma = \frac{1}{2\sigma^2}$
The kernel value depends only on the distance between points, not their absolute positions—hence "radial basis function."
Bounded output: $0 < k(\mathbf{x}, \mathbf{z}) \leq 1$
Locality: Points far apart have near-zero kernel value; the SVM "forgets" distant points when making local predictions.
Universal approximation: The RBF kernel can approximate any continuous function on a compact domain with enough data.
The RBF kernel corresponds to an infinite-dimensional feature space. Using Taylor expansion: $\exp(x) = \sum_{k=0}^{\infty} x^k/k!$, the RBF kernel can be shown to include monomials of ALL degrees. No explicit representation is possible—yet the kernel trick lets us compute inner products in this space directly!
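To see where the infinite-dimensional claim comes from, consider the one-dimensional case $x, z \in \mathbb{R}$ (the multivariate case works the same way):

$$\exp\left(-\gamma (x - z)^2\right) = e^{-\gamma x^2}\, e^{-\gamma z^2}\, e^{2\gamma x z} = e^{-\gamma x^2}\, e^{-\gamma z^2} \sum_{k=0}^{\infty} \frac{(2\gamma)^k}{k!}\, x^k z^k$$

Reading this as an inner product gives the infinite feature map $$\phi(x) = e^{-\gamma x^2}\left(1,\; \sqrt{\tfrac{2\gamma}{1!}}\, x,\; \sqrt{\tfrac{(2\gamma)^2}{2!}}\, x^2,\; \sqrt{\tfrac{(2\gamma)^3}{3!}}\, x^3,\; \dots\right),$$ which contains a rescaled monomial of every degree.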
$\gamma$ controls the "reach" of each training point's influence:
Small $\gamma$ (large $\sigma$): each training point influences a wide region, the kernel decays slowly with distance, and the decision boundary is smooth. Too small a value underfits.

Large $\gamma$ (small $\sigma$): each point's influence is highly local, the boundary can wrap tightly around individual points, and many points end up as support vectors. Too large a value overfits, effectively memorizing the training set.
Rule of thumb: Start with $\gamma = 1/d$ (where $d$ = number of features) or $\gamma = 1/(d \cdot \text{Var}(X))$, then tune via cross-validation.
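A minimal sketch of these heuristics (assuming scikit-learn, whose gamma='scale' default corresponds to the $1/(d \cdot \text{Var}(X))$ rule; the dataset is synthetic):

```python
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Two common starting points for gamma
gamma_by_dim = 1.0 / X.shape[1]               # gamma = 1/d
gamma_scaled = 1.0 / (X.shape[1] * X.var())   # gamma = 1/(d * Var(X)), sklearn's 'scale'

print(f"1/d            = {gamma_by_dim:.4f}")
print(f"1/(d * Var(X)) = {gamma_scaled:.4f}")

# Either value is only a starting point; tune around it with cross-validation
svm = SVC(kernel='rbf', gamma=gamma_scaled, C=1.0).fit(X, y)
print(f"Training accuracy at the heuristic gamma: {svm.score(X, y):.3f}")
```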
| Aspect | Small $\gamma$ (e.g., 0.001) | Medium $\gamma$ (e.g., 0.1) | Large $\gamma$ (e.g., 10) |
|---|---|---|---|
| Kernel decay | Very slow | Moderate | Very fast |
| Influence radius | Large | Medium | Small |
| Decision boundary | Very smooth | Balanced | Highly irregular |
| Support vectors | Few | Moderate | Many (up to all points) |
| Training speed | Fast | Normal | Can be slow |
| Generalization risk | Underfitting | Good balance | Overfitting |
```python
import numpy as np

def rbf_kernel(x: np.ndarray, z: np.ndarray, gamma: float = 1.0) -> float:
    """
    RBF (Gaussian) kernel: k(x,z) = exp(-gamma * ||x-z||^2)

    Parameters:
        x, z: Input vectors
        gamma: Width parameter (1 / (2*sigma^2))

    Returns:
        Kernel value in (0, 1]
    """
    diff = x - z
    return np.exp(-gamma * np.dot(diff, diff))

def rbf_kernel_matrix(X: np.ndarray, gamma: float = 1.0) -> np.ndarray:
    """Compute the RBF kernel matrix for dataset X."""
    n = X.shape[0]
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = rbf_kernel(X[i], X[j], gamma)
            K[j, i] = K[i, j]
    return K

# Demonstrate gamma's effect on kernel values
x = np.array([0.0, 0.0])

print("RBF kernel value decay with distance:")
print("-" * 60)
for gamma in [0.1, 0.5, 1.0, 2.0, 5.0]:
    sigma = np.sqrt(1 / (2 * gamma))
    print(f"\ngamma = {gamma} (sigma = {sigma:.2f}):")
    for dist in [0.5, 1.0, 2.0, 3.0, 5.0]:
        z = np.array([dist, 0.0])
        k_val = rbf_kernel(x, z, gamma)
        print(f"  dist = {dist}: k = {k_val:.4f}")

# Effective neighborhood size
print("\nEffective neighbor radius (kernel > 0.1):")
for gamma in [0.01, 0.1, 1.0, 10.0, 100.0]:
    # k = exp(-gamma * r^2) = 0.1  =>  r = sqrt(-ln(0.1)/gamma)
    radius = np.sqrt(-np.log(0.1) / gamma)
    print(f"  gamma = {gamma:6.2f}: radius = {radius:.2f}")

# Kernel matrix properties for different gamma
np.random.seed(42)
X = np.random.randn(100, 5)  # 100 points in 5D

print("\nKernel matrix properties:")
print("-" * 60)
for gamma in [0.01, 0.1, 1.0, 10.0]:
    K = rbf_kernel_matrix(X, gamma)
    # Sparsity: fraction of kernel values < 0.01
    sparse_frac = np.mean(K < 0.01)
    # Condition number: indicates numerical stability
    eigvals = np.linalg.eigvalsh(K)
    cond = eigvals.max() / max(eigvals.min(), 1e-10)
    print(f"gamma = {gamma:5.2f}: sparsity = {sparse_frac:.1%}, "
          f"condition = {cond:.2e}")

# The connection to bandwidth interpretation
print("\nBandwidth interpretation of gamma:")
print("gamma = 1/(2*sigma^2) where sigma is the 'width' of influence")
for sigma in [0.1, 0.5, 1.0, 2.0, 5.0]:
    gamma = 1 / (2 * sigma**2)
    print(f"  sigma = {sigma}: gamma = {gamma:.4f}")
```

| Advantages | Disadvantages |
|---|---|
| Universal approximator (can fit any pattern) | Requires careful tuning of $\gamma$ |
| Only one hyperparameter to tune | Can easily overfit if $\gamma$ too large |
| Works well across many domains | Less interpretable than linear/polynomial |
| Handles complex decision boundaries | Computationally expensive for large n |
| No numerical issues (bounded output) | Sensitive to feature scaling |
The sigmoid kernel has historical importance due to its connection to neural networks, but it's used less frequently in modern practice.
$$k(\mathbf{x}, \mathbf{z}) = \tanh(\gamma\, \mathbf{x}^T\mathbf{z} + c)$$
where:

- $\gamma$ is a scaling factor applied to the inner product (gamma in sklearn)
- $c$ is an offset term (coef0 in sklearn)
The neural network connection works as follows. A two-layer neural network with $n$ hidden units and tanh activation computes: $$f(\mathbf{x}) = \sum_{i=1}^n \alpha_i \tanh(\mathbf{v}_i^T\mathbf{x} + c_i) + b$$
If we constrain the hidden weights to be training examples ($\mathbf{v}_i = \mathbf{x}_i$), this matches the SVM decision function with sigmoid kernel!
This connection motivated early interest in kernel SVMs as "equivalent" to neural networks, though modern deep learning has shown neural networks' true power comes from learned hierarchical representations.
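The correspondence can be checked numerically. Here is a minimal sketch (assuming scikit-learn's SVC on synthetic data): the fitted decision function with a sigmoid kernel is rebuilt by hand, with the support vectors playing the role of the hidden-unit weight vectors $\mathbf{v}_i$.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X = StandardScaler().fit_transform(X)

gamma, coef0 = 0.1, -1.0
svm = SVC(kernel='sigmoid', gamma=gamma, coef0=coef0, C=1.0).fit(X, y)

# f(x) = sum_i (alpha_i * y_i) * tanh(gamma * x_i^T x + c) + b,
# summed over the support vectors x_i (the "hidden units" of the analogy)
hidden = np.tanh(gamma * X @ svm.support_vectors_.T + coef0)  # (n_samples, n_SV)
manual = hidden @ svm.dual_coef_.ravel() + svm.intercept_[0]

print(np.allclose(manual, svm.decision_function(X)))  # expected: True
```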
Critical warning: The sigmoid kernel is NOT positive definite for all choices of $\gamma$ and $c$. This means the optimization problem may not be convex, and the solution may not be unique or stable. It satisfies Mercer's condition only for certain parameter ranges (roughly, $\gamma > 0$ and $c < 0$).
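A quick way to see the issue on a given dataset is to inspect the eigenvalues of the sigmoid Gram matrix: a valid Mercer kernel always produces a positive semidefinite Gram matrix, so a negative eigenvalue is direct evidence of invalidity. The sketch below uses random synthetic data; whether negative eigenvalues actually appear depends on the data and the parameter choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))  # small random dataset (illustrative only)

def sigmoid_gram(X, gamma, coef0):
    """Gram matrix of the sigmoid 'kernel' tanh(gamma * <x, z> + coef0)."""
    return np.tanh(gamma * X @ X.T + coef0)

for gamma, coef0 in [(1.0, 1.0), (1.0, 0.0), (0.1, 1.0)]:
    K = sigmoid_gram(X, gamma, coef0)
    min_eig = np.linalg.eigvalsh(K).min()
    verdict = "PSD (numerically)" if min_eig > -1e-8 else "not positive semidefinite"
    print(f"gamma={gamma:4.1f}, coef0={coef0:4.1f}: "
          f"min eigenvalue = {min_eig: .4f} -> {verdict}")
```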
Output range: $-1 < k(\mathbf{x}, \mathbf{z}) < 1$
Unlike RBF (always positive) or polynomial (can be arbitrarily large), the sigmoid kernel output is bounded and can be negative.
Parameter effects: $\gamma$ controls how quickly the tanh saturates (larger values push kernel outputs toward $\pm 1$), while $c$ shifts the argument, biasing outputs toward $+1$ when $c > 0$ and toward $-1$ when $c < 0$.
Unlike RBF and polynomial kernels, the sigmoid kernel does not have a clean interpretation as an inner product in a finite-dimensional or well-characterized infinite-dimensional space. It's better understood through its neural network connection.
| Advantages | Disadvantages |
|---|---|
| Neural network connection | Not always a valid kernel (Mercer) |
| Similar behavior to multilayer perceptron | Optimization may be non-convex |
| Historical importance in SVM literature | Rarely the best choice in practice |
| Bounded output values | Less stable than RBF or polynomial |
| Two parameters for tuning | Harder to interpret geometrically |
In modern practice, the sigmoid kernel is rarely used. If you want neural network-like behavior, it's usually better to use an actual neural network. For kernel SVM, RBF or polynomial kernels typically outperform sigmoid. The sigmoid kernel remains of theoretical and historical interest.
Let's consolidate our understanding by comparing all four kernels across multiple dimensions.
| Property | Linear | Polynomial | RBF | Sigmoid |
|---|---|---|---|---|
| Formula | $\mathbf{x}^T\mathbf{z}$ | $(\gamma\mathbf{x}^T\mathbf{z}+c)^p$ | $\exp(-\gamma\Vert\mathbf{x}-\mathbf{z}\Vert^2)$ | $\tanh(\gamma\mathbf{x}^T\mathbf{z}+c)$ |
| Feature dim | $d$ | $\binom{d+p}{p}$ | $\infty$ | Not well-defined |
| Hyperparameters | None | $(\gamma, c, p)$ | $\gamma$ | $(\gamma, c)$ |
| Output range | $(-\infty, \infty)$ | $(-\infty, \infty)$ | $(0, 1]$ | $(-1, 1)$ |
| Mercer valid? | Yes | Yes | Yes | Conditional |
| Computation | $O(d)$ | $O(d)$ | $O(d)$ | $O(d)$ |
| Locality | Global | Global | Local | Mixed |
Linear: the default for high-dimensional, sparse data (text, bag-of-words), for $d \gg n$ problems, and whenever training speed and interpretable weights matter most.

Polynomial: a good fit when you expect smooth, global feature interactions of low order; use degree 2 or 3 with normalized features.

RBF: the general-purpose default. It handles localized, irregular patterns well but requires feature scaling and careful tuning of $\gamma$.

Sigmoid: rarely the best choice in modern practice; mainly of historical and theoretical interest for its neural network connection.
```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

def compare_kernels(X, y, name="Dataset"):
    """Compare kernel performance on a dataset."""
    # Scale features (important for RBF)
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    kernels = [
        ('Linear', 'linear', {}),
        ('Poly (d=2)', 'poly', {'degree': 2, 'coef0': 1}),
        ('Poly (d=3)', 'poly', {'degree': 3, 'coef0': 1}),
        ('RBF (gamma=0.1)', 'rbf', {'gamma': 0.1}),
        ('RBF (gamma=1.0)', 'rbf', {'gamma': 1.0}),
        ('RBF (gamma=10)', 'rbf', {'gamma': 10}),
        ('Sigmoid', 'sigmoid', {'gamma': 0.1, 'coef0': 0}),
    ]

    print(f"\n{name} Results:")
    print("-" * 60)
    for name_k, kernel, params in kernels:
        svm = SVC(kernel=kernel, **params, C=1.0)
        scores = cross_val_score(svm, X_scaled, y, cv=5)
        print(f"  {name_k:20s}: {scores.mean():.3f} ± {scores.std():.3f}")

# Generate different dataset types
np.random.seed(42)

# 1. Linearly separable
X_linear, y_linear = make_classification(n_samples=200, n_features=2,
                                         n_informative=2, n_redundant=0,
                                         n_clusters_per_class=1, random_state=42)
compare_kernels(X_linear, y_linear, "Linearly Separable")

# 2. Moons (nonlinear, smooth boundary)
X_moons, y_moons = make_moons(n_samples=200, noise=0.2, random_state=42)
compare_kernels(X_moons, y_moons, "Moons (Nonlinear)")

# 3. Circles (one class inside another)
X_circles, y_circles = make_circles(n_samples=200, noise=0.1, factor=0.3,
                                    random_state=42)
compare_kernels(X_circles, y_circles, "Concentric Circles")

# 4. XOR pattern
n = 100
X_xor = np.random.randn(4*n, 2) * 0.4
X_xor[:n] += np.array([1, 1])
X_xor[n:2*n] += np.array([-1, -1])
X_xor[2*n:3*n] += np.array([1, -1])
X_xor[3*n:] += np.array([-1, 1])
y_xor = np.array([1]*n + [1]*n + [-1]*n + [-1]*n)
compare_kernels(X_xor, y_xor, "XOR Pattern")

print("\n" + "=" * 60)
print("Key observations:")
print("- Linear works well on linearly separable data")
print("- Polynomial captures smooth nonlinear boundaries")
print("- RBF adapts to complex local patterns")
print("- RBF gamma must be tuned: too high = overfit, too low = underfit")
```

Based on decades of practical experience, here are guidelines for choosing kernels.
Is the data high-dimensional and sparse (e.g., text)? → Start with Linear kernel (often sufficient, very fast)
Is the number of features >> number of samples? → Try Linear kernel first (avoids overfitting)
Do you expect smooth polynomial relationships? → Try Polynomial kernel with low degree (2 or 3)
Is the pattern localized or irregular? → Try RBF kernel (most flexible)
Not sure? Default choice: → RBF kernel with tuned $\gamma$ (rarely a bad choice)
RBF kernel is extremely sensitive to feature scaling! Always standardize or normalize features before using RBF. Without scaling, features with large ranges dominate the distance computation. This is less critical for linear and polynomial kernels but still good practice.
For RBF kernel, typical grid search ranges: $\gamma \in \{10^{-4}, 10^{-3}, \ldots, 10^{2}\}$ (or a grid centered on $1/(d \cdot \text{Var}(X))$) and $C \in \{0.1, 1, 10, 100\}$, both spaced logarithmically.
For polynomial kernel: degree $p \in \{2, 3, 4\}$ (rarely higher), offset $c \in \{0, 1\}$, and $\gamma$ on a coarse logarithmic grid, tuned jointly with $C$.
Practical tip: Start with a coarse grid, then refine around the best region. Use cross-validation (5-fold is standard) to evaluate each configuration.
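Putting the scaling warning and the grids above together, here is a minimal sketch (assuming scikit-learn; the dataset and grid values are illustrative) of a scaled RBF SVM tuned by grid search:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)

# Scaling lives inside the pipeline so it is re-fit on each CV training fold
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("svm", SVC(kernel="rbf")),
])

param_grid = {
    "svm__C": [0.1, 1, 10, 100],
    "svm__gamma": [1e-3, 1e-2, 1e-1, 1, 10],
}

search = GridSearchCV(pipe, param_grid, cv=5)  # 5-fold cross-validation
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")
```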
We've thoroughly explored the four foundational kernels that power most practical applications of kernel SVMs.
In the next page, we'll explore nonlinear decision boundaries in depth: how different kernels shape the decision surface, visualization techniques, understanding the bias-variance tradeoff in kernel space, and diagnosing underfitting vs overfitting.
You now have deep knowledge of the major kernel functions used in SVMs. You understand their mathematical forms, implicit feature spaces, hyperparameters, and practical tradeoffs—the foundation for effective kernel SVM applications.