The kernel function is the core design choice in kernel SVMs—it determines the implicit feature space and thus the expressiveness of the learned classifier. While infinitely many valid kernels exist (and domain-specific kernels can be constructed for specialized applications), a small set of kernels handles the vast majority of practical problems.
In this page, we take a deep dive into the four most influential kernel functions: the linear, polynomial, RBF (Gaussian), and sigmoid kernels.
For each kernel, we'll examine its mathematical form, implicit feature space, hyperparameters, decision boundary characteristics, and practical considerations.
By completing this page, you will: (1) Know the mathematical form and feature space of each major kernel, (2) Understand how hyperparameters affect decision boundaries, (3) Recognize when to use each kernel type, and (4) Appreciate the tradeoffs between expressiveness and overfitting risk.
The simplest kernel is no kernel at all—or more precisely, the kernel that corresponds to the identity feature map.
$$k(\mathbf{x}, \mathbf{z}) = \mathbf{x}^T\mathbf{z} = \sum_{i=1}^d x_i z_i$$
The identity: $\phi(\mathbf{x}) = \mathbf{x}$
The feature space is simply the original input space $\mathbb{R}^d$. There's no transformation; the linear kernel is what we get if we run standard SVM without any kernel at all.
None. The linear kernel has no tunable parameters.
The decision boundary is a hyperplane in the original feature space: $$\mathbf{w}^T\mathbf{x} + b = 0$$
In 2D, this is a straight line. In 3D, a plane. In higher dimensions, a hyperplane.
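Because the feature map is the identity, the fitted model's weight vector lives directly in input space and can be inspected. Below is a minimal sketch (assuming scikit-learn's SVC and a synthetic dataset) that recovers the decision function by hand from $\mathbf{w}$ and $b$:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification

# Synthetic, nearly separable data (illustrative only)
X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           random_state=0)

svm = SVC(kernel='linear', C=1.0).fit(X, y)

# With a linear kernel the boundary is w^T x + b = 0 in the original space,
# so the decision function is just an affine function of the inputs.
w, b = svm.coef_.ravel(), svm.intercept_[0]
manual = X @ w + b

print(np.allclose(manual, svm.decision_function(X)))  # expected: True
print("w =", np.round(w, 3), " b =", round(b, 3))
```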
Linear kernels are surprisingly effective when: (1) The data is already high-dimensional (e.g., text with bag-of-words features), (2) The number of features exceeds the number of samples ($d > n$), (3) The problem is actually linearly separable or nearly so, (4) You need fast training and prediction.
| Advantages | Disadvantages |
|---|---|
| Fast training (primal solvers available) | Cannot capture nonlinear relationships |
| Fast prediction: $O(d)$ per point | May underfit complex data |
| No hyperparameter tuning needed | Decision boundary limited to hyperplane |
| Scales well to large datasets | Poor for problems with feature interactions |
| Interpretable weights | Assumes linearity in input space |
For linear kernels, specialized algorithms exist that are much faster than general kernel SVM: primal solvers such as coordinate descent (as in LIBLINEAR) and stochastic gradient methods (such as Pegasos) work directly with the weight vector $\mathbf{w}$ rather than the dual kernel expansion.
These methods avoid forming the $n \times n$ kernel matrix entirely, enabling training on millions of samples.
For very high-dimensional sparse data (e.g., text classification), linear SVM is often the first algorithm to try—it's fast, scales well, and frequently performs competitively with nonlinear methods.
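As a rough illustration of why these specialized solvers matter, the sketch below (assuming scikit-learn, where LinearSVC wraps the liblinear solver while SVC with kernel='linear' goes through libsvm and the kernel matrix) times both on the same synthetic problem. Exact timings depend on your machine and data.

```python
import time
from sklearn.svm import SVC, LinearSVC
from sklearn.datasets import make_classification

# Moderately sized synthetic problem (sizes are illustrative)
X, y = make_classification(n_samples=5000, n_features=300, random_state=0)

models = [
    ("LinearSVC (liblinear)", LinearSVC(C=1.0, dual=False, max_iter=5000)),
    ("SVC(kernel='linear')", SVC(kernel="linear", C=1.0)),
]

for name, model in models:
    t0 = time.perf_counter()
    model.fit(X, y)
    elapsed = time.perf_counter() - t0
    print(f"{name:25s} fit time: {elapsed:.2f}s, "
          f"train accuracy: {model.score(X, y):.3f}")
```

Note that LinearSVC optimizes a slightly different objective (squared hinge loss, regularized intercept), so the two models are close but not identical.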
The polynomial kernel explicitly models feature interactions up to a specified degree, enabling polynomial decision boundaries.
$$k(\mathbf{x}, \mathbf{z}) = (\gamma\, \mathbf{x}^T\mathbf{z} + c)^p$$
Alternative parameterization (common in sklearn): $$k(\mathbf{x}, \mathbf{z}) = (\gamma\, \mathbf{x}^T\mathbf{z} + r)^d$$
where:

- $\gamma > 0$ is a scaling factor applied to the inner product (gamma in sklearn)
- $c \geq 0$ is an offset that mixes in lower-order terms (coef0 in sklearn)
- $p$ is the polynomial degree (degree in sklearn)

The polynomial kernel corresponds to a feature space containing all monomials up to degree $p$.
Example: For $\mathbf{x} = (x_1, x_2)$ and degree $p = 2$ with $c = 1$: $$k(\mathbf{x}, \mathbf{z}) = (x_1 z_1 + x_2 z_2 + 1)^2$$
Expanding: $$= x_1^2 z_1^2 + x_2^2 z_2^2 + 2x_1 x_2 z_1 z_2 + 2x_1 z_1 + 2x_2 z_2 + 1$$
This corresponds to the feature map: $$\phi(\mathbf{x}) = (x_1^2, x_2^2, \sqrt{2}x_1 x_2, \sqrt{2}x_1, \sqrt{2}x_2, 1)$$
Six features from two original features!
For degree $p$ and original dimension $d$, the feature space dimension is $\binom{d+p}{p}$. For $d=100$ and $p=3$: 176,851 features. For $p=5$: 96,560,646 features! The kernel trick makes this tractable by computing the inner product directly without forming the features.
Degree $p$: sets the highest order of feature interaction the kernel can represent. $p = 1$ reduces to a (scaled, shifted) linear kernel; $p = 2$ or $3$ is most common; higher degrees give more flexible boundaries but increase overfitting risk and numerical instability.

Offset $c$: balances higher-order against lower-order terms. With $c = 0$ (the homogeneous kernel) only degree-$p$ monomials appear; with $c > 0$ all monomials up to degree $p$ are included.

Scaling $\gamma$: rescales the inner product before the power is taken. Its main practical role is keeping kernel values in a reasonable numerical range (large inner products raised to a high power can overflow); a common default is $\gamma = 1/d$.
Polynomial kernels produce polynomial decision boundaries: the boundary is a degree-$p$ algebraic surface in input space, so a degree-2 kernel can carve out conic sections (ellipses, parabolas, hyperbolas), while higher degrees allow increasingly curved shapes.
```python
import numpy as np
from math import comb

def polynomial_kernel(x: np.ndarray, z: np.ndarray, degree: int = 3,
                      gamma: float = None, coef0: float = 1.0) -> float:
    """
    Polynomial kernel: k(x,z) = (gamma * x^T z + coef0)^degree

    Parameters:
        x, z: Input vectors
        degree: Polynomial degree (default 3)
        gamma: Scaling factor (default 1/n_features)
        coef0: Offset parameter (default 1.0)
    """
    if gamma is None:
        gamma = 1.0 / len(x)
    return (gamma * np.dot(x, z) + coef0) ** degree

def explicit_degree2_features(x: np.ndarray, c: float = 1.0) -> np.ndarray:
    """
    Explicitly compute degree-2 polynomial features for 2D input.
    Demonstrates what the polynomial kernel implicitly computes.
    """
    x1, x2 = x
    sqrt_2 = np.sqrt(2)
    sqrt_2c = np.sqrt(2 * c)
    # Features corresponding to (x1*z1 + x2*z2 + c)^2:
    # phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2, sqrt(2c)*x1, sqrt(2c)*x2, c)
    return np.array([
        x1**2,             # produces the x1^2 * z1^2 term
        x2**2,             # produces the x2^2 * z2^2 term
        sqrt_2 * x1 * x2,  # produces the 2*x1*x2*z1*z2 cross term
        sqrt_2c * x1,      # produces the 2c*x1*z1 linear term
        sqrt_2c * x2,      # produces the 2c*x2*z2 linear term
        c,                 # produces the c^2 constant term
    ])

# Verify the kernel trick for the polynomial kernel
x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])
c = 1.0

# Direct kernel computation
kernel_value = polynomial_kernel(x, z, degree=2, gamma=1.0, coef0=c)

# Inner product of the explicit degree-2 feature vectors
feature_value = np.dot(explicit_degree2_features(x, c),
                       explicit_degree2_features(z, c))

# Algebraic expansion:
# (x1*z1 + x2*z2 + c)^2 =
#   x1^2*z1^2 + x2^2*z2^2 + c^2 + 2*x1*z1*x2*z2 + 2*c*x1*z1 + 2*c*x2*z2
expanded = (x[0]**2 * z[0]**2 + x[1]**2 * z[1]**2 + c**2
            + 2*x[0]*z[0]*x[1]*z[1] + 2*c*x[0]*z[0] + 2*c*x[1]*z[1])

print("Polynomial kernel verification:")
print(f"  Direct:   (x·z + c)^2 = ({np.dot(x, z)} + {c})^2 = {kernel_value}")
print(f"  Features: phi(x)·phi(z) = {feature_value}")
print(f"  Expanded: {expanded}")
print(f"  Match: {np.isclose(kernel_value, expanded) and np.isclose(kernel_value, feature_value)}")

# Feature space dimension growth
print("\nFeature space dimensions for polynomial kernels:")
print("-" * 50)
for d in [2, 10, 50, 100]:
    for p in [2, 3, 5]:
        dim = comb(d + p, p)
        print(f"  Original d={d:3d}, degree p={p}: {dim:>15,} features")

# Computational comparison
print("\nComputational comparison (d=100, degree 5):")
d, p = 100, 5
kernel_ops = d + 1  # dot product + power
explicit_dim = comb(d + p, p)
print(f"  Kernel computation: O({kernel_ops}) operations")
print(f"  Explicit features:  O({explicit_dim:,}) operations to compute")
print(f"  Speedup factor:     {explicit_dim / kernel_ops:,.0f}x")
```

| Advantages | Disadvantages |
|---|---|
| Captures feature interactions naturally | Degree selection can be difficult |
| Smooth, interpretable decision boundaries | Higher degrees risk overfitting |
| Works well for normalized features | Computationally slower than linear |
| Good for image recognition tasks | Numerical issues with high degrees |
| Explicit feature interpretation possible | May struggle with very complex patterns |
The Radial Basis Function (RBF) kernel, also known as the Gaussian kernel, is the most widely used kernel in practice. It's the default choice for kernel SVM and offers extraordinary flexibility.
$$k(\mathbf{x}, \mathbf{z}) = \exp\left(-\gamma \|\mathbf{x} - \mathbf{z}\|^2\right) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{z}\|^2}{2\sigma^2}\right)$$
where:

- $\gamma > 0$ is the width parameter (gamma in sklearn)
- $\sigma > 0$ is the equivalent Gaussian bandwidth, related to $\gamma$ by $\gamma = \frac{1}{2\sigma^2}$
The kernel value depends only on the distance between points, not their absolute positions—hence "radial basis function."
Bounded output: $0 < k(\mathbf{x}, \mathbf{z}) \leq 1$
Locality: Points far apart have near-zero kernel value; the SVM "forgets" distant points when making local predictions.
Universal approximation: The RBF kernel can approximate any continuous function on a compact domain with enough data.
The RBF kernel corresponds to an infinite-dimensional feature space. Using Taylor expansion: $\exp(x) = \sum_{k=0}^{\infty} x^k/k!$, the RBF kernel can be shown to include monomials of ALL degrees. No explicit representation is possible—yet the kernel trick lets us compute inner products in this space directly!
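To see where the infinite-dimensional claim comes from, consider the one-dimensional case $x, z \in \mathbb{R}$ (the multivariate case works the same way):

$$\exp\left(-\gamma (x - z)^2\right) = e^{-\gamma x^2}\, e^{-\gamma z^2}\, e^{2\gamma x z} = e^{-\gamma x^2}\, e^{-\gamma z^2} \sum_{k=0}^{\infty} \frac{(2\gamma)^k}{k!}\, x^k z^k$$

Reading this as an inner product gives the infinite feature map $$\phi(x) = e^{-\gamma x^2}\left(1,\; \sqrt{\tfrac{2\gamma}{1!}}\, x,\; \sqrt{\tfrac{(2\gamma)^2}{2!}}\, x^2,\; \sqrt{\tfrac{(2\gamma)^3}{3!}}\, x^3,\; \dots\right),$$ which contains a rescaled monomial of every degree.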
$\gamma$ controls the "reach" of each training point's influence:
Small $\gamma$ (large $\sigma$): each training point influences a wide region, the kernel decays slowly with distance, and the decision boundary is smooth. Too small a value underfits.

Large $\gamma$ (small $\sigma$): each point's influence is highly local, the boundary can wrap tightly around individual points, and many points end up as support vectors. Too large a value overfits, effectively memorizing the training set.
Rule of thumb: Start with $\gamma = 1/d$ (where $d$ = number of features) or $\gamma = 1/(d \cdot \text{Var}(X))$, then tune via cross-validation.
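A minimal sketch of these heuristics (assuming scikit-learn, whose gamma='scale' default corresponds to the $1/(d \cdot \text{Var}(X))$ rule; the dataset is synthetic):

```python
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Two common starting points for gamma
gamma_by_dim = 1.0 / X.shape[1]               # gamma = 1/d
gamma_scaled = 1.0 / (X.shape[1] * X.var())   # gamma = 1/(d * Var(X)), sklearn's 'scale'

print(f"1/d            = {gamma_by_dim:.4f}")
print(f"1/(d * Var(X)) = {gamma_scaled:.4f}")

# Either value is only a starting point; tune around it with cross-validation
svm = SVC(kernel='rbf', gamma=gamma_scaled, C=1.0).fit(X, y)
print(f"Training accuracy at the heuristic gamma: {svm.score(X, y):.3f}")
```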
| Aspect | Small $\gamma$ (e.g., 0.001) | Medium $\gamma$ (e.g., 0.1) | Large $\gamma$ (e.g., 10) |
|---|---|---|---|
| Kernel decay | Very slow | Moderate | Very fast |
| Influence radius | Large | Medium | Small |
| Decision boundary | Very smooth | Balanced | Highly irregular |
| Support vectors | Few | Moderate | Many (up to all points) |
| Training speed | Fast | Normal | Can be slow |
| Generalization risk | Underfitting | Good balance | Overfitting |
```python
import numpy as np

def rbf_kernel(x: np.ndarray, z: np.ndarray, gamma: float = 1.0) -> float:
    """
    RBF (Gaussian) kernel: k(x,z) = exp(-gamma * ||x-z||^2)

    Parameters:
        x, z: Input vectors
        gamma: Width parameter (1 / (2*sigma^2))

    Returns:
        Kernel value in (0, 1]
    """
    diff = x - z
    return np.exp(-gamma * np.dot(diff, diff))

def rbf_kernel_matrix(X: np.ndarray, gamma: float = 1.0) -> np.ndarray:
    """Compute the RBF kernel matrix for dataset X."""
    n = X.shape[0]
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = rbf_kernel(X[i], X[j], gamma)
            K[j, i] = K[i, j]
    return K

# Demonstrate gamma's effect on kernel values
x = np.array([0.0, 0.0])

print("RBF kernel value decay with distance:")
print("-" * 60)
for gamma in [0.1, 0.5, 1.0, 2.0, 5.0]:
    sigma = np.sqrt(1 / (2 * gamma))
    print(f"\ngamma = {gamma} (sigma = {sigma:.2f}):")
    for dist in [0.5, 1.0, 2.0, 3.0, 5.0]:
        z = np.array([dist, 0.0])
        k_val = rbf_kernel(x, z, gamma)
        print(f"  dist = {dist}: k = {k_val:.4f}")

# Effective neighborhood size
print("\nEffective neighbor radius (kernel > 0.1):")
for gamma in [0.01, 0.1, 1.0, 10.0, 100.0]:
    # k = exp(-gamma * r^2) = 0.1  =>  r = sqrt(-ln(0.1)/gamma)
    radius = np.sqrt(-np.log(0.1) / gamma)
    print(f"  gamma = {gamma:6.2f}: radius = {radius:.2f}")

# Kernel matrix properties for different gamma
np.random.seed(42)
X = np.random.randn(100, 5)  # 100 points in 5D

print("\nKernel matrix properties:")
print("-" * 60)
for gamma in [0.01, 0.1, 1.0, 10.0]:
    K = rbf_kernel_matrix(X, gamma)
    # Sparsity: fraction of kernel values < 0.01
    sparse_frac = np.mean(K < 0.01)
    # Condition number: indicates numerical stability
    eigvals = np.linalg.eigvalsh(K)
    cond = eigvals.max() / max(eigvals.min(), 1e-10)
    print(f"gamma = {gamma:5.2f}: sparsity = {sparse_frac:.1%}, "
          f"condition = {cond:.2e}")

# The connection to bandwidth interpretation
print("\nBandwidth interpretation of gamma:")
print("gamma = 1/(2*sigma^2) where sigma is the 'width' of influence")
for sigma in [0.1, 0.5, 1.0, 2.0, 5.0]:
    gamma = 1 / (2 * sigma**2)
    print(f"  sigma = {sigma}: gamma = {gamma:.4f}")
```

| Advantages | Disadvantages |
|---|---|
| Universal approximator (can fit any pattern) | Requires careful tuning of $\gamma$ |
| Only one hyperparameter to tune | Can easily overfit if $\gamma$ too large |
| Works well across many domains | Less interpretable than linear/polynomial |
| Handles complex decision boundaries | Computationally expensive for large n |
| No numerical issues (bounded output) | Sensitive to feature scaling |
The sigmoid kernel has historical importance due to its connection to neural networks, but it's used less frequently in modern practice.
$$k(\mathbf{x}, \mathbf{z}) = \tanh(\gamma\, \mathbf{x}^T\mathbf{z} + c)$$
where:

- $\gamma$ is a scaling factor applied to the inner product (gamma in sklearn)
- $c$ is an offset term (coef0 in sklearn)
The neural network connection works as follows. A two-layer neural network with $n$ hidden units and tanh activation computes: $$f(\mathbf{x}) = \sum_{i=1}^n \alpha_i \tanh(\mathbf{v}_i^T\mathbf{x} + c_i) + b$$
If we constrain the hidden weights to be training examples ($\mathbf{v}_i = \mathbf{x}_i$), this matches the SVM decision function with sigmoid kernel!
This connection motivated early interest in kernel SVMs as "equivalent" to neural networks, though modern deep learning has shown neural networks' true power comes from learned hierarchical representations.
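The correspondence can be checked numerically. Here is a minimal sketch (assuming scikit-learn's SVC on synthetic data): the fitted decision function with a sigmoid kernel is rebuilt by hand, with the support vectors playing the role of the hidden-unit weight vectors $\mathbf{v}_i$.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X = StandardScaler().fit_transform(X)

gamma, coef0 = 0.1, -1.0
svm = SVC(kernel='sigmoid', gamma=gamma, coef0=coef0, C=1.0).fit(X, y)

# f(x) = sum_i (alpha_i * y_i) * tanh(gamma * x_i^T x + c) + b,
# summed over the support vectors x_i (the "hidden units" of the analogy)
hidden = np.tanh(gamma * X @ svm.support_vectors_.T + coef0)  # (n_samples, n_SV)
manual = hidden @ svm.dual_coef_.ravel() + svm.intercept_[0]

print(np.allclose(manual, svm.decision_function(X)))  # expected: True
```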
Critical warning: The sigmoid kernel is NOT positive definite for all choices of $\gamma$ and $c$. This means the optimization problem may not be convex, and the solution may not be unique or stable. It satisfies Mercer's condition only for certain parameter ranges (roughly, $\gamma > 0$ and $c < 0$).
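A quick way to see the issue on a given dataset is to inspect the eigenvalues of the sigmoid Gram matrix: a valid Mercer kernel always produces a positive semidefinite Gram matrix, so a negative eigenvalue is direct evidence of invalidity. The sketch below uses random synthetic data; whether negative eigenvalues actually appear depends on the data and the parameter choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))  # small random dataset (illustrative only)

def sigmoid_gram(X, gamma, coef0):
    """Gram matrix of the sigmoid 'kernel' tanh(gamma * <x, z> + coef0)."""
    return np.tanh(gamma * X @ X.T + coef0)

for gamma, coef0 in [(1.0, 1.0), (1.0, 0.0), (0.1, 1.0)]:
    K = sigmoid_gram(X, gamma, coef0)
    min_eig = np.linalg.eigvalsh(K).min()
    verdict = "PSD (numerically)" if min_eig > -1e-8 else "not positive semidefinite"
    print(f"gamma={gamma:4.1f}, coef0={coef0:4.1f}: "
          f"min eigenvalue = {min_eig: .4f} -> {verdict}")
```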
Output range: $-1 < k(\mathbf{x}, \mathbf{z}) < 1$
Unlike RBF (always positive) or polynomial (can be arbitrarily large), the sigmoid kernel output is bounded and can be negative.
Parameter effects: $\gamma$ controls how quickly the tanh saturates (larger values push kernel outputs toward $\pm 1$), while $c$ shifts the argument, biasing outputs toward $+1$ when $c > 0$ and toward $-1$ when $c < 0$.
Unlike RBF and polynomial kernels, the sigmoid kernel does not have a clean interpretation as an inner product in a finite-dimensional or well-characterized infinite-dimensional space. It's better understood through its neural network connection.
| Advantages | Disadvantages |
|---|---|
| Neural network connection | Not always a valid kernel (Mercer) |
| Similar behavior to multilayer perceptron | Optimization may be non-convex |
| Historical importance in SVM literature | Rarely the best choice in practice |
| Bounded output values | Less stable than RBF or polynomial |
| Two parameters for tuning | Harder to interpret geometrically |
In modern practice, the sigmoid kernel is rarely used. If you want neural network-like behavior, it's usually better to use an actual neural network. For kernel SVM, RBF or polynomial kernels typically outperform sigmoid. The sigmoid kernel remains of theoretical and historical interest.
Let's consolidate our understanding by comparing all four kernels across multiple dimensions.
| Property | Linear | Polynomial | RBF | Sigmoid |
|---|---|---|---|---|
| Formula | $\mathbf{x}^T\mathbf{z}$ | $(\gamma\mathbf{x}^T\mathbf{z}+c)^p$ | $\exp(-\gamma\Vert\mathbf{x}-\mathbf{z}\Vert^2)$ | $\tanh(\gamma\mathbf{x}^T\mathbf{z}+c)$ |
| Feature dim | $d$ | $\binom{d+p}{p}$ | $\infty$ | Not well-defined |
| Hyperparameters | None | $(\gamma, c, p)$ | $\gamma$ | $(\gamma, c)$ |
| Output range | $(-\infty, \infty)$ | $(-\infty, \infty)$ | $(0, 1]$ | $(-1, 1)$ |
| Mercer valid? | Yes | Yes | Yes | Conditional |
| Computation | $O(d)$ | $O(d)$ | $O(d)$ | $O(d)$ |
| Locality | Global | Global | Local | Mixed |
Linear: the default for high-dimensional, sparse data (text, bag-of-words), for $d \gg n$ problems, and whenever training speed and interpretable weights matter most.

Polynomial: a good fit when you expect smooth, global feature interactions of low order; use degree 2 or 3 with normalized features.

RBF: the general-purpose default. It handles localized, irregular patterns well but requires feature scaling and careful tuning of $\gamma$.

Sigmoid: rarely the best choice in modern practice; mainly of historical and theoretical interest for its neural network connection.
```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

def compare_kernels(X, y, name="Dataset"):
    """Compare kernel performance on a dataset."""
    # Scale features (important for RBF)
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    kernels = [
        ('Linear', 'linear', {}),
        ('Poly (d=2)', 'poly', {'degree': 2, 'coef0': 1}),
        ('Poly (d=3)', 'poly', {'degree': 3, 'coef0': 1}),
        ('RBF (gamma=0.1)', 'rbf', {'gamma': 0.1}),
        ('RBF (gamma=1.0)', 'rbf', {'gamma': 1.0}),
        ('RBF (gamma=10)', 'rbf', {'gamma': 10}),
        ('Sigmoid', 'sigmoid', {'gamma': 0.1, 'coef0': 0}),
    ]

    print(f"\n{name} Results:")
    print("-" * 60)
    for name_k, kernel, params in kernels:
        svm = SVC(kernel=kernel, **params, C=1.0)
        scores = cross_val_score(svm, X_scaled, y, cv=5)
        print(f"  {name_k:20s}: {scores.mean():.3f} ± {scores.std():.3f}")

# Generate different dataset types
np.random.seed(42)

# 1. Linearly separable
X_linear, y_linear = make_classification(n_samples=200, n_features=2,
                                         n_informative=2, n_redundant=0,
                                         n_clusters_per_class=1, random_state=42)
compare_kernels(X_linear, y_linear, "Linearly Separable")

# 2. Moons (nonlinear, smooth boundary)
X_moons, y_moons = make_moons(n_samples=200, noise=0.2, random_state=42)
compare_kernels(X_moons, y_moons, "Moons (Nonlinear)")

# 3. Circles (one class inside another)
X_circles, y_circles = make_circles(n_samples=200, noise=0.1, factor=0.3,
                                    random_state=42)
compare_kernels(X_circles, y_circles, "Concentric Circles")

# 4. XOR pattern
n = 100
X_xor = np.random.randn(4*n, 2) * 0.4
X_xor[:n] += np.array([1, 1])
X_xor[n:2*n] += np.array([-1, -1])
X_xor[2*n:3*n] += np.array([1, -1])
X_xor[3*n:] += np.array([-1, 1])
y_xor = np.array([1]*n + [1]*n + [-1]*n + [-1]*n)
compare_kernels(X_xor, y_xor, "XOR Pattern")

print("\n" + "=" * 60)
print("Key observations:")
print("- Linear works well on linearly separable data")
print("- Polynomial captures smooth nonlinear boundaries")
print("- RBF adapts to complex local patterns")
print("- RBF gamma must be tuned: too high = overfit, too low = underfit")
```

Based on decades of practical experience, here are guidelines for choosing kernels.
Is the data high-dimensional and sparse (e.g., text)? → Start with Linear kernel (often sufficient, very fast)
Is the number of features >> number of samples? → Try Linear kernel first (avoids overfitting)
Do you expect smooth polynomial relationships? → Try Polynomial kernel with low degree (2 or 3)
Is the pattern localized or irregular? → Try RBF kernel (most flexible)
Not sure? Default choice: → RBF kernel with tuned $\gamma$ (rarely a bad choice)
RBF kernel is extremely sensitive to feature scaling! Always standardize or normalize features before using RBF. Without scaling, features with large ranges dominate the distance computation. This is less critical for linear and polynomial kernels but still good practice.
For RBF kernel, typical grid search ranges: $\gamma \in \{10^{-4}, 10^{-3}, \ldots, 10^{2}\}$ (or a grid centered on $1/(d \cdot \text{Var}(X))$) and $C \in \{0.1, 1, 10, 100\}$, both spaced logarithmically.
For polynomial kernel: degree $p \in \{2, 3, 4\}$ (rarely higher), offset $c \in \{0, 1\}$, and $\gamma$ on a coarse logarithmic grid, tuned jointly with $C$.
Practical tip: Start with a coarse grid, then refine around the best region. Use cross-validation (5-fold is standard) to evaluate each configuration.
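Putting the scaling warning and the grids above together, here is a minimal sketch (assuming scikit-learn; the dataset and grid values are illustrative) of a scaled RBF SVM tuned by grid search:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)

# Scaling lives inside the pipeline so it is re-fit on each CV training fold
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("svm", SVC(kernel="rbf")),
])

param_grid = {
    "svm__C": [0.1, 1, 10, 100],
    "svm__gamma": [1e-3, 1e-2, 1e-1, 1, 10],
}

search = GridSearchCV(pipe, param_grid, cv=5)  # 5-fold cross-validation
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")
```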
We've thoroughly explored the four foundational kernels that power most practical applications of kernel SVMs.
In the next page, we'll explore nonlinear decision boundaries in depth: how different kernels shape the decision surface, visualization techniques, understanding the bias-variance tradeoff in kernel space, and diagnosing underfitting vs overfitting.
You now have deep knowledge of the major kernel functions used in SVMs. You understand their mathematical forms, implicit feature spaces, hyperparameters, and practical tradeoffs—the foundation for effective kernel SVM applications.