In the previous page, we established that the dual SVM formulation operates entirely through inner products between data points. This structural property opens a profound possibility: what if we replace these inner products with something more powerful?
The kernel trick is precisely this substitution. Instead of computing the standard Euclidean inner product $\mathbf{x}_i^T\mathbf{x}_j$, we compute a kernel function $k(\mathbf{x}_i, \mathbf{x}_j)$ that corresponds to an inner product in some (possibly very high-dimensional or infinite-dimensional) feature space. The SVM algorithm proceeds exactly as before, but now operates implicitly in this richer space.
This simple substitution transforms SVMs from linear classifiers—limited to hyperplane decision boundaries—into flexible nonlinear classifiers: with a suitable kernel (such as the RBF kernel), they can represent arbitrarily complex decision boundaries.
By the end of this page, you will understand: (1) the precise mechanics of kernel substitution in SVM training and prediction, (2) how kernel functions correspond to implicit feature mappings, (3) the mathematical justification via Mercer's theorem, and (4) the computational advantages of working with kernels instead of explicit features.
Let's formalize the kernel substitution and understand exactly what changes—and what stays the same.
Recall the dual SVM problem:
$$\max_{\boldsymbol{\alpha}} \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \underbrace{\mathbf{x}_i^T\mathbf{x}_j}_{\text{inner product}}$$
subject to $\sum_i \alpha_i y_i = 0$ and $0 \leq \alpha_i \leq C$.
The prediction function is: $$f(\mathbf{x}) = \sum_{i=1}^n \alpha_i y_i \underbrace{\mathbf{x}_i^T\mathbf{x}}_{\text{inner product}} + b$$
Every appearance of the data is through inner products.
Now suppose we have a feature map $\phi: \mathbb{R}^d \to \mathcal{H}$ that maps data points to a (possibly infinite-dimensional) Hilbert space $\mathcal{H}$. If we apply this map and run SVM in the new space:
$$\max_{\boldsymbol{\alpha}} \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \underbrace{\langle\phi(\mathbf{x}_i), \phi(\mathbf{x}_j)\rangle}_{\text{inner product in }\mathcal{H}}$$
Define the kernel function: $$k(\mathbf{x}, \mathbf{z}) = \langle\phi(\mathbf{x}), \phi(\mathbf{z})\rangle$$
The kernel function computes the inner product in feature space directly from the original data—without explicitly computing $\phi(\mathbf{x})$.
The kernelized dual becomes: $$\max_{\boldsymbol{\alpha}} \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \, k(\mathbf{x}_i, \mathbf{x}_j)$$
And the prediction function: $$f(\mathbf{x}) = \sum_{i=1}^n \alpha_i y_i \, k(\mathbf{x}_i, \mathbf{x}) + b$$
The kernel trick is remarkably simple: everywhere we see an inner product $\mathbf{x}^T\mathbf{z}$, we substitute the kernel $k(\mathbf{x}, \mathbf{z})$. The optimization algorithm, the constraints, the solution structure—everything else remains unchanged. Yet this simple substitution enables nonlinear decision boundaries of arbitrary complexity.
| Component | Linear SVM | Kernelized SVM |
|---|---|---|
| Dual objective | $\sum_i \alpha_i - \frac{1}{2}\sum_{ij} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j$ | $\sum_i \alpha_i - \frac{1}{2}\sum_{ij} \alpha_i \alpha_j y_i y_j k(\mathbf{x}_i, \mathbf{x}_j)$ |
| Constraints | $\sum_i \alpha_i y_i = 0$, $0 \leq \alpha_i \leq C$ | Identical |
| Gram matrix | $K_{ij} = \mathbf{x}_i^T\mathbf{x}_j$ | $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$ |
| Prediction | $f(\mathbf{x}) = \sum_i \alpha_i y_i \mathbf{x}_i^T\mathbf{x} + b$ | $f(\mathbf{x}) = \sum_i \alpha_i y_i k(\mathbf{x}_i, \mathbf{x}) + b$ |
| Decision boundary | Hyperplane in $\mathbb{R}^d$ | Hyperplane in $\mathcal{H}$ → nonlinear in $\mathbb{R}^d$ |
| Feature dimension | Must be finite, explicit | Can be infinite, implicit |
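To make the comparison concrete, here is a minimal sketch of the point the table makes: the dual objective code is identical in both cases, and only the Gram matrix changes. The helper names (`gram_matrix`, `dual_objective`) are illustrative, not from any library.

```python
import numpy as np

def gram_matrix(X, kernel):
    """Build K with K[i, j] = kernel(x_i, x_j); the linear case is just a dot product."""
    n = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

def dual_objective(alpha, y, K):
    """Dual SVM objective: sum(alpha) - 0.5 * sum_ij alpha_i alpha_j y_i y_j K_ij."""
    return alpha.sum() - 0.5 * (alpha * y) @ K @ (alpha * y)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
y = np.sign(rng.normal(size=10))
alpha = rng.uniform(size=10)

K_linear = gram_matrix(X, lambda x, z: x @ z)                     # linear SVM
K_rbf = gram_matrix(X, lambda x, z: np.exp(-np.sum((x - z)**2)))  # kernelized SVM

# Same objective code, different Gram matrix -- that is the entire kernel trick.
print(dual_objective(alpha, y, K_linear), dual_objective(alpha, y, K_rbf))
```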
To build intuition, let's examine several feature maps and their corresponding kernels. This illustrates how complex feature spaces can be accessed through simple kernel functions.
Consider 2D data $\mathbf{x} = (x_1, x_2)$. Define the feature map: $$\phi(\mathbf{x}) = (x_1^2, x_2^2, \sqrt{2}x_1 x_2)$$
This maps to 3D. The feature space inner product is: $$\langle\phi(\mathbf{x}), \phi(\mathbf{z})\rangle = x_1^2 z_1^2 + x_2^2 z_2^2 + 2x_1 x_2 z_1 z_2$$
Recognize this as $(x_1 z_1 + x_2 z_2)^2 = (\mathbf{x}^T\mathbf{z})^2$.
Kernel: $k(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T\mathbf{z})^2$
We can compute this quadratic kernel in $O(d)$ time—the same as the linear kernel—yet we're implicitly operating in a space of $O(d^2)$ features.
Consider a slightly different feature map for 2D data: $$\phi(\mathbf{x}) = (x_1^2, x_2^2, \sqrt{2}x_1 x_2, \sqrt{2c}x_1, \sqrt{2c}x_2, c)$$
This maps to 6D (for general $d$, the dimension is $\binom{d+2}{2}$). The inner product works out to: $$\langle\phi(\mathbf{x}), \phi(\mathbf{z})\rangle = (\mathbf{x}^T\mathbf{z} + c)^2$$
Kernel: $k(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T\mathbf{z} + c)^2$
More generally, the degree-$p$ polynomial kernel is: $$k(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T\mathbf{z} + c)^p$$
This corresponds to a feature space of dimension $\binom{d+p}{p}$, which grows combinatorially with $d$ and $p$. For $d=100$ and $p=5$, the explicit feature space has over 96 million dimensions—yet kernel evaluation takes just $O(d)$ time!
The parameter $c$ in $(\mathbf{x}^T\mathbf{z} + c)^p$ controls the relative weighting of lower vs higher degree terms. When $c = 0$, only degree-$p$ terms appear. When $c > 0$, monomials of all degrees up to $p$ are included. This provides flexibility in matching the complexity of the decision boundary to the data.
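As a quick numerical check (a sketch; `phi_quadratic` is an illustrative name), the 6-dimensional map above does reproduce $(\mathbf{x}^T\mathbf{z} + c)^2$:

```python
import numpy as np

def phi_quadratic(x, c=1.0):
    """Explicit feature map for the inhomogeneous quadratic kernel in 2D:
    phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2, sqrt(2c)*x1, sqrt(2c)*x2, c)."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2)*x1*x2,
                     np.sqrt(2*c)*x1, np.sqrt(2*c)*x2, c])

x, z, c = np.array([1.0, 2.0]), np.array([3.0, -1.0]), 1.0
explicit = np.dot(phi_quadratic(x, c), phi_quadratic(z, c))  # inner product in 6D
via_kernel = (np.dot(x, z) + c) ** 2                         # O(d) kernel evaluation
print(explicit, via_kernel, np.isclose(explicit, via_kernel))
```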
The most powerful example: the Radial Basis Function (RBF) or Gaussian kernel: $$k(\mathbf{x}, \mathbf{z}) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{z}\|^2}{2\sigma^2}\right) = \exp\left(-\gamma\|\mathbf{x} - \mathbf{z}\|^2\right)$$
where $\gamma = 1/(2\sigma^2)$ is a common reparameterization.
The corresponding feature map has infinite dimensions.
To see why, write the kernel as $$k(\mathbf{x}, \mathbf{z}) = \exp(-\gamma\|\mathbf{x}\|^2)\,\exp(-\gamma\|\mathbf{z}\|^2)\,\exp(2\gamma\,\mathbf{x}^T\mathbf{z})$$ and Taylor-expand the last factor: $$\exp(2\gamma\,\mathbf{x}^T\mathbf{z}) = \sum_{k=0}^{\infty} \frac{(2\gamma\,\mathbf{x}^T\mathbf{z})^k}{k!}$$ Each term $(\mathbf{x}^T\mathbf{z})^k$ is a polynomial kernel of degree $k$, contributing all monomial features of that degree.
The Gaussian kernel can be shown to correspond to a feature space with one dimension for every possible monomial of every degree. This is a genuine infinite-dimensional space—impossible to represent explicitly, yet perfectly accessible through the kernel.
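The sketch below makes this concrete in one dimension (all names are illustrative): truncating the expansion at degree $K$ gives an explicit, finite feature map whose inner product converges to the Gaussian kernel value as $K$ grows.

```python
import numpy as np
from math import factorial

def phi_rbf_truncated(x, gamma=0.5, K=10):
    """First K+1 coordinates of the (infinite) Gaussian-kernel feature map for scalar x:
    phi_k(x) = exp(-gamma*x^2) * sqrt((2*gamma)^k / k!) * x^k."""
    return np.array([np.exp(-gamma * x**2) * np.sqrt((2*gamma)**k / factorial(k)) * x**k
                     for k in range(K + 1)])

x, z, gamma = 0.8, -0.3, 0.5
exact = np.exp(-gamma * (x - z)**2)
for K in [1, 3, 5, 10, 20]:
    approx = np.dot(phi_rbf_truncated(x, gamma, K), phi_rbf_truncated(z, gamma, K))
    print(f"K={K:2d}: truncated feature inner product = {approx:.8f} (exact kernel = {exact:.8f})")
```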
This is the profound power of kernels: they allow us to operate in infinite-dimensional spaces with finite computation.
```python
import numpy as np

def linear_kernel(x: np.ndarray, z: np.ndarray) -> float:
    """
    Linear kernel: k(x,z) = x^T z
    Corresponds to the identity feature map.
    """
    return np.dot(x, z)

def polynomial_kernel(x: np.ndarray, z: np.ndarray, degree: int = 3, c: float = 1.0) -> float:
    """
    Polynomial kernel: k(x,z) = (x^T z + c)^p
    Corresponds to feature space of dimension choose(d+p, p).
    For d=100, p=5: over 96 million features! Yet computation is O(d).
    """
    return (np.dot(x, z) + c) ** degree

def rbf_kernel(x: np.ndarray, z: np.ndarray, gamma: float = 1.0) -> float:
    """
    RBF (Gaussian) kernel: k(x,z) = exp(-gamma ||x-z||^2)
    Corresponds to INFINITE-dimensional feature space.
    Kernel evaluation is still O(d).
    """
    diff = x - z
    return np.exp(-gamma * np.dot(diff, diff))

# Demonstrate: explicit polynomial features vs kernel trick
def explicit_quadratic_features(x: np.ndarray) -> np.ndarray:
    """
    Explicitly compute quadratic feature map for 2D data.
    phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2)
    """
    assert len(x) == 2
    return np.array([x[0]**2, x[1]**2, np.sqrt(2)*x[0]*x[1]])

# Compare explicit features vs kernel trick
x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

# Method 1: Explicit feature mapping then inner product
phi_x = explicit_quadratic_features(x)
phi_z = explicit_quadratic_features(z)
explicit_result = np.dot(phi_x, phi_z)

# Method 2: Kernel trick (homogeneous polynomial, degree 2)
kernel_result = np.dot(x, z) ** 2

print("Explicit feature mapping approach:")
print(f"  phi(x) = {phi_x}")
print(f"  phi(z) = {phi_z}")
print(f"  <phi(x), phi(z)> = {explicit_result}")

print(f"\nKernel trick approach:")
print(f"  k(x, z) = (x^T z)^2 = {kernel_result}")

print(f"\nResults match: {np.isclose(explicit_result, kernel_result)}")

# Feature space dimensions for polynomial kernels
from math import comb

d = 100  # Original dimension
for p in [2, 3, 5, 10]:
    feature_dim = comb(d + p, p)
    print(f"\nPolynomial degree {p}:")
    print(f"  Feature space dimension: {feature_dim:,}")
    print(f"  Kernel computation: O({d}) = {d} operations")
    print(f"  Explicit features: O({feature_dim:,}) = {feature_dim:,} values to compute")
```

Not every function of two arguments qualifies as a valid kernel. Mercer's theorem provides the definitive characterization of which functions correspond to inner products in some feature space.
Given a function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, when does there exist a Hilbert space $\mathcal{H}$ and a feature map $\phi: \mathcal{X} \to \mathcal{H}$
such that $k(\mathbf{x}, \mathbf{z}) = \langle\phi(\mathbf{x}), \phi(\mathbf{z})\rangle_{\mathcal{H}}$?
Theorem (Mercer, 1909): A symmetric function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a valid kernel (i.e., corresponds to an inner product in some Hilbert space) if and only if the Gram matrix $K$ defined by $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$ is positive semidefinite for any finite collection of points $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$.
A matrix $K$ is positive semidefinite if for all vectors $\mathbf{c} \in \mathbb{R}^n$: $$\mathbf{c}^T K \mathbf{c} = \sum_{i=1}^n \sum_{j=1}^n c_i c_j K_{ij} \geq 0$$
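One direction is immediate: if $k$ really is an inner product of feature vectors, its Gram matrix must be positive semidefinite, because the quadratic form is a squared norm:

$$\mathbf{c}^T K \mathbf{c} = \sum_{i,j} c_i c_j \langle\phi(\mathbf{x}_i), \phi(\mathbf{x}_j)\rangle = \left\langle \sum_i c_i \phi(\mathbf{x}_i), \sum_j c_j \phi(\mathbf{x}_j) \right\rangle = \left\| \sum_i c_i \phi(\mathbf{x}_i) \right\|^2 \geq 0$$

The substantial content of the theorem is the converse: positive semidefiniteness alone is enough to guarantee that some feature space exists.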
In kernel literature, 'positive definite kernel' typically means the Gram matrix is positive semidefinite (≥ 0), not strictly positive definite (> 0). This naming convention can cause confusion. A kernel is called 'strictly positive definite' if the Gram matrix is positive definite (> 0) for any set of distinct points.
Mercer's theorem guarantees several critical properties:
Existence of feature space: Every valid kernel implicitly defines a feature space where the SVM can operate.
Well-posed optimization: The kernel matrix appearing in the dual objective is positive semidefinite, so the dual problem is convex; when the matrix is strictly positive definite, the solution is also unique.
Constructive feature space: The proof is constructive—it shows how to build the feature space (though we never need to do this explicitly).
Verification criterion: We can check kernel validity by testing positive semidefiniteness on sample points.
Mercer's theorem constructs a special Hilbert space called the Reproducing Kernel Hilbert Space (RKHS) associated with kernel $k$. In this space, each point is mapped to a function, $\phi(\mathbf{x}) = k(\mathbf{x}, \cdot)$, and the inner product satisfies the reproducing property $\langle k(\mathbf{x}, \cdot), k(\mathbf{z}, \cdot)\rangle_{\mathcal{H}} = k(\mathbf{x}, \mathbf{z})$.
This abstract construction explains why kernel methods work: we're not doing magic—we're doing geometry in a well-defined (if unusual) space.
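As one concrete illustration of "geometry in feature space using only kernel evaluations" (a sketch; the helper names are illustrative), distances and angles between mapped points follow directly from kernel values, since $\|\phi(\mathbf{x}) - \phi(\mathbf{z})\|^2 = k(\mathbf{x},\mathbf{x}) - 2k(\mathbf{x},\mathbf{z}) + k(\mathbf{z},\mathbf{z})$:

```python
import numpy as np

def rbf(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z)**2))

def feature_space_distance_sq(x, z, kernel):
    """||phi(x) - phi(z)||^2 expanded into inner products = kernel evaluations only."""
    return kernel(x, x) - 2 * kernel(x, z) + kernel(z, z)

def feature_space_cosine(x, z, kernel):
    """Cosine of the angle between phi(x) and phi(z), again from kernel values alone."""
    return kernel(x, z) / np.sqrt(kernel(x, x) * kernel(z, z))

x, z = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(feature_space_distance_sq(x, z, rbf))  # squared distance in the infinite-dimensional RKHS
print(feature_space_cosine(x, z, rbf))       # angle in the RKHS
```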
```python
import numpy as np
from typing import Callable

def is_valid_kernel(kernel_func: Callable, X: np.ndarray, tolerance: float = 1e-10) -> bool:
    """
    Test whether a function is a valid kernel using Mercer's theorem.

    A function k is a valid kernel iff its Gram matrix is positive
    semidefinite for any finite set of points.

    Parameters:
        kernel_func: Function k(x, z) -> float
        X: Sample points, shape (n_samples, n_features)
        tolerance: Numerical tolerance for eigenvalue check

    Returns:
        True if the Gram matrix is positive semidefinite
    """
    n = X.shape[0]

    # Build the Gram matrix
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel_func(X[i], X[j])

    # Check symmetry
    if not np.allclose(K, K.T):
        print("Kernel is not symmetric!")
        return False

    # Check positive semidefiniteness via eigenvalues
    eigenvalues = np.linalg.eigvalsh(K)
    min_eigenvalue = eigenvalues.min()

    if min_eigenvalue < -tolerance:
        print(f"Kernel is NOT positive semidefinite!")
        print(f"Minimum eigenvalue: {min_eigenvalue}")
        return False

    print(f"Kernel is valid (positive semidefinite)")
    print(f"Eigenvalue range: [{eigenvalues.min():.6f}, {eigenvalues.max():.6f}]")
    return True

# Test various kernel functions
np.random.seed(42)
X = np.random.randn(20, 5)

# Valid kernels
print("Testing valid kernels:")
print("=" * 50)

# Linear kernel
print("\n1. Linear kernel: k(x,z) = x^T z")
linear = lambda x, z: np.dot(x, z)
is_valid_kernel(linear, X)

# Polynomial kernel
print("\n2. Polynomial kernel: k(x,z) = (x^T z + 1)^3")
poly = lambda x, z: (np.dot(x, z) + 1) ** 3
is_valid_kernel(poly, X)

# RBF kernel
print("\n3. RBF kernel: k(x,z) = exp(-||x-z||^2)")
rbf = lambda x, z: np.exp(-np.sum((x - z)**2))
is_valid_kernel(rbf, X)

# Invalid "kernel" - not positive semidefinite
print("\n" + "=" * 50)
print("Testing INVALID 'kernel':")
print("\nDistance-based (NOT a valid kernel): k(x,z) = -||x-z||")
invalid = lambda x, z: -np.linalg.norm(x - z)
is_valid_kernel(invalid, X)

print("\nLog kernel (NOT a valid kernel): k(x,z) = log(1 + x^T z)")
# Note: log kernel IS valid for x^T z > 0, but can fail otherwise
log_kernel = lambda x, z: np.log(1 + np.dot(x, z))
is_valid_kernel(log_kernel, X)  # May fail depending on X
```

Let's consolidate the kernelized SVM algorithm, showing every step in terms of kernel evaluations.
Input: Training data $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$, kernel function $k$, regularization $C$
Step 1: Compute the kernel matrix $$K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j) \quad \text{for all } i, j \in \{1, \ldots, n\}$$
Step 2: Solve the quadratic program $$\max_{\boldsymbol{\alpha}} \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j K_{ij}$$ subject to: $\sum_i \alpha_i y_i = 0$, $\;0 \leq \alpha_i \leq C$
Step 3: Identify support vectors $$\mathcal{S} = \{i : \alpha_i > 0\}$$
Step 4: Compute bias using free support vectors $\mathcal{M} = \{s : 0 < \alpha_s < C\}$ $$b = \frac{1}{|\mathcal{M}|} \sum_{s \in \mathcal{M}} \left( y_s - \sum_{i \in \mathcal{S}} \alpha_i y_i K_{is} \right)$$
Output: Store $(\alpha_i, y_i, \mathbf{x}_i)$ for $i \in \mathcal{S}$, kernel function $k$, and bias $b$
Input: New point $\mathbf{x}$, trained model $(\{\alpha_i, y_i, \mathbf{x}_i\}_{i \in \mathcal{S}}, k, b)$
Step 1: Compute kernel values with support vectors $$k_i = k(\mathbf{x}_i, \mathbf{x}) \quad \text{for } i \in \mathcal{S}$$
Step 2: Compute decision function $$f(\mathbf{x}) = \sum_{i \in \mathcal{S}} \alpha_i y_i k_i + b$$
Step 3: Return prediction $$\hat{y} = \text{sign}(f(\mathbf{x}))$$
Note that prediction requires $|\mathcal{S}|$ kernel evaluations, independent of the training set size. For sparse solutions, this is very efficient.
Notice that the entire algorithm—both training and prediction—never computes explicit feature vectors. We work entirely with kernel evaluations. For the RBF kernel, this means we're implicitly operating in an infinite-dimensional space using only finite computation!
```python
import numpy as np
from scipy.optimize import minimize
from typing import Callable, Tuple

class KernelSVM:
    """
    Complete Kernel SVM implementation.

    Demonstrates the kernel trick: all operations use kernel
    evaluations, never explicit feature vectors.
    """

    def __init__(self, kernel: Callable, C: float = 1.0):
        """
        Parameters:
            kernel: Kernel function k(x, z) -> float
            C: Regularization parameter
        """
        self.kernel = kernel
        self.C = C

        # Learned parameters
        self.support_vectors = None
        self.support_labels = None
        self.support_alphas = None
        self.b = None

    def _compute_kernel_matrix(self, X: np.ndarray) -> np.ndarray:
        """Compute the n x n kernel matrix."""
        n = X.shape[0]
        K = np.zeros((n, n))
        for i in range(n):
            for j in range(i, n):
                k_ij = self.kernel(X[i], X[j])
                K[i, j] = k_ij
                K[j, i] = k_ij  # Exploit symmetry
        return K

    def fit(self, X: np.ndarray, y: np.ndarray) -> 'KernelSVM':
        """
        Fit the kernel SVM.

        Parameters:
            X: Training data, shape (n_samples, n_features)
            y: Labels, shape (n_samples,), values in {-1, +1}
        """
        n_samples = X.shape[0]

        # Step 1: Compute kernel matrix
        K = self._compute_kernel_matrix(X)

        # Step 2: Solve dual QP
        # Objective: maximize sum(alpha) - 0.5 * alpha^T Q alpha
        # where Q_ij = y_i y_j K_ij
        Q = np.outer(y, y) * K

        # Convert to minimization: minimize 0.5 * alpha^T Q alpha - sum(alpha)
        def objective(alpha):
            return 0.5 * alpha @ Q @ alpha - np.sum(alpha)

        def gradient(alpha):
            return Q @ alpha - np.ones(n_samples)

        # Constraints: sum(alpha * y) = 0
        constraints = {'type': 'eq', 'fun': lambda a: np.dot(a, y)}

        # Bounds: 0 <= alpha_i <= C
        bounds = [(0, self.C) for _ in range(n_samples)]

        # Solve
        alpha0 = np.zeros(n_samples)
        result = minimize(objective, alpha0, jac=gradient,
                          constraints=constraints, bounds=bounds,
                          method='SLSQP')
        alpha = result.x

        # Step 3: Extract support vectors
        sv_threshold = 1e-5
        sv_mask = alpha > sv_threshold
        self.support_vectors = X[sv_mask]
        self.support_labels = y[sv_mask]
        self.support_alphas = alpha[sv_mask]

        # Step 4: Compute bias from free support vectors
        free_sv_mask = (alpha > sv_threshold) & (alpha < self.C - sv_threshold)
        if np.any(free_sv_mask):
            free_indices = np.where(free_sv_mask)[0]
            b_values = []
            for idx in free_indices:
                # b = y_s - sum_i alpha_i y_i K(x_i, x_s)
                kernel_sum = sum(
                    self.support_alphas[i] * self.support_labels[i] *
                    self.kernel(self.support_vectors[i], X[idx])
                    for i in range(len(self.support_vectors))
                )
                b_values.append(y[idx] - kernel_sum)
            self.b = np.mean(b_values)
        else:
            # Fallback: no free support vectors (all at bound); default bias to 0
            self.b = 0.0

        return self

    def decision_function(self, X: np.ndarray) -> np.ndarray:
        """
        Compute decision function values.

        f(x) = sum_i alpha_i y_i K(x_i, x) + b

        All computation is via kernel evaluations!
        """
        n_test = X.shape[0]
        decision_values = np.zeros(n_test)

        for t in range(n_test):
            for i in range(len(self.support_vectors)):
                # Kernel evaluation: the only data access
                k_val = self.kernel(self.support_vectors[i], X[t])
                decision_values[t] += self.support_alphas[i] * self.support_labels[i] * k_val
            decision_values[t] += self.b

        return decision_values

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict class labels."""
        return np.sign(self.decision_function(X))


# Demonstration: XOR problem (nonlinearly separable)
if __name__ == "__main__":
    np.random.seed(42)

    # XOR-like data
    n_per_class = 50
    X_pp = np.random.randn(n_per_class, 2) * 0.5 + np.array([1, 1])
    X_mm = np.random.randn(n_per_class, 2) * 0.5 + np.array([-1, -1])
    X_pm = np.random.randn(n_per_class, 2) * 0.5 + np.array([1, -1])
    X_mp = np.random.randn(n_per_class, 2) * 0.5 + np.array([-1, 1])

    X = np.vstack([X_pp, X_mm, X_pm, X_mp])
    y = np.array([1]*n_per_class + [1]*n_per_class + [-1]*n_per_class + [-1]*n_per_class)

    # Define RBF kernel
    def rbf_kernel(x, z, gamma=1.0):
        return np.exp(-gamma * np.sum((x - z)**2))

    # Train kernel SVM
    svm = KernelSVM(kernel=rbf_kernel, C=10.0)
    svm.fit(X, y)

    print(f"Training samples: {len(X)}")
    print(f"Support vectors: {len(svm.support_vectors)}")
    print(f"SV fraction: {len(svm.support_vectors)/len(X):.1%}")

    # Test on grid
    xx, yy = np.meshgrid(np.linspace(-3, 3, 50), np.linspace(-3, 3, 50))
    X_grid = np.c_[xx.ravel(), yy.ravel()]
    Z = svm.decision_function(X_grid)

    print(f"\nPrediction complete on {len(X_grid)} points")
    print(f"Decision values range: [{Z.min():.2f}, {Z.max():.2f}]")
```

Understanding the kernel trick geometrically solidifies intuition about why and how it works.
Recall that linear SVM finds a hyperplane that separates classes with maximum margin. When we apply the kernel trick, we're still finding a hyperplane—but in the high-dimensional feature space $\mathcal{H}$.
The hyperplane in $\mathcal{H}$ is: $$\langle\mathbf{w}, \phi(\mathbf{x})\rangle + b = 0$$
where $\mathbf{w} \in \mathcal{H}$ is the weight vector in feature space.
The representer theorem tells us: $$\mathbf{w} = \sum_{i=1}^n \alpha_i y_i \phi(\mathbf{x}_i)$$
So the decision function becomes: $$f(\mathbf{x}) = \left\langle \sum_i \alpha_i y_i \phi(\mathbf{x}_i), \phi(\mathbf{x}) \right\rangle + b = \sum_i \alpha_i y_i k(\mathbf{x}_i, \mathbf{x}) + b$$
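We can verify this identity numerically with the explicit quadratic map from earlier (a sketch with illustrative names and arbitrary $\alpha$ values): building $\mathbf{w} = \sum_i \alpha_i y_i \phi(\mathbf{x}_i)$ explicitly and taking $\langle\mathbf{w}, \phi(\mathbf{x})\rangle$ matches the kernel-only computation exactly.

```python
import numpy as np

def phi(x):
    """Explicit homogeneous quadratic map for 2D inputs."""
    return np.array([x[0]**2, x[1]**2, np.sqrt(2)*x[0]*x[1]])

def k(x, z):
    """Matching kernel: k(x, z) = (x^T z)^2."""
    return np.dot(x, z) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))               # stand-in "support vectors"
y = np.array([1, -1, 1, 1, -1])
alpha = rng.uniform(0.1, 1.0, size=5)     # arbitrary illustrative multipliers
x_new = np.array([0.5, -2.0])

w = sum(a * yi * phi(xi) for a, yi, xi in zip(alpha, y, X))              # weight vector in feature space
f_explicit = np.dot(w, phi(x_new))                                       # <w, phi(x)>
f_kernel = sum(a * yi * k(xi, x_new) for a, yi, xi in zip(alpha, y, X))  # kernel evaluations only
print(f_explicit, f_kernel, np.isclose(f_explicit, f_kernel))
```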
A linear decision boundary in high-dimensional feature space corresponds to a nonlinear decision boundary in the original input space. By choosing different kernels (different implicit feature maps), we control the complexity and shape of the decision boundary.
Consider data on the unit circle in 2D: $\mathbf{x} = (\cos\theta, \sin\theta)$ for various $\theta$. Assign class +1 to points with $x_1 x_2 > 0$ (the arcs in the first and third quadrants) and class -1 to points with $x_1 x_2 < 0$ (the arcs in the second and fourth quadrants).
No line separates these classes in 2D. But apply the quadratic feature map: $$\phi(x_1, x_2) = (x_1^2, \sqrt{2}x_1 x_2, x_2^2)$$
For a point on the unit circle: $$\phi(\cos\theta, \sin\theta) = (\cos^2\theta, \sqrt{2}\cos\theta\sin\theta, \sin^2\theta)$$
Using trigonometric identities: $$= \left(\frac{1 + \cos 2\theta}{2}, \frac{\sqrt{2}\sin 2\theta}{2}, \frac{1 - \cos 2\theta}{2}\right)$$
Remarkably, all points map onto a plane in 3D (the first and third coordinates sum to 1, since $x_1^2 + x_2^2 = 1$). Within that plane, the two classes are separated by the sign of the middle coordinate $\sqrt{2}\cos\theta\sin\theta = \tfrac{\sqrt{2}}{2}\sin 2\theta$: a single hyperplane in the 3D feature space, which corresponds to a nonlinear boundary back in the original 2D space.
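A short sketch (illustrative names) confirms the picture: with classes given by the sign of $x_1 x_2$, the mapped points all lie on a plane in 3D, and the middle feature coordinate separates the classes perfectly.

```python
import numpy as np

theta = np.linspace(0, 2*np.pi, 200, endpoint=False)
X = np.column_stack([np.cos(theta), np.sin(theta)])   # points on the unit circle
y = np.sign(X[:, 0] * X[:, 1])                        # +1 in quadrants 1 and 3, -1 in 2 and 4
mask = y != 0                                         # drop points exactly on the axes
X, y = X[mask], y[mask]

# Quadratic feature map phi(x1, x2) = (x1^2, sqrt(2)*x1*x2, x2^2)
Phi = np.column_stack([X[:, 0]**2, np.sqrt(2)*X[:, 0]*X[:, 1], X[:, 1]**2])

print(np.allclose(Phi[:, 0] + Phi[:, 2], 1.0))        # all mapped points lie on a plane in 3D
print(np.all(np.sign(Phi[:, 1]) == y))                # the middle coordinate separates the classes
```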
Different kernels produce different decision boundary shapes:
| Kernel | Decision Boundary Shape | Flexibility | Risk |
|---|---|---|---|
| Linear | Straight lines/hyperplanes | Low | Underfitting on nonlinear data |
| Polynomial (low degree) | Smooth curves (conics, cubics) | Medium | May underfit complex patterns |
| Polynomial (high degree) | Complex polynomial curves | High | Potential overfitting, numerical issues |
| RBF (large $\gamma$) | Highly local, irregular boundaries | Very high | Overfitting, memorization |
| RBF (small $\gamma$) | Smooth, slowly varying boundaries | Medium | Underfitting if too smooth |
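The sketch below (assuming scikit-learn is available) illustrates the RBF rows of the table on noisy two-moons data: a very large $\gamma$ pushes training accuracy toward 100% with a large fraction of points becoming support vectors (memorization), while a small $\gamma$ produces a much smoother, possibly underfit boundary.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)

for gamma in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)
    train_acc = clf.score(X, y)                      # training accuracy (overfitting indicator)
    n_sv = clf.support_vectors_.shape[0]             # model complexity indicator
    print(f"gamma={gamma:6}: train accuracy={train_acc:.3f}, support vectors={n_sv}/{len(X)}")
```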
Let's analyze the computational requirements of kernel SVMs to understand when they're practical.
Kernel matrix computation: $O(n^2 \cdot T_k)$, where $T_k$ is the cost of one kernel evaluation (typically $O(d)$ for standard kernels)
Solving the QP: depends on the algorithm—general-purpose QP solvers scale roughly cubically in $n$, while decomposition methods such as SMO work on a few variables at a time and are much faster in practice
Overall training: roughly $O(n^2 \cdot d + n^2 \cdot n_{iter})$ for SMO with standard kernels
Per test point: $O(n_{sv} \cdot T_k)$
For $m$ test points: $O(m \cdot n_{sv} \cdot d)$
The $O(n^2)$ kernel matrix is the main scalability limitation. For $n = 100,000$ samples, the matrix requires 80 GB (double precision). Approximation methods (Nyström, random Fourier features) address this by avoiding full kernel matrix computation—covered in Module 5.
| Dataset Size | Kernel Matrix Size | Feasibility | Recommendation |
|---|---|---|---|
| $n \leq 1,000$ | < 8 MB | Easy | Standard kernel SVM with any solver |
| $n \sim 10,000$ | < 800 MB | Moderate | SMO-based solvers (LIBSVM) |
| $n \sim 50,000$ | < 20 GB | Challenging | Sub-sampling or approximations |
| $n \sim 100,000$ | < 80 GB | Difficult | Nyström/RFF approximations required |
| $n > 100,000$ | > 80 GB | Prohibitive | Use approximate methods or linear SVM |
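The memory figures in the table are simple arithmetic (a quick sketch): an $n \times n$ matrix of 8-byte doubles needs $8n^2$ bytes.

```python
for n in [1_000, 10_000, 50_000, 100_000]:
    bytes_needed = 8 * n * n  # 8 bytes per double-precision entry
    print(f"n={n:>7,}: kernel matrix needs {bytes_needed / 1e6:,.0f} MB")
```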
Different kernels have different computational costs:
| Kernel | Evaluation Cost | Notes |
|---|---|---|
| Linear | $O(d)$ | Just dot product |
| Polynomial | $O(d)$ | Dot product + power |
| RBF | $O(d)$ | Difference + squared norm + exp |
| String kernels | $O(L^2)$ or worse | $L$ = string length; often expensive |
| Graph kernels | Problem-dependent | Can be very expensive |
For specialized kernels (strings, graphs, etc.), the kernel evaluation cost can dominate. In such cases, precomputing and caching the kernel matrix (if it fits in memory) dramatically speeds up training.
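For example, scikit-learn's SVC accepts a precomputed Gram matrix directly via kernel='precomputed'. The sketch below uses a hypothetical stand-in `my_expensive_kernel` (here just an RBF so the example runs) in place of a genuinely expensive string or graph kernel.

```python
import numpy as np
from sklearn.svm import SVC

def my_expensive_kernel(a, b):
    """Stand-in for an expensive kernel (string/graph kernels in practice)."""
    return np.exp(-0.5 * np.sum((a - b) ** 2))

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(60, 4)), rng.integers(0, 2, size=60)
X_test = rng.normal(size=(10, 4))

# Precompute Gram matrices once: train x train for fitting, test x train for prediction
K_train = np.array([[my_expensive_kernel(a, b) for b in X_train] for a in X_train])
K_test = np.array([[my_expensive_kernel(a, b) for b in X_train] for a in X_test])

clf = SVC(kernel="precomputed", C=1.0).fit(K_train, y_train)
print(clf.predict(K_test))  # predictions use only the cached kernel values
```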
We've now fully understood the kernel trick—the substitution that transforms SVMs from linear classifiers into powerful nonlinear learning machines.
In the next page, we'll explore the most important kernel functions in detail: linear, polynomial, RBF, and sigmoid kernels. We'll understand their properties, how to choose between them, and their hyperparameters' effects on the decision boundary.
You now understand the kernel substitution—the mechanism that enables nonlinear SVMs. This single conceptual shift, replacing inner products with kernel functions, unlocks the full power of kernel methods across machine learning.