In the previous page, we established that the dual SVM formulation operates entirely through inner products between data points. This structural property opens a profound possibility: what if we replace these inner products with something more powerful?
The kernel trick is precisely this substitution. Instead of computing the standard Euclidean inner product $\mathbf{x}_i^T\mathbf{x}_j$, we compute a kernel function $k(\mathbf{x}_i, \mathbf{x}_j)$ that corresponds to an inner product in some (possibly very high-dimensional or infinite-dimensional) feature space. The SVM algorithm proceeds exactly as before, but now operates implicitly in this richer space.
This simple substitution transforms SVMs from linear classifiers—limited to hyperplane decision boundaries—into flexible nonlinear classifiers: with a suitable kernel (such as the RBF kernel), they can represent arbitrarily complex decision boundaries.
By the end of this page, you will understand: (1) the precise mechanics of kernel substitution in SVM training and prediction, (2) how kernel functions correspond to implicit feature mappings, (3) the mathematical justification via Mercer's theorem, and (4) the computational advantages of working with kernels instead of explicit features.
Let's formalize the kernel substitution and understand exactly what changes—and what stays the same.
Recall the dual SVM problem:
$$\max_{\boldsymbol{\alpha}} \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \underbrace{\mathbf{x}_i^T\mathbf{x}_j}_{\text{inner product}}$$
subject to $\sum_i \alpha_i y_i = 0$ and $0 \leq \alpha_i \leq C$.
The prediction function is: $$f(\mathbf{x}) = \sum_{i=1}^n \alpha_i y_i \underbrace{\mathbf{x}_i^T\mathbf{x}}_{\text{inner product}} + b$$
Every appearance of the data is through inner products.
Now suppose we have a feature map $\phi: \mathbb{R}^d \to \mathcal{H}$ that maps data points to a (possibly infinite-dimensional) Hilbert space $\mathcal{H}$. If we apply this map and run SVM in the new space:
$$\max_{\boldsymbol{\alpha}} \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \underbrace{\langle\phi(\mathbf{x}_i), \phi(\mathbf{x}_j)\rangle}_{\text{inner product in }\mathcal{H}}$$
Define the kernel function: $$k(\mathbf{x}, \mathbf{z}) = \langle\phi(\mathbf{x}), \phi(\mathbf{z})\rangle$$
The kernel function computes the inner product in feature space directly from the original data—without explicitly computing $\phi(\mathbf{x})$.
The kernelized dual becomes: $$\max_{\boldsymbol{\alpha}} \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \, k(\mathbf{x}_i, \mathbf{x}_j)$$
And the prediction function: $$f(\mathbf{x}) = \sum_{i=1}^n \alpha_i y_i \, k(\mathbf{x}_i, \mathbf{x}) + b$$
The kernel trick is remarkably simple: everywhere we see an inner product $\mathbf{x}^T\mathbf{z}$, we substitute the kernel $k(\mathbf{x}, \mathbf{z})$. The optimization algorithm, the constraints, the solution structure—everything else remains unchanged. Yet this simple substitution enables nonlinear decision boundaries of arbitrary complexity.
| Component | Linear SVM | Kernelized SVM |
|---|---|---|
| Dual objective | $\sum_i \alpha_i - \frac{1}{2}\sum_{ij} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j$ | $\sum_i \alpha_i - \frac{1}{2}\sum_{ij} \alpha_i \alpha_j y_i y_j k(\mathbf{x}_i, \mathbf{x}_j)$ |
| Constraints | $\sum_i \alpha_i y_i = 0$, $0 \leq \alpha_i \leq C$ | Identical |
| Gram matrix | $K_{ij} = \mathbf{x}_i^T\mathbf{x}_j$ | $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$ |
| Prediction | $f(\mathbf{x}) = \sum_i \alpha_i y_i \mathbf{x}_i^T\mathbf{x} + b$ | $f(\mathbf{x}) = \sum_i \alpha_i y_i k(\mathbf{x}_i, \mathbf{x}) + b$ |
| Decision boundary | Hyperplane in $\mathbb{R}^d$ | Hyperplane in $\mathcal{H}$ → nonlinear in $\mathbb{R}^d$ |
| Feature dimension | Must be finite, explicit | Can be infinite, implicit |
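To make the comparison concrete, here is a minimal sketch of the point the table makes: the dual objective code is identical in both cases, and only the Gram matrix changes. The helper names (`gram_matrix`, `dual_objective`) are illustrative, not from any library.

```python
import numpy as np

def gram_matrix(X, kernel):
    """Build K with K[i, j] = kernel(x_i, x_j); the linear case is just a dot product."""
    n = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

def dual_objective(alpha, y, K):
    """Dual SVM objective: sum(alpha) - 0.5 * sum_ij alpha_i alpha_j y_i y_j K_ij."""
    return alpha.sum() - 0.5 * (alpha * y) @ K @ (alpha * y)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
y = np.sign(rng.normal(size=10))
alpha = rng.uniform(size=10)

K_linear = gram_matrix(X, lambda x, z: x @ z)                     # linear SVM
K_rbf = gram_matrix(X, lambda x, z: np.exp(-np.sum((x - z)**2)))  # kernelized SVM

# Same objective code, different Gram matrix -- that is the entire kernel trick.
print(dual_objective(alpha, y, K_linear), dual_objective(alpha, y, K_rbf))
```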
To build intuition, let's examine several feature maps and their corresponding kernels. This illustrates how complex feature spaces can be accessed through simple kernel functions.
Consider 2D data $\mathbf{x} = (x_1, x_2)$. Define the feature map: $$\phi(\mathbf{x}) = (x_1^2, x_2^2, \sqrt{2}x_1 x_2)$$
This maps to 3D. The feature space inner product is: $$\langle\phi(\mathbf{x}), \phi(\mathbf{z})\rangle = x_1^2 z_1^2 + x_2^2 z_2^2 + 2x_1 x_2 z_1 z_2$$
Recognize this as $(x_1 z_1 + x_2 z_2)^2 = (\mathbf{x}^T\mathbf{z})^2$.
Kernel: $k(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T\mathbf{z})^2$
We can compute this quadratic kernel in $O(d)$ time—the same as the linear kernel—yet we're implicitly operating in a space of $O(d^2)$ features.
Consider a slightly different feature map for 2D data: $$\phi(\mathbf{x}) = (x_1^2, x_2^2, \sqrt{2}x_1 x_2, \sqrt{2c}x_1, \sqrt{2c}x_2, c)$$
This maps to 6D (for general $d$, the dimension is $\binom{d+2}{2}$). The inner product works out to: $$\langle\phi(\mathbf{x}), \phi(\mathbf{z})\rangle = (\mathbf{x}^T\mathbf{z} + c)^2$$
Kernel: $k(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T\mathbf{z} + c)^2$
More generally, the degree-$p$ polynomial kernel is: $$k(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T\mathbf{z} + c)^p$$
This corresponds to a feature space of dimension $\binom{d+p}{p}$, which grows combinatorially with $d$ and $p$. For $d=100$ and $p=5$, the explicit feature space has over 96 million dimensions—yet kernel evaluation takes just $O(d)$ time!
The parameter $c$ in $(\mathbf{x}^T\mathbf{z} + c)^p$ controls the relative weighting of lower vs higher degree terms. When $c = 0$, only degree-$p$ terms appear. When $c > 0$, monomials of all degrees up to $p$ are included. This provides flexibility in matching the complexity of the decision boundary to the data.
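As a quick numerical check (a sketch; `phi_quadratic` is an illustrative name), the 6-dimensional map above does reproduce $(\mathbf{x}^T\mathbf{z} + c)^2$:

```python
import numpy as np

def phi_quadratic(x, c=1.0):
    """Explicit feature map for the inhomogeneous quadratic kernel in 2D:
    phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2, sqrt(2c)*x1, sqrt(2c)*x2, c)."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2)*x1*x2,
                     np.sqrt(2*c)*x1, np.sqrt(2*c)*x2, c])

x, z, c = np.array([1.0, 2.0]), np.array([3.0, -1.0]), 1.0
explicit = np.dot(phi_quadratic(x, c), phi_quadratic(z, c))  # inner product in 6D
via_kernel = (np.dot(x, z) + c) ** 2                         # O(d) kernel evaluation
print(explicit, via_kernel, np.isclose(explicit, via_kernel))
```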
The most powerful example: the Radial Basis Function (RBF) or Gaussian kernel: $$k(\mathbf{x}, \mathbf{z}) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{z}\|^2}{2\sigma^2}\right) = \exp\left(-\gamma\|\mathbf{x} - \mathbf{z}\|^2\right)$$
where $\gamma = 1/(2\sigma^2)$ is a common reparameterization.
The corresponding feature map has infinite dimensions.
To see why, write the kernel as $$k(\mathbf{x}, \mathbf{z}) = \exp(-\gamma\|\mathbf{x}\|^2)\,\exp(-\gamma\|\mathbf{z}\|^2)\,\exp(2\gamma\,\mathbf{x}^T\mathbf{z})$$ and Taylor-expand the last factor: $$\exp(2\gamma\,\mathbf{x}^T\mathbf{z}) = \sum_{k=0}^{\infty} \frac{(2\gamma\,\mathbf{x}^T\mathbf{z})^k}{k!}$$ Each term $(\mathbf{x}^T\mathbf{z})^k$ is a polynomial kernel of degree $k$, contributing all monomial features of that degree.
The Gaussian kernel can be shown to correspond to a feature space with one dimension for every possible monomial of every degree. This is a genuine infinite-dimensional space—impossible to represent explicitly, yet perfectly accessible through the kernel.
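The sketch below makes this concrete in one dimension (all names are illustrative): truncating the expansion at degree $K$ gives an explicit, finite feature map whose inner product converges to the Gaussian kernel value as $K$ grows.

```python
import numpy as np
from math import factorial

def phi_rbf_truncated(x, gamma=0.5, K=10):
    """First K+1 coordinates of the (infinite) Gaussian-kernel feature map for scalar x:
    phi_k(x) = exp(-gamma*x^2) * sqrt((2*gamma)^k / k!) * x^k."""
    return np.array([np.exp(-gamma * x**2) * np.sqrt((2*gamma)**k / factorial(k)) * x**k
                     for k in range(K + 1)])

x, z, gamma = 0.8, -0.3, 0.5
exact = np.exp(-gamma * (x - z)**2)
for K in [1, 3, 5, 10, 20]:
    approx = np.dot(phi_rbf_truncated(x, gamma, K), phi_rbf_truncated(z, gamma, K))
    print(f"K={K:2d}: truncated feature inner product = {approx:.8f} (exact kernel = {exact:.8f})")
```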
This is the profound power of kernels: they allow us to operate in infinite-dimensional spaces with finite computation.
```python
import numpy as np

def linear_kernel(x: np.ndarray, z: np.ndarray) -> float:
    """
    Linear kernel: k(x,z) = x^T z
    Corresponds to the identity feature map.
    """
    return np.dot(x, z)

def polynomial_kernel(x: np.ndarray, z: np.ndarray, degree: int = 3, c: float = 1.0) -> float:
    """
    Polynomial kernel: k(x,z) = (x^T z + c)^p
    Corresponds to feature space of dimension choose(d+p, p).
    For d=100, p=5: over 96 million features! Yet computation is O(d).
    """
    return (np.dot(x, z) + c) ** degree

def rbf_kernel(x: np.ndarray, z: np.ndarray, gamma: float = 1.0) -> float:
    """
    RBF (Gaussian) kernel: k(x,z) = exp(-gamma ||x-z||^2)
    Corresponds to INFINITE-dimensional feature space.
    Kernel evaluation is still O(d).
    """
    diff = x - z
    return np.exp(-gamma * np.dot(diff, diff))

# Demonstrate: explicit polynomial features vs kernel trick
def explicit_quadratic_features(x: np.ndarray) -> np.ndarray:
    """
    Explicitly compute quadratic feature map for 2D data.
    phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2)
    """
    assert len(x) == 2
    return np.array([x[0]**2, x[1]**2, np.sqrt(2)*x[0]*x[1]])

# Compare explicit features vs kernel trick
x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

# Method 1: Explicit feature mapping then inner product
phi_x = explicit_quadratic_features(x)
phi_z = explicit_quadratic_features(z)
explicit_result = np.dot(phi_x, phi_z)

# Method 2: Kernel trick (homogeneous polynomial, degree 2)
kernel_result = np.dot(x, z) ** 2

print("Explicit feature mapping approach:")
print(f"  phi(x) = {phi_x}")
print(f"  phi(z) = {phi_z}")
print(f"  <phi(x), phi(z)> = {explicit_result}")

print(f"\nKernel trick approach:")
print(f"  k(x, z) = (x^T z)^2 = {kernel_result}")

print(f"\nResults match: {np.isclose(explicit_result, kernel_result)}")

# Feature space dimensions for polynomial kernels
from math import comb

d = 100  # Original dimension
for p in [2, 3, 5, 10]:
    feature_dim = comb(d + p, p)
    print(f"\nPolynomial degree {p}:")
    print(f"  Feature space dimension: {feature_dim:,}")
    print(f"  Kernel computation: O({d}) = {d} operations")
    print(f"  Explicit features: O({feature_dim:,}) = {feature_dim:,} values to compute")
```

Not every function of two arguments qualifies as a valid kernel. Mercer's theorem provides the definitive characterization of which functions correspond to inner products in some feature space.
Given a function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, when does there exist a Hilbert space $\mathcal{H}$ and a feature map $\phi: \mathcal{X} \to \mathcal{H}$
such that $k(\mathbf{x}, \mathbf{z}) = \langle\phi(\mathbf{x}), \phi(\mathbf{z})\rangle_{\mathcal{H}}$?
Theorem (Mercer, 1909): A symmetric function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a valid kernel (i.e., corresponds to an inner product in some Hilbert space) if and only if the Gram matrix $K$ defined by $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$ is positive semidefinite for any finite collection of points $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$.
A matrix $K$ is positive semidefinite if for all vectors $\mathbf{c} \in \mathbb{R}^n$: $$\mathbf{c}^T K \mathbf{c} = \sum_{i=1}^n \sum_{j=1}^n c_i c_j K_{ij} \geq 0$$
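One direction is immediate: if $k$ really is an inner product of feature vectors, its Gram matrix must be positive semidefinite, because the quadratic form is a squared norm:

$$\mathbf{c}^T K \mathbf{c} = \sum_{i,j} c_i c_j \langle\phi(\mathbf{x}_i), \phi(\mathbf{x}_j)\rangle = \left\langle \sum_i c_i \phi(\mathbf{x}_i), \sum_j c_j \phi(\mathbf{x}_j) \right\rangle = \left\| \sum_i c_i \phi(\mathbf{x}_i) \right\|^2 \geq 0$$

The substantial content of the theorem is the converse: positive semidefiniteness alone is enough to guarantee that some feature space exists.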
In kernel literature, 'positive definite kernel' typically means the Gram matrix is positive semidefinite (≥ 0), not strictly positive definite (> 0). This naming convention can cause confusion. A kernel is called 'strictly positive definite' if the Gram matrix is positive definite (> 0) for any set of distinct points.
Mercer's theorem guarantees several critical properties:
Existence of feature space: Every valid kernel implicitly defines a feature space where the SVM can operate.
Well-posed optimization: The kernel matrix appearing in the dual objective is positive semidefinite, so the dual problem is convex; when the matrix is strictly positive definite, the solution is also unique.
Constructive feature space: The proof is constructive—it shows how to build the feature space (though we never need to do this explicitly).
Verification criterion: We can check kernel validity by testing positive semidefiniteness on sample points.
Mercer's theorem constructs a special Hilbert space called the Reproducing Kernel Hilbert Space (RKHS) associated with kernel $k$. In this space, each point is mapped to a function, $\phi(\mathbf{x}) = k(\mathbf{x}, \cdot)$, and the inner product satisfies the reproducing property $\langle k(\mathbf{x}, \cdot), k(\mathbf{z}, \cdot)\rangle_{\mathcal{H}} = k(\mathbf{x}, \mathbf{z})$.
This abstract construction explains why kernel methods work: we're not doing magic—we're doing geometry in a well-defined (if unusual) space.
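As one concrete illustration of "geometry in feature space using only kernel evaluations" (a sketch; the helper names are illustrative), distances and angles between mapped points follow directly from kernel values, since $\|\phi(\mathbf{x}) - \phi(\mathbf{z})\|^2 = k(\mathbf{x},\mathbf{x}) - 2k(\mathbf{x},\mathbf{z}) + k(\mathbf{z},\mathbf{z})$:

```python
import numpy as np

def rbf(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z)**2))

def feature_space_distance_sq(x, z, kernel):
    """||phi(x) - phi(z)||^2 expanded into inner products = kernel evaluations only."""
    return kernel(x, x) - 2 * kernel(x, z) + kernel(z, z)

def feature_space_cosine(x, z, kernel):
    """Cosine of the angle between phi(x) and phi(z), again from kernel values alone."""
    return kernel(x, z) / np.sqrt(kernel(x, x) * kernel(z, z))

x, z = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(feature_space_distance_sq(x, z, rbf))  # squared distance in the infinite-dimensional RKHS
print(feature_space_cosine(x, z, rbf))       # angle in the RKHS
```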
```python
import numpy as np
from typing import Callable

def is_valid_kernel(kernel_func: Callable, X: np.ndarray, tolerance: float = 1e-10) -> bool:
    """
    Test whether a function is a valid kernel using Mercer's theorem.

    A function k is a valid kernel iff its Gram matrix is positive
    semidefinite for any finite set of points.

    Parameters:
        kernel_func: Function k(x, z) -> float
        X: Sample points, shape (n_samples, n_features)
        tolerance: Numerical tolerance for eigenvalue check

    Returns:
        True if the Gram matrix is positive semidefinite
    """
    n = X.shape[0]

    # Build the Gram matrix
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel_func(X[i], X[j])

    # Check symmetry
    if not np.allclose(K, K.T):
        print("Kernel is not symmetric!")
        return False

    # Check positive semidefiniteness via eigenvalues
    eigenvalues = np.linalg.eigvalsh(K)
    min_eigenvalue = eigenvalues.min()

    if min_eigenvalue < -tolerance:
        print(f"Kernel is NOT positive semidefinite!")
        print(f"Minimum eigenvalue: {min_eigenvalue}")
        return False

    print(f"Kernel is valid (positive semidefinite)")
    print(f"Eigenvalue range: [{eigenvalues.min():.6f}, {eigenvalues.max():.6f}]")
    return True

# Test various kernel functions
np.random.seed(42)
X = np.random.randn(20, 5)

# Valid kernels
print("Testing valid kernels:")
print("=" * 50)

# Linear kernel
print("\n1. Linear kernel: k(x,z) = x^T z")
linear = lambda x, z: np.dot(x, z)
is_valid_kernel(linear, X)

# Polynomial kernel
print("\n2. Polynomial kernel: k(x,z) = (x^T z + 1)^3")
poly = lambda x, z: (np.dot(x, z) + 1) ** 3
is_valid_kernel(poly, X)

# RBF kernel
print("\n3. RBF kernel: k(x,z) = exp(-||x-z||^2)")
rbf = lambda x, z: np.exp(-np.sum((x - z)**2))
is_valid_kernel(rbf, X)

# Invalid "kernel" - not positive semidefinite
print("\n" + "=" * 50)
print("Testing INVALID 'kernel':")
print("\nDistance-based (NOT a valid kernel): k(x,z) = -||x-z||")
invalid = lambda x, z: -np.linalg.norm(x - z)
is_valid_kernel(invalid, X)

print("\nLog kernel (NOT a valid kernel): k(x,z) = log(1 + x^T z)")
# Note: log kernel IS valid for x^T z > 0, but can fail otherwise
log_kernel = lambda x, z: np.log(1 + np.dot(x, z))
is_valid_kernel(log_kernel, X)  # May fail depending on X
```

Let's consolidate the kernelized SVM algorithm, showing every step in terms of kernel evaluations.
Input: Training data $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$, kernel function $k$, regularization $C$
Step 1: Compute the kernel matrix $$K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j) \quad \text{for all } i, j \in \{1, \ldots, n\}$$
Step 2: Solve the quadratic program $$\max_{\boldsymbol{\alpha}} \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j K_{ij}$$ subject to: $\sum_i \alpha_i y_i = 0$, $\;0 \leq \alpha_i \leq C$
Step 3: Identify support vectors $$\mathcal{S} = \{i : \alpha_i > 0\}$$
Step 4: Compute bias using free support vectors $\mathcal{M} = \{s : 0 < \alpha_s < C\}$ $$b = \frac{1}{|\mathcal{M}|} \sum_{s \in \mathcal{M}} \left( y_s - \sum_{i \in \mathcal{S}} \alpha_i y_i K_{is} \right)$$
Output: Store $(\alpha_i, y_i, \mathbf{x}_i)$ for $i \in \mathcal{S}$, kernel function $k$, and bias $b$
Input: New point $\mathbf{x}$, trained model $(\{\alpha_i, y_i, \mathbf{x}_i\}_{i \in \mathcal{S}}, k, b)$
Step 1: Compute kernel values with support vectors $$k_i = k(\mathbf{x}_i, \mathbf{x}) \quad \text{for } i \in \mathcal{S}$$
Step 2: Compute decision function $$f(\mathbf{x}) = \sum_{i \in \mathcal{S}} \alpha_i y_i k_i + b$$
Step 3: Return prediction $$\hat{y} = \text{sign}(f(\mathbf{x}))$$
Note that prediction requires $|\mathcal{S}|$ kernel evaluations, independent of the training set size. For sparse solutions, this is very efficient.
Notice that the entire algorithm—both training and prediction—never computes explicit feature vectors. We work entirely with kernel evaluations. For the RBF kernel, this means we're implicitly operating in an infinite-dimensional space using only finite computation!
```python
import numpy as np
from scipy.optimize import minimize
from typing import Callable, Tuple

class KernelSVM:
    """
    Complete Kernel SVM implementation.

    Demonstrates the kernel trick: all operations use kernel
    evaluations, never explicit feature vectors.
    """

    def __init__(self, kernel: Callable, C: float = 1.0):
        """
        Parameters:
            kernel: Kernel function k(x, z) -> float
            C: Regularization parameter
        """
        self.kernel = kernel
        self.C = C

        # Learned parameters
        self.support_vectors = None
        self.support_labels = None
        self.support_alphas = None
        self.b = None

    def _compute_kernel_matrix(self, X: np.ndarray) -> np.ndarray:
        """Compute the n x n kernel matrix."""
        n = X.shape[0]
        K = np.zeros((n, n))
        for i in range(n):
            for j in range(i, n):
                k_ij = self.kernel(X[i], X[j])
                K[i, j] = k_ij
                K[j, i] = k_ij  # Exploit symmetry
        return K

    def fit(self, X: np.ndarray, y: np.ndarray) -> 'KernelSVM':
        """
        Fit the kernel SVM.

        Parameters:
            X: Training data, shape (n_samples, n_features)
            y: Labels, shape (n_samples,), values in {-1, +1}
        """
        n_samples = X.shape[0]

        # Step 1: Compute kernel matrix
        K = self._compute_kernel_matrix(X)

        # Step 2: Solve dual QP
        # Objective: maximize sum(alpha) - 0.5 * alpha^T Q alpha
        # where Q_ij = y_i y_j K_ij
        Q = np.outer(y, y) * K

        # Convert to minimization: minimize 0.5 * alpha^T Q alpha - sum(alpha)
        def objective(alpha):
            return 0.5 * alpha @ Q @ alpha - np.sum(alpha)

        def gradient(alpha):
            return Q @ alpha - np.ones(n_samples)

        # Constraints: sum(alpha * y) = 0
        constraints = {'type': 'eq', 'fun': lambda a: np.dot(a, y)}

        # Bounds: 0 <= alpha_i <= C
        bounds = [(0, self.C) for _ in range(n_samples)]

        # Solve
        alpha0 = np.zeros(n_samples)
        result = minimize(objective, alpha0, jac=gradient,
                          constraints=constraints, bounds=bounds,
                          method='SLSQP')
        alpha = result.x

        # Step 3: Extract support vectors
        sv_threshold = 1e-5
        sv_mask = alpha > sv_threshold
        self.support_vectors = X[sv_mask]
        self.support_labels = y[sv_mask]
        self.support_alphas = alpha[sv_mask]

        # Step 4: Compute bias from free support vectors
        free_sv_mask = (alpha > sv_threshold) & (alpha < self.C - sv_threshold)
        if np.any(free_sv_mask):
            free_indices = np.where(free_sv_mask)[0]
            b_values = []
            for idx in free_indices:
                # b = y_s - sum_i alpha_i y_i K(x_i, x_s)
                kernel_sum = sum(
                    self.support_alphas[i] * self.support_labels[i] *
                    self.kernel(self.support_vectors[i], X[idx])
                    for i in range(len(self.support_vectors))
                )
                b_values.append(y[idx] - kernel_sum)
            self.b = np.mean(b_values)
        else:
            # Fallback: no free support vectors (all at bound); default bias to 0
            self.b = 0.0

        return self

    def decision_function(self, X: np.ndarray) -> np.ndarray:
        """
        Compute decision function values.

        f(x) = sum_i alpha_i y_i K(x_i, x) + b

        All computation is via kernel evaluations!
        """
        n_test = X.shape[0]
        decision_values = np.zeros(n_test)

        for t in range(n_test):
            for i in range(len(self.support_vectors)):
                # Kernel evaluation: the only data access
                k_val = self.kernel(self.support_vectors[i], X[t])
                decision_values[t] += self.support_alphas[i] * self.support_labels[i] * k_val
            decision_values[t] += self.b

        return decision_values

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict class labels."""
        return np.sign(self.decision_function(X))


# Demonstration: XOR problem (nonlinearly separable)
if __name__ == "__main__":
    np.random.seed(42)

    # XOR-like data
    n_per_class = 50
    X_pp = np.random.randn(n_per_class, 2) * 0.5 + np.array([1, 1])
    X_mm = np.random.randn(n_per_class, 2) * 0.5 + np.array([-1, -1])
    X_pm = np.random.randn(n_per_class, 2) * 0.5 + np.array([1, -1])
    X_mp = np.random.randn(n_per_class, 2) * 0.5 + np.array([-1, 1])

    X = np.vstack([X_pp, X_mm, X_pm, X_mp])
    y = np.array([1]*n_per_class + [1]*n_per_class + [-1]*n_per_class + [-1]*n_per_class)

    # Define RBF kernel
    def rbf_kernel(x, z, gamma=1.0):
        return np.exp(-gamma * np.sum((x - z)**2))

    # Train kernel SVM
    svm = KernelSVM(kernel=rbf_kernel, C=10.0)
    svm.fit(X, y)

    print(f"Training samples: {len(X)}")
    print(f"Support vectors: {len(svm.support_vectors)}")
    print(f"SV fraction: {len(svm.support_vectors)/len(X):.1%}")

    # Test on grid
    xx, yy = np.meshgrid(np.linspace(-3, 3, 50), np.linspace(-3, 3, 50))
    X_grid = np.c_[xx.ravel(), yy.ravel()]
    Z = svm.decision_function(X_grid)

    print(f"\nPrediction complete on {len(X_grid)} points")
    print(f"Decision values range: [{Z.min():.2f}, {Z.max():.2f}]")
```

Understanding the kernel trick geometrically solidifies intuition about why and how it works.
Recall that linear SVM finds a hyperplane that separates classes with maximum margin. When we apply the kernel trick, we're still finding a hyperplane—but in the high-dimensional feature space $\mathcal{H}$.
The hyperplane in $\mathcal{H}$ is: $$\langle\mathbf{w}, \phi(\mathbf{x})\rangle + b = 0$$
where $\mathbf{w} \in \mathcal{H}$ is the weight vector in feature space.
The representer theorem tells us: $$\mathbf{w} = \sum_{i=1}^n \alpha_i y_i \phi(\mathbf{x}_i)$$
So the decision function becomes: $$f(\mathbf{x}) = \left\langle \sum_i \alpha_i y_i \phi(\mathbf{x}_i), \phi(\mathbf{x}) \right\rangle + b = \sum_i \alpha_i y_i k(\mathbf{x}_i, \mathbf{x}) + b$$
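We can verify this identity numerically with the explicit quadratic map from earlier (a sketch with illustrative names and arbitrary $\alpha$ values): building $\mathbf{w} = \sum_i \alpha_i y_i \phi(\mathbf{x}_i)$ explicitly and taking $\langle\mathbf{w}, \phi(\mathbf{x})\rangle$ matches the kernel-only computation exactly.

```python
import numpy as np

def phi(x):
    """Explicit homogeneous quadratic map for 2D inputs."""
    return np.array([x[0]**2, x[1]**2, np.sqrt(2)*x[0]*x[1]])

def k(x, z):
    """Matching kernel: k(x, z) = (x^T z)^2."""
    return np.dot(x, z) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))               # stand-in "support vectors"
y = np.array([1, -1, 1, 1, -1])
alpha = rng.uniform(0.1, 1.0, size=5)     # arbitrary illustrative multipliers
x_new = np.array([0.5, -2.0])

w = sum(a * yi * phi(xi) for a, yi, xi in zip(alpha, y, X))              # weight vector in feature space
f_explicit = np.dot(w, phi(x_new))                                       # <w, phi(x)>
f_kernel = sum(a * yi * k(xi, x_new) for a, yi, xi in zip(alpha, y, X))  # kernel evaluations only
print(f_explicit, f_kernel, np.isclose(f_explicit, f_kernel))
```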
A linear decision boundary in high-dimensional feature space corresponds to a nonlinear decision boundary in the original input space. By choosing different kernels (different implicit feature maps), we control the complexity and shape of the decision boundary.
Consider data on the unit circle in 2D: $\mathbf{x} = (\cos\theta, \sin\theta)$ for various $\theta$. Assign class +1 to points with $x_1 x_2 > 0$ (the arcs in the first and third quadrants) and class -1 to points with $x_1 x_2 < 0$ (the arcs in the second and fourth quadrants).
No line separates these classes in 2D. But apply the quadratic feature map: $$\phi(x_1, x_2) = (x_1^2, \sqrt{2}x_1 x_2, x_2^2)$$
For a point on the unit circle: $$\phi(\cos\theta, \sin\theta) = (\cos^2\theta, \sqrt{2}\cos\theta\sin\theta, \sin^2\theta)$$
Using trigonometric identities: $$= \left(\frac{1 + \cos 2\theta}{2}, \frac{\sqrt{2}\sin 2\theta}{2}, \frac{1 - \cos 2\theta}{2}\right)$$
Remarkably, all points map onto a plane in 3D (the first and third coordinates sum to 1, since $x_1^2 + x_2^2 = 1$). Within that plane, the two classes are separated by the sign of the middle coordinate $\sqrt{2}\cos\theta\sin\theta = \tfrac{\sqrt{2}}{2}\sin 2\theta$: a single hyperplane in the 3D feature space, which corresponds to a nonlinear boundary back in the original 2D space.
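A short sketch (illustrative names) confirms the picture: with classes given by the sign of $x_1 x_2$, the mapped points all lie on a plane in 3D, and the middle feature coordinate separates the classes perfectly.

```python
import numpy as np

theta = np.linspace(0, 2*np.pi, 200, endpoint=False)
X = np.column_stack([np.cos(theta), np.sin(theta)])   # points on the unit circle
y = np.sign(X[:, 0] * X[:, 1])                        # +1 in quadrants 1 and 3, -1 in 2 and 4
mask = y != 0                                         # drop points exactly on the axes
X, y = X[mask], y[mask]

# Quadratic feature map phi(x1, x2) = (x1^2, sqrt(2)*x1*x2, x2^2)
Phi = np.column_stack([X[:, 0]**2, np.sqrt(2)*X[:, 0]*X[:, 1], X[:, 1]**2])

print(np.allclose(Phi[:, 0] + Phi[:, 2], 1.0))        # all mapped points lie on a plane in 3D
print(np.all(np.sign(Phi[:, 1]) == y))                # the middle coordinate separates the classes
```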
Different kernels produce different decision boundary shapes:
| Kernel | Decision Boundary Shape | Flexibility | Risk |
|---|---|---|---|
| Linear | Straight lines/hyperplanes | Low | Underfitting on nonlinear data |
| Polynomial (low degree) | Smooth curves (conics, cubics) | Medium | May underfit complex patterns |
| Polynomial (high degree) | Complex polynomial curves | High | Potential overfitting, numerical issues |
| RBF (large $\gamma$) | Highly local, irregular boundaries | Very high | Overfitting, memorization |
| RBF (small $\gamma$) | Smooth, slowly varying boundaries | Medium | Underfitting if too smooth |
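The sketch below (assuming scikit-learn is available) illustrates the RBF rows of the table on noisy two-moons data: a very large $\gamma$ pushes training accuracy toward 100% with a large fraction of points becoming support vectors (memorization), while a small $\gamma$ produces a much smoother, possibly underfit boundary.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)

for gamma in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)
    train_acc = clf.score(X, y)                      # training accuracy (overfitting indicator)
    n_sv = clf.support_vectors_.shape[0]             # model complexity indicator
    print(f"gamma={gamma:6}: train accuracy={train_acc:.3f}, support vectors={n_sv}/{len(X)}")
```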
Let's analyze the computational requirements of kernel SVMs to understand when they're practical.
Kernel matrix computation: $O(n^2 \cdot T_k)$, where $T_k$ is the cost of one kernel evaluation (typically $O(d)$ for standard kernels)
Solving the QP: depends on the algorithm—general-purpose QP solvers scale roughly cubically in $n$, while decomposition methods such as SMO work on a few variables at a time and are much faster in practice
Overall training: roughly $O(n^2 \cdot d + n^2 \cdot n_{iter})$ for SMO with standard kernels
Per test point: $O(n_{sv} \cdot T_k)$
For $m$ test points: $O(m \cdot n_{sv} \cdot d)$
The $O(n^2)$ kernel matrix is the main scalability limitation. For $n = 100,000$ samples, the matrix requires 80 GB (double precision). Approximation methods (Nyström, random Fourier features) address this by avoiding full kernel matrix computation—covered in Module 5.
| Dataset Size | Kernel Matrix Size | Feasibility | Recommendation |
|---|---|---|---|
| $n \leq 1,000$ | < 8 MB | Easy | Standard kernel SVM with any solver |
| $n \sim 10,000$ | < 800 MB | Moderate | SMO-based solvers (LIBSVM) |
| $n \sim 50,000$ | < 20 GB | Challenging | Sub-sampling or approximations |
| $n \sim 100,000$ | < 80 GB | Difficult | Nyström/RFF approximations required |
| $n > 100,000$ | > 80 GB | Prohibitive | Use approximate methods or linear SVM |
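The memory figures in the table are simple arithmetic (a quick sketch): an $n \times n$ matrix of 8-byte doubles needs $8n^2$ bytes.

```python
for n in [1_000, 10_000, 50_000, 100_000]:
    bytes_needed = 8 * n * n  # 8 bytes per double-precision entry
    print(f"n={n:>7,}: kernel matrix needs {bytes_needed / 1e6:,.0f} MB")
```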
Different kernels have different computational costs:
| Kernel | Evaluation Cost | Notes |
|---|---|---|
| Linear | $O(d)$ | Just dot product |
| Polynomial | $O(d)$ | Dot product + power |
| RBF | $O(d)$ | Difference + squared norm + exp |
| String kernels | $O(L^2)$ or worse | $L$ = string length; often expensive |
| Graph kernels | Problem-dependent | Can be very expensive |
For specialized kernels (strings, graphs, etc.), the kernel evaluation cost can dominate. In such cases, precomputing and caching the kernel matrix (if it fits in memory) dramatically speeds up training.
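For example, scikit-learn's SVC accepts a precomputed Gram matrix directly via kernel='precomputed'. The sketch below uses a hypothetical stand-in `my_expensive_kernel` (here just an RBF so the example runs) in place of a genuinely expensive string or graph kernel.

```python
import numpy as np
from sklearn.svm import SVC

def my_expensive_kernel(a, b):
    """Stand-in for an expensive kernel (string/graph kernels in practice)."""
    return np.exp(-0.5 * np.sum((a - b) ** 2))

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(60, 4)), rng.integers(0, 2, size=60)
X_test = rng.normal(size=(10, 4))

# Precompute Gram matrices once: train x train for fitting, test x train for prediction
K_train = np.array([[my_expensive_kernel(a, b) for b in X_train] for a in X_train])
K_test = np.array([[my_expensive_kernel(a, b) for b in X_train] for a in X_test])

clf = SVC(kernel="precomputed", C=1.0).fit(K_train, y_train)
print(clf.predict(K_test))  # predictions use only the cached kernel values
```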
We've now fully understood the kernel trick—the substitution that transforms SVMs from linear classifiers into powerful nonlinear learning machines.
In the next page, we'll explore the most important kernel functions in detail: linear, polynomial, RBF, and sigmoid kernels. We'll understand their properties, how to choose between them, and their hyperparameters' effects on the decision boundary.
You now understand the kernel substitution—the mechanism that enables nonlinear SVMs. This single conceptual shift, replacing inner products with kernel functions, unlocks the full power of kernel methods across machine learning.