In the previous modules, we developed the mathematical machinery of Support Vector Machines: the maximum margin principle, the hard margin formulation, and the soft margin extension with slack variables. We derived both primal and dual optimization problems, establishing the Karush-Kuhn-Tucker (KKT) conditions that characterize optimal solutions.
Through all of this work, we remained in the realm of linear classifiers. The decision boundary was always a hyperplane: $\mathbf{w}^T\mathbf{x} + b = 0$. But real-world classification problems are often fundamentally nonlinear. Consider separating images of cats from dogs, detecting fraudulent transactions among millions of legitimate ones, or classifying proteins by their three-dimensional structure. In these problems, no hyperplane in the original feature space can adequately separate the classes.
The remarkable insight that transforms SVMs from elegant linear classifiers into universal function approximators lies in rewriting the problem in its dual form. This seemingly algebraic manipulation—converting from optimization over $\mathbf{w}$ and $b$ to optimization over Lagrange multipliers $\alpha_i$—unlocks a profound capability: the ability to operate entirely through inner products between data points, which in turn enables the kernel trick.
By the end of this page, you will understand precisely why the dual formulation is essential for kernel SVMs. You'll see how the dual problem's structure—expressing everything through inner products—is not merely a mathematical curiosity but the architectural foundation that enables nonlinear classification. You'll appreciate the conceptual leap from 'optimizing weight vectors' to 'optimizing combinations of training points.'
Let us revisit the two equivalent formulations of the SVM problem to understand what each representation reveals.
For soft-margin SVM, the primal optimization problem is:
$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^n \xi_i$$
subject to: $$y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \ldots, n$$
Here, we optimize directly over the weight vector $\mathbf{w} \in \mathbb{R}^d$, the bias $b \in \mathbb{R}$, and the slack variables $\xi_1, \ldots, \xi_n$.
The number of optimization variables is $d + 1 + n$. When the feature dimension $d$ is very large (or even infinite, as we'll soon encounter), this formulation becomes computationally intractable or impossible.
By introducing Lagrange multipliers and applying the KKT conditions, we transform this into the dual problem:
$$\max_{\boldsymbol{\alpha}} \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j$$
subject to: $$\sum_{i=1}^n \alpha_i y_i = 0, \quad 0 \leq \alpha_i \leq C, \quad i = 1, \ldots, n$$
Now we optimize over the Lagrange multipliers $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_n)$, with exactly $n$ optimization variables regardless of feature dimensionality. The weight vector is recovered as:
$$\mathbf{w} = \sum_{i=1}^n \alpha_i y_i \mathbf{x}_i$$
This is the representer theorem in action: the optimal solution is a linear combination of the training examples, weighted by the Lagrange multipliers.
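As a minimal sketch (with hypothetical toy values for `alpha`, `y`, and `X`), recovering $\mathbf{w}$ from a dual solution is a single weighted sum over the training points:

```python
import numpy as np

# Hypothetical toy values: a solved dual vector alpha, labels y in {-1, +1},
# and a data matrix X of shape (n_samples, n_features).
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
y = np.array([1, -1, 1, 1, -1])
alpha = np.array([0.0, 0.3, 0.0, 0.5, 0.2])   # in practice most entries are zero

# Representer theorem: w = sum_i alpha_i * y_i * x_i
w = (alpha * y) @ X
print(w.shape)   # (3,) -- lives in the same space as a single feature vector
```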
Look carefully at the dual objective function. The data points $\mathbf{x}_i$ and $\mathbf{x}_j$ appear only as inner products $\mathbf{x}_i^T\mathbf{x}_j$. Neither individual coordinates nor the dimensionality $d$ appear directly. This structure is not an accident—it's the mathematical gateway to nonlinear SVMs.
| Aspect | Primal Formulation | Dual Formulation |
|---|---|---|
| Optimization variables | $\mathbf{w}, b, \boldsymbol{\xi}$ — total: $d + 1 + n$ | $\boldsymbol{\alpha}$ — total: $n$ |
| Depends on feature dim $d$ | Yes — directly in $\mathbf{w}$ | No — only through inner products |
| Data appears as | Individual feature vectors $\mathbf{x}_i$ | Inner products $\mathbf{x}_i^T\mathbf{x}_j$ |
| When $d > n$ | More variables than samples | Fewer variables than features |
| Infinite $d$ possible? | No — cannot represent $\mathbf{w}$ | Yes — through kernel trick |
| Solution sparsity | Not directly visible | Sparsity in $\alpha_i$ (support vectors) |
The dual formulation's dependence on inner products is not merely algebraic convenience—it represents a fundamental shift in how we view the classification problem.
In the primal formulation, we think of each data point $\mathbf{x}_i$ as a vector of coordinates in $\mathbb{R}^d$. The algorithm manipulates these coordinates directly to find the optimal hyperplane.
In the dual formulation, we think of each data point through its relationships with all other data points. The inner product $\mathbf{x}_i^T\mathbf{x}_j$ measures the similarity between points $i$ and $j$ in a very specific sense:
$$\mathbf{x}_i^T\mathbf{x}_j = \|\mathbf{x}_i\| \cdot \|\mathbf{x}_j\| \cdot \cos(\theta_{ij})$$
where $\theta_{ij}$ is the angle between the two vectors. Points that are aligned (small angle) have large inner products; orthogonal points have zero inner product; opposite points have negative inner products.
The matrix $K$ with entries $K_{ij} = \mathbf{x}_i^T\mathbf{x}_j$ is called the Gram matrix (or Gramian). It captures all pairwise similarities in the dataset and completely encodes the geometry relevant to SVM. If we have the Gram matrix, we can solve the dual problem without ever accessing the original coordinates.
Let's interpret the dual objective function:
$$L_D(\boldsymbol{\alpha}) = \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j$$
The first term $\sum_i \alpha_i$ encourages large weights—it wants to 'turn on' many support vectors. The second term penalizes configurations where similarly-labeled points ($y_i = y_j$) that are aligned (large $\mathbf{x}_i^T\mathbf{x}_j$) both have large weights. This tension drives the optimization to find the minimal set of support vectors that define the maximum margin boundary.
Define the signed kernel matrix $\tilde{K}$ with entries $\tilde{K}_{ij} = y_i y_j K_{ij} = y_i y_j \mathbf{x}_i^T\mathbf{x}_j$. Then the dual objective becomes:
$$L_D(\boldsymbol{\alpha}) = \mathbf{1}^T\boldsymbol{\alpha} - \frac{1}{2}\boldsymbol{\alpha}^T\tilde{K}\boldsymbol{\alpha}$$
This is a standard quadratic programming problem. The Gram matrix $K$ (and hence $\tilde{K}$) is positive semidefinite for inner products of real vectors, so the dual objective is concave and every local maximum is a global maximum.
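Before the fuller Gram-matrix example below, here is a minimal sketch (with hypothetical random data and a non-optimal candidate `alpha`) of evaluating this objective in its matrix form:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 4))                 # hypothetical data
y = rng.choice([-1, 1], size=20)                 # hypothetical labels
alpha = rng.uniform(0.0, 1.0, size=20)           # a candidate (not optimal) dual vector

K_tilde = np.outer(y, y) * (X @ X.T)             # signed Gram matrix
L_D = alpha.sum() - 0.5 * alpha @ K_tilde @ alpha
print(f"Dual objective at this alpha: {L_D:.4f}")
```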
```python
import numpy as np


def compute_gram_matrix(X: np.ndarray) -> np.ndarray:
    """
    Compute the Gram matrix of inner products for dataset X.

    Parameters:
        X: Data matrix of shape (n_samples, n_features).
           Each row is a data point.

    Returns:
        K: Gram matrix of shape (n_samples, n_samples), where
           K[i, j] = X[i].T @ X[j] = inner product between the i-th and j-th points.

    The Gram matrix fully encodes the pairwise geometry of the data.
    For SVM, this is all we need—individual coordinates are not required.
    """
    # Efficient vectorized computation: K = X @ X.T
    # Each entry K[i, j] = sum over d of X[i, d] * X[j, d] = <x_i, x_j>
    K = X @ X.T
    return K


def compute_signed_gram_matrix(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """
    Compute the signed Gram matrix for the SVM dual formulation.

    Parameters:
        X: Data matrix of shape (n_samples, n_features)
        y: Labels of shape (n_samples,), values in {-1, +1}

    Returns:
        K_signed: Signed Gram matrix where K_signed[i, j] = y[i] * y[j] * <x_i, x_j>.

    This is the matrix that appears in the quadratic term of the dual objective.
    """
    K = compute_gram_matrix(X)
    # Outer product of labels creates the sign pattern
    y_outer = np.outer(y, y)
    K_signed = K * y_outer
    return K_signed


# Example: visualizing the Gram matrix structure
np.random.seed(42)
n_samples, n_features = 100, 50
X = np.random.randn(n_samples, n_features)
K = compute_gram_matrix(X)

print(f"Data shape: {X.shape}")
print(f"Gram matrix shape: {K.shape}")
print(f"Gram matrix is symmetric: {np.allclose(K, K.T)}")
print(f"Gram matrix rank: {np.linalg.matrix_rank(K)}")

# Verify positive semidefiniteness
eigenvalues = np.linalg.eigvalsh(K)
print(f"All eigenvalues ≥ 0: {np.all(eigenvalues >= -1e-10)}")
print(f"Minimum eigenvalue: {eigenvalues.min():.6f}")
```

The dual formulation's inner product structure extends beautifully to prediction. Once we've solved for the optimal $\boldsymbol{\alpha}^*$, how do we classify a new point $\mathbf{x}$?
Recall that the optimal weight vector is: $$\mathbf{w}^* = \sum_{i=1}^n \alpha_i^* y_i \mathbf{x}_i$$
The decision function is: $$f(\mathbf{x}) = \mathbf{w}^{*T}\mathbf{x} + b^* = \sum_{i=1}^n \alpha_i^* y_i \mathbf{x}_i^T\mathbf{x} + b^*$$
The predicted label is $\text{sign}(f(\mathbf{x}))$.
The test point $\mathbf{x}$ appears only through its inner products with training points $\mathbf{x}_i^T\mathbf{x}$.
To classify a new point, we don't need to know its coordinates. We only need to know its similarity (inner product) with each training point. Combined with the fact that the dual optimization also uses only inner products, we have a complete algorithm that operates entirely in terms of similarities.
This is the first half of the kernel trick: if we can compute similarities in some implicit feature space without computing the features themselves, the entire training and prediction pipeline carries over unchanged.
In practice, most $\alpha_i^* = 0$. Only points with $\alpha_i^* > 0$ contribute to the decision function—these are the support vectors. For a typical dataset, 5-30% of training points become support vectors. So prediction involves summing over support vectors only, not all training points.
The bias $b^*$ is computed using the KKT complementary slackness conditions. For any support vector $\mathbf{x}_s$ with $0 < \alpha_s^* < C$ (lying exactly on the margin), we have:
$$y_s(\mathbf{w}^{*T}\mathbf{x}_s + b^*) = 1$$
Solving for $b^*$:
$$b^* = y_s - \mathbf{w}^{*T}\mathbf{x}_s = y_s - \sum_{i=1}^n \alpha_i^* y_i \mathbf{x}_i^T\mathbf{x}_s$$
Again, the computation involves only inner products. In practice, $b^*$ is averaged over all margin support vectors for numerical stability:
$$b^* = \frac{1}{|M|} \sum_{s \in M} \left( y_s - \sum_{i=1}^n \alpha_i^* y_i \mathbf{x}_i^T\mathbf{x}_s \right)$$
where $M = \{s : 0 < \alpha_s^* < C\}$ is the set of free support vectors.
```python
import numpy as np


class DualSVM:
    """
    SVM classifier using the dual formulation.
    Demonstrates that classification uses only inner products.
    """

    def __init__(self, C: float = 1.0):
        self.C = C
        self.alpha = None            # Lagrange multipliers
        self.support_vectors = None
        self.support_labels = None
        self.support_alphas = None
        self.b = None

    def fit(self, X: np.ndarray, y: np.ndarray) -> 'DualSVM':
        """
        Fit the SVM by solving the dual QP problem.
        (Simplified implementation for illustration)
        """
        # Compute Gram matrix - this is where the feature dimension enters
        K = X @ X.T

        # In practice, we'd use a QP solver here (cvxopt, OSQP, etc.)
        # For illustration, we use a simplified projected gradient ascent
        alpha = self._solve_dual_qp(K, y)

        # Store results
        self.alpha = alpha

        # Identify support vectors (alpha > small threshold)
        sv_mask = alpha > 1e-7
        self.support_vectors = X[sv_mask]
        self.support_labels = y[sv_mask]
        self.support_alphas = alpha[sv_mask]

        # Compute bias using free support vectors (0 < alpha < C)
        free_sv_mask = (alpha > 1e-7) & (alpha < self.C - 1e-7)
        if np.any(free_sv_mask):
            # For free SVs, the constraint is tight: y_s (w·x_s + b) = 1
            b_values = []
            for idx in np.where(free_sv_mask)[0]:
                # w·x_s = sum_i alpha_i y_i (x_i · x_s)
                w_dot_xs = np.sum(self.support_alphas * self.support_labels
                                  * (self.support_vectors @ X[idx]))
                b_values.append(y[idx] - w_dot_xs)
            self.b = np.mean(b_values)
        else:
            # Fallback: no free support vectors found (simplified)
            self.b = 0.0

        return self

    def decision_function(self, X: np.ndarray) -> np.ndarray:
        """
        Compute decision function values for input points.

        Note: classification uses ONLY inner products between test points
        and support vectors. No explicit feature computation is needed.
        """
        # For each test point x, compute:
        #   f(x) = sum_s alpha_s y_s (x_s · x) + b
        # This is the key: x appears only through inner products!

        # Inner products between test points and support vectors,
        # shape (n_test, n_support_vectors)
        K_test = X @ self.support_vectors.T

        # Weight by alpha * y for each support vector, shape (n_sv,)
        weighted = self.support_alphas * self.support_labels

        # Sum over support vectors
        decision_values = K_test @ weighted + self.b
        return decision_values

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict class labels for samples in X."""
        return np.sign(self.decision_function(X))

    def _solve_dual_qp(self, K: np.ndarray, y: np.ndarray) -> np.ndarray:
        """
        Solve the dual QP problem (simplified SMO-like approach).
        In practice, use cvxopt, OSQP, or libsvm.
        """
        n = K.shape[0]
        alpha = np.zeros(n)

        # Signed Gram matrix
        Q = np.outer(y, y) * K

        # Simplified projected coordinate ascent (actual SMO is more sophisticated)
        for _ in range(1000):
            for i in range(n):
                # Gradient of the dual objective with respect to alpha_i
                grad = 1 - Q[i] @ alpha
                # Gradient step, then clip to the box constraint [0, C]
                alpha[i] = np.clip(alpha[i] + 0.01 * grad, 0, self.C)

        # Note: the equality constraint sum_i alpha_i y_i = 0 is not enforced here;
        # real SMO updates alphas in pairs so that it holds exactly.
        return alpha


# Demonstration
if __name__ == "__main__":
    np.random.seed(42)

    # Generate linearly separable data
    X_pos = np.random.randn(50, 2) + np.array([2, 2])
    X_neg = np.random.randn(50, 2) + np.array([-2, -2])
    X = np.vstack([X_pos, X_neg])
    y = np.array([1] * 50 + [-1] * 50)

    # Fit SVM
    svm = DualSVM(C=1.0)
    svm.fit(X, y)

    print(f"Number of support vectors: {len(svm.support_vectors)}")
    print(f"Fraction of training data: {len(svm.support_vectors) / len(X):.1%}")

    # Predict on new points
    X_test = np.array([[0, 0], [3, 3], [-3, -3]])
    predictions = svm.predict(X_test)
    print(f"Predictions: {predictions}")
```

We've established that the dual SVM formulation operates entirely through inner products. Now let's see why this is transformative.
Many real-world datasets are not linearly separable. Consider the XOR problem: points at $(1,1)$ and $(-1,-1)$ belong to class +1, while points at $(1,-1)$ and $(-1,1)$ belong to class -1. No line can separate these classes.
The classical solution is to map the data to a higher-dimensional space where the classes become linearly separable. Define a feature map $\phi: \mathbb{R}^2 \to \mathbb{R}^3$:
$$\phi(x_1, x_2) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$$
In this new space, the XOR pattern becomes linearly separable! The hyperplane $z_2 = 0$ (equivalently, $x_1 x_2 = 0$) separates the classes: both class $+1$ points have $x_1 x_2 = +1$, while both class $-1$ points have $x_1 x_2 = -1$.
For complex problems, we might need feature spaces with thousands, millions, or even infinitely many dimensions. Explicitly computing and storing these high-dimensional feature vectors would be computationally prohibitive—or outright impossible for infinite-dimensional spaces.
Here's where the dual formulation's inner product structure becomes magical. Suppose we apply feature map $\phi$ to all data points and run SVM in the new space. The dual objective becomes:
$$L_D(\boldsymbol{\alpha}) = \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j)$$
We need inner products in the feature space: $\langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle$.
The key insight: For certain feature maps, we can compute the inner product in feature space directly from the original coordinates, without ever computing the features themselves!
For the quadratic feature map above: $$\langle \phi(\mathbf{x}), \phi(\mathbf{z}) \rangle = x_1^2 z_1^2 + 2x_1 x_2 z_1 z_2 + x_2^2 z_2^2 = (x_1 z_1 + x_2 z_2)^2 = (\mathbf{x}^T\mathbf{z})^2$$
The inner product in 3D feature space equals the squared inner product in original 2D space!
Define the kernel function $k(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T\mathbf{z})^2$. We can then solve the dual problem by replacing every inner product $\mathbf{x}_i^T\mathbf{x}_j$ with $k(\mathbf{x}_i, \mathbf{x}_j)$, and classify new points by replacing each $\mathbf{x}_i^T\mathbf{x}$ with $k(\mathbf{x}_i, \mathbf{x})$.
All without ever computing the 3D feature vectors!
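Here is a small sanity-check sketch of that claim, using the quadratic feature map and kernel defined above (the local names `phi` and `k_quad` are just for illustration) and the four XOR points:

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map: R^2 -> R^3."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def k_quad(x, z):
    """Quadratic kernel: the same inner product, computed without building phi."""
    return (x @ z) ** 2

# The four XOR points and their labels
X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
y = np.array([1, 1, -1, -1])

# Both routes to the feature-space inner product agree on every pair
for xi in X:
    for xj in X:
        assert np.isclose(phi(xi) @ phi(xj), k_quad(xi, xj))

# The middle feature sqrt(2)*x1*x2 alone separates the classes
print([round(phi(x)[1], 3) for x in X])   # [1.414, 1.414, -1.414, -1.414] matches y
```

The identity holds for every pair, and the sign of the middle feature matches the labels, mirroring the separating hyperplane $z_2 = 0$ noted earlier.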
| Approach | Feature Space | Computation Required | Feasibility |
|---|---|---|---|
| Explicit mapping | $\phi(\mathbf{x})$ in $\mathbb{R}^D$ | Compute all $n$ feature vectors, then all $n^2$ inner products | Requires $O(nD)$ storage; impossible if $D = \infty$ |
| Kernel trick | Implicit via $k(\mathbf{x}, \mathbf{z})$ | Compute $n^2$ kernel evaluations directly | $O(n^2)$ regardless of feature space dimension |
| Example: polynomial (degree $p$) | $D = \binom{d+p}{p}$ monomial features (huge) | Explicit: compute all $D$ features per point | Kernel: $(\mathbf{x}^T\mathbf{z} + c)^p$ in $O(d)$, regardless of $D$ |
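To make that last row concrete, here is a small sketch (with hypothetical values $d = 100$, $p = 5$) comparing the explicit feature count to the cost of a single kernel evaluation:

```python
import numpy as np
from math import comb

d, p = 100, 5                      # hypothetical input dimension and polynomial degree
D = comb(d + p, p)                 # number of monomials of degree <= p
print(f"Explicit feature space dimension: {D:,}")   # roughly 96.5 million

# One kernel evaluation touches only the d original coordinates
rng = np.random.default_rng(0)
x, z = rng.standard_normal(d), rng.standard_normal(d)
c = 1.0
k_val = (x @ z + c) ** p           # O(d) work; no 96-million-dimensional vectors
print(f"Kernel value: {k_val:.4f}")
```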
Before we dive into specific kernel functions in the next page, let's establish the mathematical criteria that make a function a valid kernel.
Not every function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ corresponds to an inner product in some feature space. A function is a valid kernel (also called a positive definite kernel or Mercer kernel) if and only if it satisfies:
Definition (Positive Definiteness): A kernel function $k$ is positive definite if for any finite set of points $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$, the Gram matrix $K$ with entries $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$ is positive semidefinite.
Equivalently, for all choices of real numbers $c_1, \ldots, c_n$: $$\sum_{i=1}^n \sum_{j=1}^n c_i c_j k(\mathbf{x}_i, \mathbf{x}_j) \geq 0$$
Mercer's theorem guarantees that every positive definite kernel corresponds to an inner product in some (possibly infinite-dimensional) Hilbert space. This means there exists a feature map $\phi$ such that $k(\mathbf{x}, \mathbf{z}) = \langle \phi(\mathbf{x}), \phi(\mathbf{z}) \rangle$. The theorem provides the theoretical foundation for kernel methods.
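A practical consequence of this definition: we can spot-check a candidate kernel by building its Gram matrix on a sample of points and inspecting the eigenvalues. Passing the check on one sample is necessary but not sufficient for validity; the helper `is_psd_on_sample` below is a hypothetical sketch:

```python
import numpy as np

def is_psd_on_sample(kernel, X, tol=1e-10):
    """Return True if the Gram matrix of `kernel` on the sample X is PSD (up to tol)."""
    n = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 5))

linear = lambda x, z: x @ z
quadratic = lambda x, z: (x @ z) ** 2
negated = lambda x, z: -(x @ z)          # negation destroys positive definiteness

print(is_psd_on_sample(linear, X))       # True
print(is_psd_on_sample(quadratic, X))    # True
print(is_psd_on_sample(negated, X))      # False
```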
Valid kernels satisfy several useful closure properties:
Scaling: If $k$ is a kernel and $c > 0$, then $c \cdot k$ is a kernel.
Sum: If $k_1$ and $k_2$ are kernels, then $k_1 + k_2$ is a kernel.
Product: If $k_1$ and $k_2$ are kernels, then $k_1 \cdot k_2$ is a kernel.
Polynomial of kernel: If $k$ is a kernel and $p(x) = \sum_i a_i x^i$ with $a_i \geq 0$, then $p(k)$ is a kernel.
Exponential of kernel: If $k$ is a kernel, then $\exp(k)$ is a kernel.
Normalization: If $k$ is a kernel, then $\tilde{k}(\mathbf{x}, \mathbf{z}) = \frac{k(\mathbf{x}, \mathbf{z})}{\sqrt{k(\mathbf{x}, \mathbf{x}) k(\mathbf{z}, \mathbf{z})}}$ is a kernel.
These properties allow us to construct complex kernels from simpler building blocks, designing similarity measures tailored to specific problem domains.
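A sketch of these closure rules in code (the helper names are illustrative only; each construction starts from the plain linear kernel, so the results are guaranteed valid by the properties above):

```python
import numpy as np

linear = lambda x, z: x @ z                              # the basic linear kernel

# Closure rules in action:
scaled = lambda x, z: 0.5 * linear(x, z)                 # scaling by c > 0
poly = lambda x, z: (linear(x, z) + 1.0) ** 3            # sums, products, nonnegative powers
exp_k = lambda x, z: np.exp(scaled(x, z))                # exponential of a kernel

def normalized(k):
    """Normalization rule: k(x, z) / sqrt(k(x, x) * k(z, z))."""
    return lambda x, z: k(x, z) / np.sqrt(k(x, x) * k(z, z))

cosine_like = normalized(linear)    # normalizing the linear kernel gives cosine similarity

x, z = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(round(cosine_like(x, z), 4))  # a value in [-1, 1]
```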
Understanding when the dual formulation provides computational advantages is crucial for practical applications.
The primal formulation scales with feature dimension $d$; the dual scales with sample size $n$. This creates a fundamental tradeoff:
Primal is preferred when the feature dimension $d$ is small or moderate, the sample size $n$ is very large, and a plain linear decision boundary is sufficient.
Dual is preferred when the feature dimension $d$ is much larger than the sample size $n$, or when a nonlinear decision boundary through kernels is required.
For kernel SVMs, the dual is mandatory when the implicit feature space is infinite-dimensional (e.g., RBF kernel).
| Operation | Primal Complexity | Dual Complexity | Notes |
|---|---|---|---|
| Training | $O(nd + d^3)$ or iterative | $O(n^3)$ or SMO: $O(n^2)$ per pass | SMO makes dual practical for large $n$ |
| Storing model | $O(d)$ for $\mathbf{w}$ | $O(n_{sv} \cdot d)$ for support vectors | $n_{sv}$ typically small fraction of $n$ |
| Prediction (one point) | $O(d)$ | $O(n_{sv} \cdot d)$ or $O(n_{sv})$ with kernel | Kernel form avoids explicit $d$ |
| Memory (training) | $O(nd)$ | $O(n^2)$ for kernel matrix | Can be reduced with matrix-free methods |
The $O(n^2)$ storage for the kernel matrix becomes impractical for large datasets (e.g., $n = 100,000$ requires 80GB for double-precision floats). Modern methods use approximations: Nyström method, random Fourier features, and low-rank decompositions. We'll explore these in the optimization module.
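A quick back-of-the-envelope sketch of that storage estimate:

```python
n = 100_000                      # training set size
bytes_per_entry = 8              # double-precision float
kernel_matrix_gb = n * n * bytes_per_entry / 1e9
print(f"Dense kernel matrix: {kernel_matrix_gb:.0f} GB")   # 80 GB
```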
A remarkable property of SVMs is that the solution is sparse in the dual representation: most $\alpha_i^* = 0$. Only the support vectors (points on or within the margin) have non-zero weights.
This sparsity has practical benefits:
Compact model: We only need to store support vectors, not all training data.
Fast prediction: Summation is over $n_{sv}$ terms, not $n$ terms.
Interpretability: Support vectors are the 'critical' training examples that define the decision boundary.
Generalization insight: Fewer support vectors relative to training set size often indicates better generalization.
The sparsity level depends on the problem difficulty and the regularization parameter $C$. Harder problems (more overlap) typically require more support vectors.
We've established why the dual formulation is not just an alternative to the primal—it's the architectural foundation that enables the extraordinary power of kernel SVMs.
In the next page, we'll explore the kernel substitution in detail: how to replace inner products with kernel functions, the mathematical justification, and the practical implications. We'll see how this substitution transforms linear SVMs into powerful nonlinear classifiers capable of learning complex decision boundaries.
You now understand why the dual formulation is essential for kernel SVMs. The inner product structure isn't just mathematical elegance—it's the key that unlocks nonlinear classification through implicit feature mapping.