In the previous page, we derived the SVM dual problem through Lagrangian methods. Now we examine this dual problem in detail—understanding its structure, why it's often preferred over the primal, and how it connects to the key concepts of support vectors and kernels.
The dual perspective is not merely an alternative computational approach; it reveals fundamental aspects of the SVM solution that are hidden in the primal formulation. Most importantly, the dual shows that the solution depends only on a sparse subset of training points (the support vectors) and that the data enter only through inner products, which is what makes the kernel trick possible.
This page provides the deep understanding of the dual that underlies modern SVM implementations and kernel methods.
By the end of this page, you will: (1) fully understand the dual problem structure, (2) compare primal vs dual in terms of complexity and applicability, (3) understand strong duality and its implications, (4) see how kernels naturally emerge, and (5) analyze support vector identification through dual variables.
Let's restate the dual problem derived in the previous page and examine each component carefully.
$$\begin{aligned} \max_{\boldsymbol{\alpha} \in \mathbb{R}^n} \quad & W(\boldsymbol{\alpha}) = \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i \alpha_j y_i y_j (\mathbf{x}_i^\top\mathbf{x}_j) \\ \text{subject to} \quad & \alpha_i \geq 0, \quad i = 1,\ldots,n \\ & \sum_{i=1}^n \alpha_i y_i = 0 \end{aligned}$$
Matrix-Vector Form:
Define the matrix $\mathbf{Q} \in \mathbb{R}^{n \times n}$ with entries $Q_{ij} = y_i y_j (\mathbf{x}_i^\top\mathbf{x}_j)$, and let $\boldsymbol{\alpha} = (\alpha_1,\ldots,\alpha_n)^\top$, $\mathbf{y} = (y_1,\ldots,y_n)^\top$, and $\mathbf{1}$ be the all-ones vector in $\mathbb{R}^n$. Then:
$$\begin{aligned} \max_{\boldsymbol{\alpha}} \quad & W(\boldsymbol{\alpha}) = \mathbf{1}^\top\boldsymbol{\alpha} - \frac{1}{2}\boldsymbol{\alpha}^\top\mathbf{Q}\boldsymbol{\alpha} \\ \text{s.t.} \quad & \boldsymbol{\alpha} \geq \mathbf{0} \\ & \mathbf{y}^\top\boldsymbol{\alpha} = 0 \end{aligned}$$
| Component | Expression | Interpretation |
|---|---|---|
| Variables | $\boldsymbol{\alpha} \in \mathbb{R}^n$ | One multiplier per training example |
| Linear term | $\mathbf{1}^\top\boldsymbol{\alpha}$ | Encourages non-zero multipliers |
| Quadratic term | $\boldsymbol{\alpha}^\top\mathbf{Q}\boldsymbol{\alpha}$ | Penalizes similar examples with large weights |
| Nonnegativity constraints | $\alpha_i \geq 0$ | Dual feasibility (from Lagrangian) |
| Equality constraint | $\mathbf{y}^\top\boldsymbol{\alpha} = 0$ | From stationarity wrt bias $b$ |
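To make the matrix-vector form concrete, here is a minimal NumPy sketch that builds $\mathbf{Q}$ and evaluates $W(\boldsymbol{\alpha})$. The toy data and the trial $\boldsymbol{\alpha}$ are illustrative, not taken from the text.

```python
import numpy as np

# Toy data: n = 4 points in d = 2 dimensions, labels in {-1, +1} (illustrative).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Q_ij = y_i y_j <x_i, x_j>: element-wise product of the label outer product
# with the Gram matrix of the inputs.
Q = np.outer(y, y) * (X @ X.T)

def dual_objective(alpha, Q):
    """W(alpha) = 1^T alpha - (1/2) alpha^T Q alpha."""
    return alpha.sum() - 0.5 * alpha @ Q @ alpha

# A feasible trial point: alpha >= 0 and sum_i alpha_i y_i = 0.
alpha = np.array([0.1, 0.1, 0.1, 0.1])
print(dual_objective(alpha, Q))
```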
The dual problem has several important structural properties that make it amenable to efficient solution.
1. Concavity of the Objective:
The dual objective $W(\boldsymbol{\alpha}) = \mathbf{1}^\top\boldsymbol{\alpha} - \frac{1}{2}\boldsymbol{\alpha}^\top\mathbf{Q}\boldsymbol{\alpha}$ is a concave quadratic function because $\mathbf{Q}$ is positive semidefinite: writing $\mathbf{z}_i = y_i\mathbf{x}_i$, we have $Q_{ij} = \mathbf{z}_i^\top\mathbf{z}_j$, so $\mathbf{Q}$ is a Gram matrix and $\boldsymbol{\alpha}^\top\mathbf{Q}\boldsymbol{\alpha} = \left\|\sum_i \alpha_i \mathbf{z}_i\right\|^2 \geq 0$. The quadratic term thus enters with a negative sign, and the linear term is affine.
Maximizing a concave function is equivalent to minimizing a convex function, so the dual is a convex optimization problem.
2. Convexity of the Feasible Region:
The constraints define a convex polyhedron:
$$\mathcal{F}_D = \{\boldsymbol{\alpha} : \boldsymbol{\alpha} \geq \mathbf{0},\ \mathbf{y}^\top\boldsymbol{\alpha} = 0\}$$
The dual is a concave QP (equivalently, a convex minimization QP). This guarantees that every local maximum is a global maximum and that standard convex QP solvers converge to it reliably.
3. Boundedness:
The feasible region $\mathcal{F}_D$ is unbounded (it extends to infinity in certain directions). However, for linearly separable data the objective is bounded above on $\mathcal{F}_D$, so the maximum is finite and attained.
4. Uniqueness:
The optimal $\boldsymbol{\alpha}^*$ is not always unique because $\mathbf{Q}$ may be only PSD (not strictly positive definite). However, the resulting primal solution $\mathbf{w}^* = \sum_i \alpha_i^* y_i \mathbf{x}_i$ is unique, so every dual optimum yields the same classifier.
5. Constraint Qualification:
Slater's condition requires a strictly feasible point (satisfies all inequalities strictly). For SVM dual with $n > 1$ examples from both classes, such a point exists (e.g., small $\alpha_i$ summing appropriately), so strong duality holds.
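As a numerical sanity check on property 1, the following sketch (same toy data as above) verifies that $\mathbf{Q}$ is positive semidefinite by inspecting its eigenvalues:

```python
import numpy as np

# Rebuild Q for the toy data above. Writing z_i = y_i * x_i, Q = Z Z^T is a
# Gram matrix, so it must be positive semidefinite (all eigenvalues >= 0).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
Z = y[:, None] * X                      # rows are y_i * x_i
Q = Z @ Z.T                             # same as np.outer(y, y) * (X @ X.T)

eigvals = np.linalg.eigvalsh(Q)
print(eigvals)                          # all >= 0 up to round-off
assert eigvals.min() >= -1e-10
```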
Understanding when to solve the primal versus the dual is crucial for practical SVM implementation.
Dimension Analysis:
| Aspect | Primal | Dual |
|---|---|---|
| Variables | $d + 1$ (features + bias) | $n$ (examples) |
| Inequality constraints | $n$ | $n$ (nonnegativity) |
| Equality constraints | 0 | 1 |
| QP matrix size | $(d+1) \times (d+1)$ | $n \times n$ |
When Primal Is Preferred:
Use the primal when $d < n$ (fewer features than examples) and the kernel is linear. For very large $n$ with linear kernels, specialized primal methods (e.g., LIBLINEAR) are fastest.
When Dual Is Preferred:
Use the dual when $d \geq n$, when using kernels (the kernel trick requires the dual), or when you need explicit support vector identification.
Complexity Comparison:
| Operation | Primal | Dual |
|---|---|---|
| QP matrix formation | $O(nd)$ | $O(n^2d)$ |
| QP solution (IP) | $O((d+1)^3 \sqrt{n})$ | $O(n^3)$ |
| Kernel evaluation | N/A | Already in $\mathbf{Q}$ |
| Prediction (linear) | $O(d)$ | $O(n_{SV} \cdot d)$ |
| Prediction (kernel) | N/A | $O(n_{SV})$ kernel evals |
| Scenario | Recommended | Reasoning |
|---|---|---|
| $d = 10$, $n = 100,000$, linear | Primal | Many examples, few features |
| $d = 10,000$, $n = 1,000$, linear | Either | Moderate dimensions both ways |
| $d = $ any, kernel SVM | Dual | Kernel trick requires dual |
| Text data, $d = 50,000$, $n = 5,000$ | Dual | High $d$, kernel if nonlinear |
| Images, $d = 10^6$, $n = 10,000$ | Dual + kernel | Explicit features impractical |
Strong duality is the remarkable result that for SVM, the primal and dual have the same optimal value. This is not true for all optimization problems and is crucial for SVM theory.
Statement:
Let $p^* = \frac{1}{2}\|\mathbf{w}^*\|^2$ be the optimal primal value and $d^* = W(\boldsymbol{\alpha}^*)$ be the optimal dual value. Then:
$$p^* = d^*$$
Why Strong Duality Holds:
The primal is a convex problem (convex quadratic objective with affine constraints), and for separable data a strictly feasible point exists, so Slater's condition is satisfied. By the strong duality theorem for convex optimization, these conditions guarantee $p^* = d^*$.
The duality gap is $p^* - d^*$. For non-convex problems, this gap can be positive (weak duality: $d^* \leq p^*$). For SVM, strong duality guarantees a zero duality gap: $p^* = d^*$.
This means solving the dual gives exactly the same optimal value as solving the primal.
Practical Implications: solving the dual yields the primal solution via $\mathbf{w}^* = \sum_i \alpha_i^* y_i \mathbf{x}_i$, and the duality gap $p^* - d^*$ provides a principled stopping criterion for iterative solvers.
Verifying Strong Duality:
At optimum:
$$p^* = \frac{1}{2}\|\mathbf{w}^*\|^2 = \frac{1}{2}\left\|\sum_i \alpha_i^* y_i \mathbf{x}_i\right\|^2$$
$$d^* = \sum_i \alpha_i^* - \frac{1}{2}\sum_{i,j} \alpha_i^* \alpha_j^* y_i y_j (\mathbf{x}_i^\top\mathbf{x}_j)$$
Expanding and simplifying shows $p^* = d^*$. This equality serves as a numerical check: if the gap is non-zero (beyond numerical tolerance), something is wrong.
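A minimal numerical check of the zero duality gap, using a two-point toy problem whose solution $\boldsymbol{\alpha}^* = (0.5, 0.5)$ is known in closed form (the data and solution here are illustrative, not from the text):

```python
import numpy as np

# Two-point toy problem: x1 = (1, 0) with y = +1, x2 = (-1, 0) with y = -1.
# The optimal hyperplane is w* = (1, 0), b* = 0, with alpha* = (0.5, 0.5).
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
alpha_star = np.array([0.5, 0.5])

w_star = (alpha_star * y) @ X                    # w* = sum_i alpha_i y_i x_i
p_star = 0.5 * w_star @ w_star                   # primal value 1/2 ||w*||^2

Q = np.outer(y, y) * (X @ X.T)
d_star = alpha_star.sum() - 0.5 * alpha_star @ Q @ alpha_star   # dual value

print(p_star, d_star)                            # both 0.5
assert abs(p_star - d_star) < 1e-12, "nonzero duality gap indicates a bug"
```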
Complementary slackness conditions link the primal and dual solutions, revealing which training points are support vectors.
The Condition:
For each constraint $i$:
$$\alpha_i^* \left[ y_i(\mathbf{w}^{*\top}\mathbf{x}_i + b^*) - 1 \right] = 0$$
This says: for each $i$, either $\alpha_i^* = 0$ OR $y_i(\mathbf{w}^{*\top}\mathbf{x}_i + b^*) = 1$ (or both).
Interpretation:
Complementary slackness directly identifies support vectors:
Classification of Training Points:
| Condition | Type | Geometric Location |
|---|---|---|
| $\alpha_i^* > 0$ (constraint active) | Support vector | On a margin hyperplane: $y_i(\mathbf{w}^{*\top}\mathbf{x}_i + b^*) = 1$ |
| $\alpha_i^* = 0$ (constraint slack) | Non-SV | Strictly beyond the margin: $y_i(\mathbf{w}^{*\top}\mathbf{x}_i + b^*) > 1$ |
Why This Matters: only points with $\alpha_i^* > 0$ contribute to $\mathbf{w}^*$ and to predictions; non-support vectors can be removed from the training set without changing the solution.
Numerical Thresholding:
In practice, we use a tolerance $\epsilon$ (e.g., $10^{-6}$) to identify SVs: $$\alpha_i^* > \epsilon \implies \text{support vector}$$
This handles numerical precision issues where true zeros might be very small positive numbers.
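A small sketch of this thresholding step; the multiplier values below are hypothetical solver output:

```python
import numpy as np

def support_vector_indices(alpha, tol=1e-6):
    """Indices i with alpha_i > tol, treated as support vectors."""
    return np.flatnonzero(alpha > tol)

# Hypothetical multipliers returned by a QP solver; the tiny positive values
# are numerical noise, not true support vectors.
alpha_star = np.array([0.0, 3.2e-9, 0.41, 0.0, 0.17, 1.1e-8, 0.24])
print(support_vector_indices(alpha_star))    # -> [2 4 6]
```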
Support vectors are the cornerstone of SVM theory. Let's analyze them more deeply.
Number of Support Vectors:
For hard margin SVM with separable data, there is at least one support vector on each side of the boundary, and the solution is typically determined by only a small number of points (generically no more than $d+1$ lie on the margin hyperplanes).
What Determines Which Points Are SVs?
Points become support vectors only if their margin constraint is active, i.e., $y_i(\mathbf{w}^{*\top}\mathbf{x}_i + b^*) = 1$: they are the training points closest to the decision boundary.
The Role of SVs in Prediction:
$$f(\mathbf{x}) = \sum_{i \in SV} \alpha_i^* y_i (\mathbf{x}_i^\top\mathbf{x}) + b^*$$
Prediction requires only the support vectors $\{\mathbf{x}_i\}_{i \in SV}$, their labels $y_i$ and multipliers $\alpha_i^*$, and the bias $b^*$; all other training points can be discarded.
The key insight: the decision boundary is determined by a small subset of training points. This is fundamentally different from, say, logistic regression where all points influence the solution. SVM's sparsity enables: (1) efficient prediction, (2) model interpretability, (3) robustness to adding/removing non-SVs.
Support Vector Geometry:
Visualize in 2D with two classes: the support vectors of each class sit on their respective margin hyperplanes, equidistant from the decision boundary, while all remaining points lie strictly outside the margin band.
Balance Property:
Recall $\sum_i \alpha_i^* y_i = 0$. Among SVs only: $$\sum_{i \in SV^+} \alpha_i^* = \sum_{i \in SV^-} \alpha_i^*$$
where $SV^+$ and $SV^-$ are positive and negative SVs respectively. The 'total weight' from each class is equal.
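The balance property is easy to check numerically; in this sketch the multipliers and labels are hypothetical:

```python
import numpy as np

# Check that the total dual weight of positive-class SVs equals that of
# negative-class SVs (this is sum_i alpha_i y_i = 0 restricted to nonzero alphas).
alpha_star = np.array([0.5, 0.0, 0.3, 0.8, 0.0])      # hypothetical solution
y          = np.array([1.0, 1.0, 1.0, -1.0, -1.0])
sv = alpha_star > 1e-6
pos_weight = alpha_star[sv & (y > 0)].sum()            # 0.5 + 0.3 = 0.8
neg_weight = alpha_star[sv & (y < 0)].sum()            # 0.8
assert np.isclose(pos_weight, neg_weight)
```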
Typical SV Count:
Empirical rule of thumb: the fraction of training points that become support vectors is often small for well-separated data, and this fraction (in expectation) upper-bounds the leave-one-out error estimate.
The dual formulation's most profound implication is enabling the kernel trick—using nonlinear feature mappings without explicit computation.
Observation: Data Appears Only in Inner Products
In the dual, the training data appear only through inner products: the objective involves $\mathbf{x}_i^\top\mathbf{x}_j$ and the decision function involves $\mathbf{x}_i^\top\mathbf{x}$. We never need the coordinates of $\mathbf{x}_i$ themselves—only these inner products.
The Kernel Substitution:
A kernel function $k: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ computes inner products in some (possibly infinite-dimensional) feature space:
$$k(\mathbf{x}, \mathbf{z}) = \phi(\mathbf{x})^\top\phi(\mathbf{z})$$
where $\phi: \mathbb{R}^d \to \mathcal{H}$ maps to a higher-dimensional (or infinite) space.
Replace $\mathbf{x}_i^\top\mathbf{x}_j$ with $k(\mathbf{x}_i, \mathbf{x}_j)$ everywhere:
Dual with kernels: $$\max_\alpha \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j k(\mathbf{x}_i, \mathbf{x}_j)$$
Prediction with kernels: $$f(\mathbf{x}) = \sum_{i \in SV} \alpha_i y_i k(\mathbf{x}_i, \mathbf{x}) + b$$
We operate implicitly in a high-dimensional space without computing $\phi(\mathbf{x})$!
Common Kernels:
| Kernel | Formula | Feature Space |
|---|---|---|
| Linear | $\mathbf{x}^\top\mathbf{z}$ | Original space |
| Polynomial | $(\mathbf{x}^\top\mathbf{z} + c)^p$ | Polynomial features |
| Gaussian RBF | $\exp(-\gamma\|\mathbf{x} - \mathbf{z}\|^2)$ | Infinite-dimensional |
| Sigmoid | $\tanh(\gamma \mathbf{x}^\top\mathbf{z} + c)$ | Neural network-like |
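For concreteness, here are minimal implementations of the first three kernels in the table; the hyperparameter values gamma, c, and degree are illustrative defaults, not prescribed by the text:

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, c=1.0, degree=3):
    return (x @ z + c) ** degree

def rbf_kernel(x, z, gamma=0.5):
    diff = x - z
    return np.exp(-gamma * (diff @ diff))

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(linear_kernel(x, z), polynomial_kernel(x, z), rbf_kernel(x, z))
```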
Why the Primal Can't Do This:
In the primal, we minimize $\frac{1}{2}\|\mathbf{w}\|^2$ where $\mathbf{w}$ lives in the feature space. If that space is infinite-dimensional (e.g., with the Gaussian kernel), we can't represent $\mathbf{w}$ explicitly! The dual, via the representer theorem, gives:
$$\mathbf{w}^* = \sum_i \alpha_i^* y_i \phi(\mathbf{x}_i)$$
We never need $\mathbf{w}^*$ explicitly; we only need $\mathbf{w}^{*\top}\phi(\mathbf{x}) = \sum_i \alpha_i^* y_i k(\mathbf{x}_i, \mathbf{x})$.
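A sketch of the resulting decision function; the "trained model" values below are illustrative, taken from the two-point toy problem used earlier, and any kernel callable can be substituted for the default inner product:

```python
import numpy as np

def decision_function(x, sv_X, sv_y, sv_alpha, b, kernel=np.dot):
    """f(x) = sum_{i in SV} alpha_i y_i k(x_i, x) + b."""
    return sum(a * yi * kernel(xi, x)
               for a, yi, xi in zip(sv_alpha, sv_y, sv_X)) + b

# Toy trained model (illustrative values, not from a real fit):
sv_X     = np.array([[1.0, 0.0], [-1.0, 0.0]])
sv_y     = np.array([1.0, -1.0])
sv_alpha = np.array([0.5, 0.5])
b        = 0.0

print(decision_function(np.array([2.0, 3.0]), sv_X, sv_y, sv_alpha, b))  # +2.0
```

With the default inner product this reproduces the linear prediction rule; passing an RBF kernel makes the same code operate implicitly in an infinite-dimensional feature space.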
The kernel trick is the primary reason the dual formulation is so important in practice. Without the dual, we'd be restricted to linear SVMs in the original feature space. The dual unlocks powerful nonlinear classifiers that work in implicitly high-dimensional (even infinite) spaces.
The dual QP has specific structure that enables specialized solution methods.
Standard QP Form:
Converting to minimization: $$\begin{aligned} \min_{\boldsymbol{\alpha}} \quad & \frac{1}{2}\boldsymbol{\alpha}^\top\mathbf{Q}\boldsymbol{\alpha} - \mathbf{1}^\top\boldsymbol{\alpha} \\ \text{s.t.} \quad & \boldsymbol{\alpha} \geq \mathbf{0} \\ & \mathbf{y}^\top\boldsymbol{\alpha} = 0 \end{aligned}$$
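As an illustration, this sketch hands the standard-form dual QP to a general-purpose interior-point solver. It assumes cvxopt is installed, uses toy data, and is not how production SVM solvers work (they use the decomposition methods discussed next):

```python
import numpy as np
from cvxopt import matrix, solvers   # assumes cvxopt is installed

# Toy separable data (same convention as above).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

Q = np.outer(y, y) * (X @ X.T)

# cvxopt solves: min 1/2 a^T P a + q^T a  s.t.  G a <= h,  A a = b.
P = matrix(Q)
q = matrix(-np.ones(n))
G = matrix(-np.eye(n))          # -alpha <= 0  <=>  alpha >= 0
h = matrix(np.zeros(n))
A = matrix(y.reshape(1, -1))    # equality constraint y^T alpha = 0
b = matrix(0.0)

solvers.options['show_progress'] = False
sol = solvers.qp(P, q, G, h, A, b)
alpha = np.ravel(sol['x'])
print(alpha)                    # nonzero entries correspond to support vectors
```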
Approaches:
1. General QP Solvers (Interior Point): off-the-shelf solvers (as in the sketch above) work for moderate $n$, but they form and factor the dense $n \times n$ matrix $\mathbf{Q}$ and therefore scale poorly.
2. Decomposition Methods
Sequential Minimal Optimization (SMO) is the most famous SVM solver. It's a decomposition method that optimizes pairs of variables at a time:
SMO is detailed in the SVM Optimization module; it's the method behind LIBSVM.
3. Coordinate Descent: update one $\alpha_i$ (or a small block) at a time; dual coordinate descent is particularly effective for linear SVMs.
4. Gradient Methods: projected gradient ascent on the dual; simple to implement but usually slower to converge than decomposition methods.
Memory Management:
| Challenge | Solution |
|---|---|
| $\mathbf{Q}$ is $n \times n$ | Compute $Q_{ij}$ on-demand |
| Kernel matrix storage | Cache frequently used entries |
| Many $Q_{ij}$ needed | LRU cache, shrinking heuristics |
Practical Libraries: LIBSVM (SMO-based, supports kernels), LIBLINEAR (linear SVMs at scale), and scikit-learn's SVC / LinearSVC, which wrap them.
When working with the dual formulation in practice, several considerations arise.
Computing the Kernel Matrix: forming $\mathbf{Q}$ requires $O(n^2)$ kernel evaluations and $O(n^2)$ memory, which is usually the dominant cost for large $n$.
Caching Strategies:
For large $n$, the full $\mathbf{Q}$ cannot be stored: compute entries on demand, keep an LRU cache of recently used kernel rows, and apply shrinking heuristics so that entries for variables that appear fixed are not recomputed.
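A sketch of on-demand kernel row computation with a bounded LRU cache; the data, gamma, and cache size are illustrative, and real implementations such as LIBSVM use more elaborate caching and shrinking:

```python
import numpy as np
from functools import lru_cache

# Toy setup: n = 1000 points, RBF kernel with an illustrative gamma.
X = np.random.default_rng(0).normal(size=(1000, 20))
gamma = 0.1

@lru_cache(maxsize=256)                   # keep at most 256 kernel rows in memory
def kernel_row(i):
    """Row i of the kernel matrix: K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq_dists = ((X - X[i]) ** 2).sum(axis=1)
    return np.exp(-gamma * sq_dists)

row = kernel_row(42)                       # computed once, then served from cache
print(row.shape)                           # (1000,)
```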
Numerical Issues:
Computing $b^*$:
After solving for $\boldsymbol{\alpha}^*$, compute $b^*$ using SVs:
$$b^* = \frac{1}{|SV|} \sum_{k \in SV} \left( y_k - \sum_{i \in SV} \alpha_i^* y_i k(\mathbf{x}_i, \mathbf{x}_k) \right)$$
Averaging over SVs improves numerical stability.
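A sketch of this averaging computation, reusing the two-point toy problem from earlier (the inputs $\boldsymbol{\alpha}^*$, X, and y are illustrative):

```python
import numpy as np

def intercept(alpha, X, y, kernel=np.dot, tol=1e-6):
    """Average over SVs: b = mean_k ( y_k - sum_i alpha_i y_i k(x_i, x_k) )."""
    sv = np.flatnonzero(alpha > tol)
    b_vals = [y[k] - sum(alpha[i] * y[i] * kernel(X[i], X[k]) for i in sv)
              for k in sv]
    return np.mean(b_vals)

# Two-point toy problem with alpha* = (0.5, 0.5) and a linear kernel:
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
print(intercept(np.array([0.5, 0.5]), X, y))   # -> 0.0
```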
Validation:
After solving, verify that all $\alpha_i^* \geq 0$, that $\sum_i \alpha_i^* y_i \approx 0$, that the complementary slackness conditions hold within tolerance, and that the duality gap is numerically zero.
The dual formulation is central to SVM theory and practice. To consolidate: the dual is a concave QP in $n$ multipliers; strong duality guarantees it attains the same optimal value as the primal; complementary slackness identifies the support vectors; and because the data enter only through inner products, the dual is the natural home for kernels.
The final page of this module covers the KKT conditions in detail—the complete set of necessary and sufficient conditions for optimality. We'll see how KKT encapsulates everything we've learned: stationarity, feasibility, complementary slackness, and dual feasibility. Understanding KKT completes the theoretical foundation and provides the basis for analyzing optimization algorithms.
The Complete Picture:
$$\text{Primal} \xrightarrow{\text{Lagrangian}} \text{Derive Dual} \xrightarrow{\text{Structure}} \text{Analyze Properties} \xrightarrow{\text{Kernel Trick}} \text{Nonlinear SVM}$$
The dual is not just an alternative—it's the natural home for kernel methods and the perspective that reveals SVM's sparsity. Every modern SVM implementation is built on the dual formulation.