In the previous page, we derived the SVM dual problem through Lagrangian methods. Now we examine this dual problem in detail—understanding its structure, why it's often preferred over the primal, and how it connects to the key concepts of support vectors and kernels.
The dual perspective is not merely an alternative computational approach; it reveals fundamental aspects of the SVM solution that are hidden in the primal formulation. Most importantly, the dual shows that the solution depends only on a sparse subset of training points (the support vectors) and that the data enter only through inner products, which is what makes the kernel trick possible.
This page provides the deep understanding of the dual that underlies modern SVM implementations and kernel methods.
By the end of this page, you will: (1) fully understand the dual problem structure, (2) compare primal vs dual in terms of complexity and applicability, (3) understand strong duality and its implications, (4) see how kernels naturally emerge, and (5) analyze support vector identification through dual variables.
Let's restate the dual problem derived in the previous page and examine each component carefully.
$$\begin{aligned} \max_{\boldsymbol{\alpha} \in \mathbb{R}^n} \quad & W(\boldsymbol{\alpha}) = \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i \alpha_j y_i y_j (\mathbf{x}_i^\top\mathbf{x}_j) \\ \text{subject to} \quad & \alpha_i \geq 0, \quad i = 1,\ldots,n \\ & \sum_{i=1}^n \alpha_i y_i = 0 \end{aligned}$$
Matrix-Vector Form:
Define the matrix $\mathbf{Q} \in \mathbb{R}^{n \times n}$ with entries $Q_{ij} = y_i y_j (\mathbf{x}_i^\top\mathbf{x}_j)$, and let $\boldsymbol{\alpha} = (\alpha_1,\ldots,\alpha_n)^\top$, $\mathbf{y} = (y_1,\ldots,y_n)^\top$, and $\mathbf{1}$ be the all-ones vector in $\mathbb{R}^n$. Then:
$$\begin{aligned} \max_{\boldsymbol{\alpha}} \quad & W(\boldsymbol{\alpha}) = \mathbf{1}^\top\boldsymbol{\alpha} - \frac{1}{2}\boldsymbol{\alpha}^\top\mathbf{Q}\boldsymbol{\alpha} \\ \text{s.t.} \quad & \boldsymbol{\alpha} \geq \mathbf{0} \\ & \mathbf{y}^\top\boldsymbol{\alpha} = 0 \end{aligned}$$
| Component | Expression | Interpretation |
|---|---|---|
| Variables | $\boldsymbol{\alpha} \in \mathbb{R}^n$ | One multiplier per training example |
| Linear term | $\mathbf{1}^\top\boldsymbol{\alpha}$ | Encourages non-zero multipliers |
| Quadratic term | $\boldsymbol{\alpha}^\top\mathbf{Q}\boldsymbol{\alpha}$ | Penalizes similar examples with large weights |
| Nonnegativity constraints | $\alpha_i \geq 0$ | Dual feasibility (from Lagrangian) |
| Equality constraint | $\mathbf{y}^\top\boldsymbol{\alpha} = 0$ | From stationarity wrt bias $b$ |
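To make the matrix-vector form concrete, here is a minimal NumPy sketch that builds $\mathbf{Q}$ and evaluates $W(\boldsymbol{\alpha})$. The toy data and the trial $\boldsymbol{\alpha}$ are illustrative, not taken from the text.

```python
import numpy as np

# Toy data: n = 4 points in d = 2 dimensions, labels in {-1, +1} (illustrative).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Q_ij = y_i y_j <x_i, x_j>: element-wise product of the label outer product
# with the Gram matrix of the inputs.
Q = np.outer(y, y) * (X @ X.T)

def dual_objective(alpha, Q):
    """W(alpha) = 1^T alpha - (1/2) alpha^T Q alpha."""
    return alpha.sum() - 0.5 * alpha @ Q @ alpha

# A feasible trial point: alpha >= 0 and sum_i alpha_i y_i = 0.
alpha = np.array([0.1, 0.1, 0.1, 0.1])
print(dual_objective(alpha, Q))
```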
The dual problem has several important structural properties that make it amenable to efficient solution.
1. Concavity of the Objective:
The dual objective $W(\boldsymbol{\alpha}) = \mathbf{1}^\top\boldsymbol{\alpha} - \frac{1}{2}\boldsymbol{\alpha}^\top\mathbf{Q}\boldsymbol{\alpha}$ is a concave quadratic function because $\mathbf{Q}$ is positive semidefinite: writing $\mathbf{z}_i = y_i\mathbf{x}_i$, we have $Q_{ij} = \mathbf{z}_i^\top\mathbf{z}_j$, so $\mathbf{Q}$ is a Gram matrix and $\boldsymbol{\alpha}^\top\mathbf{Q}\boldsymbol{\alpha} = \left\|\sum_i \alpha_i \mathbf{z}_i\right\|^2 \geq 0$. The quadratic term thus enters with a negative sign, and the linear term is affine.
Maximizing a concave function is equivalent to minimizing a convex function, so the dual is a convex optimization problem.
2. Convexity of the Feasible Region:
The constraints define a convex polyhedron:
$$\mathcal{F}_D = \{\boldsymbol{\alpha} : \boldsymbol{\alpha} \geq \mathbf{0},\ \mathbf{y}^\top\boldsymbol{\alpha} = 0\}$$
The dual is a concave QP (equivalently, a convex minimization QP). This guarantees that every local maximum is a global maximum and that standard convex QP solvers converge to it reliably.
3. Boundedness:
The feasible region $\mathcal{F}_D$ is unbounded (it extends to infinity in certain directions). However, for linearly separable data the objective is bounded above on $\mathcal{F}_D$, so the maximum is finite and attained.
4. Uniqueness:
The optimal $\boldsymbol{\alpha}^*$ is not always unique because $\mathbf{Q}$ may be only PSD (not strictly positive definite). However, the resulting primal solution $\mathbf{w}^* = \sum_i \alpha_i^* y_i \mathbf{x}_i$ is unique, so every dual optimum yields the same classifier.
5. Constraint Qualification:
Slater's condition requires a strictly feasible point (satisfies all inequalities strictly). For SVM dual with $n > 1$ examples from both classes, such a point exists (e.g., small $\alpha_i$ summing appropriately), so strong duality holds.
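As a numerical sanity check on property 1, the following sketch (same toy data as above) verifies that $\mathbf{Q}$ is positive semidefinite by inspecting its eigenvalues:

```python
import numpy as np

# Rebuild Q for the toy data above. Writing z_i = y_i * x_i, Q = Z Z^T is a
# Gram matrix, so it must be positive semidefinite (all eigenvalues >= 0).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
Z = y[:, None] * X                      # rows are y_i * x_i
Q = Z @ Z.T                             # same as np.outer(y, y) * (X @ X.T)

eigvals = np.linalg.eigvalsh(Q)
print(eigvals)                          # all >= 0 up to round-off
assert eigvals.min() >= -1e-10
```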
Understanding when to solve the primal versus the dual is crucial for practical SVM implementation.
Dimension Analysis:
| Aspect | Primal | Dual |
|---|---|---|
| Variables | $d + 1$ (features + bias) | $n$ (examples) |
| Inequality constraints | $n$ | $n$ (nonnegativity) |
| Equality constraints | 0 | 1 |
| QP matrix size | $(d+1) \times (d+1)$ | $n \times n$ |
When Primal Is Preferred:
Use the primal when $d < n$ (fewer features than examples) and the kernel is linear. For very large $n$ with linear kernels, specialized primal methods (e.g., LIBLINEAR) are fastest.
When Dual Is Preferred:
Use the dual when $d \geq n$, when using kernels (the kernel trick requires the dual), or when you need explicit support vector identification.
Complexity Comparison:
| Operation | Primal | Dual |
|---|---|---|
| QP matrix formation | $O(nd)$ | $O(n^2d)$ |
| QP solution (IP) | $O((d+1)^3 \sqrt{n})$ | $O(n^3)$ |
| Kernel evaluation | N/A | Already in $\mathbf{Q}$ |
| Prediction (linear) | $O(d)$ | $O(n_{SV} \cdot d)$ |
| Prediction (kernel) | N/A | $O(n_{SV})$ kernel evals |
| Scenario | Recommended | Reasoning |
|---|---|---|
| $d = 10$, $n = 100,000$, linear | Primal | Many examples, few features |
| $d = 10,000$, $n = 1,000$, linear | Either | Moderate dimensions both ways |
| $d = $ any, kernel SVM | Dual | Kernel trick requires dual |
| Text data, $d = 50,000$, $n = 5,000$ | Dual | High $d$, kernel if nonlinear |
| Images, $d = 10^6$, $n = 10,000$ | Dual + kernel | Explicit features impractical |
Strong duality is the remarkable result that for SVM, the primal and dual have the same optimal value. This is not true for all optimization problems and is crucial for SVM theory.
Statement:
Let $p^* = \frac{1}{2}\|\mathbf{w}^*\|^2$ be the optimal primal value and $d^* = W(\boldsymbol{\alpha}^*)$ be the optimal dual value. Then:
$$p^* = d^*$$
Why Strong Duality Holds:
The primal is a convex problem (convex quadratic objective with affine constraints), and for separable data a strictly feasible point exists, so Slater's condition is satisfied. By the strong duality theorem for convex optimization, these conditions guarantee $p^* = d^*$.
The duality gap is $p^* - d^*$. For non-convex problems, this gap can be positive (weak duality: $d^* \leq p^*$). For SVM, strong duality guarantees a zero duality gap: $p^* = d^*$.
This means solving the dual gives exactly the same optimal value as solving the primal.
Practical Implications: solving the dual yields the primal solution via $\mathbf{w}^* = \sum_i \alpha_i^* y_i \mathbf{x}_i$, and the duality gap $p^* - d^*$ provides a principled stopping criterion for iterative solvers.
Verifying Strong Duality:
At optimum:
$$p^* = \frac{1}{2}\|\mathbf{w}^*\|^2 = \frac{1}{2}\left\|\sum_i \alpha_i^* y_i \mathbf{x}_i\right\|^2$$
$$d^* = \sum_i \alpha_i^* - \frac{1}{2}\sum_{i,j} \alpha_i^* \alpha_j^* y_i y_j (\mathbf{x}_i^\top\mathbf{x}_j)$$
Expanding and simplifying shows $p^* = d^*$. This equality serves as a numerical check: if the gap is non-zero (beyond numerical tolerance), something is wrong.
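A minimal numerical check of the zero duality gap, using a two-point toy problem whose solution $\boldsymbol{\alpha}^* = (0.5, 0.5)$ is known in closed form (the data and solution here are illustrative, not from the text):

```python
import numpy as np

# Two-point toy problem: x1 = (1, 0) with y = +1, x2 = (-1, 0) with y = -1.
# The optimal hyperplane is w* = (1, 0), b* = 0, with alpha* = (0.5, 0.5).
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
alpha_star = np.array([0.5, 0.5])

w_star = (alpha_star * y) @ X                    # w* = sum_i alpha_i y_i x_i
p_star = 0.5 * w_star @ w_star                   # primal value 1/2 ||w*||^2

Q = np.outer(y, y) * (X @ X.T)
d_star = alpha_star.sum() - 0.5 * alpha_star @ Q @ alpha_star   # dual value

print(p_star, d_star)                            # both 0.5
assert abs(p_star - d_star) < 1e-12, "nonzero duality gap indicates a bug"
```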
Complementary slackness conditions link the primal and dual solutions, revealing which training points are support vectors.
The Condition:
For each constraint $i$:
$$\alpha_i^* \left[ y_i(\mathbf{w}^{*\top}\mathbf{x}_i + b^*) - 1 \right] = 0$$
This says: for each $i$, either $\alpha_i^* = 0$ OR $y_i(\mathbf{w}^{*\top}\mathbf{x}_i + b^*) = 1$ (or both).
Interpretation:
Complementary slackness directly identifies support vectors:
Classification of Training Points:
| Condition | Type | Geometric Location |
|---|---|---|
| $\alpha_i^* > 0$ (constraint active) | Support vector | On a margin hyperplane: $y_i(\mathbf{w}^{*\top}\mathbf{x}_i + b^*) = 1$ |
| $\alpha_i^* = 0$ (constraint slack) | Non-SV | Strictly beyond the margin: $y_i(\mathbf{w}^{*\top}\mathbf{x}_i + b^*) > 1$ |
Why This Matters: only points with $\alpha_i^* > 0$ contribute to $\mathbf{w}^*$ and to predictions; non-support vectors can be removed from the training set without changing the solution.
Numerical Thresholding:
In practice, we use a tolerance $\epsilon$ (e.g., $10^{-6}$) to identify SVs: $$\alpha_i^* > \epsilon \implies \text{support vector}$$
This handles numerical precision issues where true zeros might be very small positive numbers.
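A small sketch of this thresholding step; the multiplier values below are hypothetical solver output:

```python
import numpy as np

def support_vector_indices(alpha, tol=1e-6):
    """Indices i with alpha_i > tol, treated as support vectors."""
    return np.flatnonzero(alpha > tol)

# Hypothetical multipliers returned by a QP solver; the tiny positive values
# are numerical noise, not true support vectors.
alpha_star = np.array([0.0, 3.2e-9, 0.41, 0.0, 0.17, 1.1e-8, 0.24])
print(support_vector_indices(alpha_star))    # -> [2 4 6]
```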
Support vectors are the cornerstone of SVM theory. Let's analyze them more deeply.
Number of Support Vectors:
For hard margin SVM with separable data, there is at least one support vector on each side of the boundary, and the solution is typically determined by only a small number of points (generically no more than $d+1$ lie on the margin hyperplanes).
What Determines Which Points Are SVs?
Points become support vectors only if their margin constraint is active, i.e., $y_i(\mathbf{w}^{*\top}\mathbf{x}_i + b^*) = 1$: they are the training points closest to the decision boundary.
The Role of SVs in Prediction:
$$f(\mathbf{x}) = \sum_{i \in SV} \alpha_i^* y_i (\mathbf{x}_i^\top\mathbf{x}) + b^*$$
Prediction requires only the support vectors $\{\mathbf{x}_i\}_{i \in SV}$, their labels $y_i$ and multipliers $\alpha_i^*$, and the bias $b^*$; all other training points can be discarded.
The key insight: the decision boundary is determined by a small subset of training points. This is fundamentally different from, say, logistic regression where all points influence the solution. SVM's sparsity enables: (1) efficient prediction, (2) model interpretability, (3) robustness to adding/removing non-SVs.
Support Vector Geometry:
Visualize in 2D with two classes: the support vectors of each class sit on their respective margin hyperplanes, equidistant from the decision boundary, while all remaining points lie strictly outside the margin band.
Balance Property:
Recall $\sum_i \alpha_i^* y_i = 0$. Among SVs only: $$\sum_{i \in SV^+} \alpha_i^* = \sum_{i \in SV^-} \alpha_i^*$$
where $SV^+$ and $SV^-$ are positive and negative SVs respectively. The 'total weight' from each class is equal.
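The balance property is easy to check numerically; in this sketch the multipliers and labels are hypothetical:

```python
import numpy as np

# Check that the total dual weight of positive-class SVs equals that of
# negative-class SVs (this is sum_i alpha_i y_i = 0 restricted to nonzero alphas).
alpha_star = np.array([0.5, 0.0, 0.3, 0.8, 0.0])      # hypothetical solution
y          = np.array([1.0, 1.0, 1.0, -1.0, -1.0])
sv = alpha_star > 1e-6
pos_weight = alpha_star[sv & (y > 0)].sum()            # 0.5 + 0.3 = 0.8
neg_weight = alpha_star[sv & (y < 0)].sum()            # 0.8
assert np.isclose(pos_weight, neg_weight)
```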
Typical SV Count:
Empirical rule of thumb: the fraction of training points that become support vectors is often small for well-separated data, and this fraction (in expectation) upper-bounds the leave-one-out error estimate.
The dual formulation's most profound implication is enabling the kernel trick—using nonlinear feature mappings without explicit computation.
Observation: Data Appears Only in Inner Products
In the dual, the training data appear only through inner products: the objective involves $\mathbf{x}_i^\top\mathbf{x}_j$ and the decision function involves $\mathbf{x}_i^\top\mathbf{x}$. We never need the coordinates of $\mathbf{x}_i$ themselves—only these inner products.
The Kernel Substitution:
A kernel function $k: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ computes inner products in some (possibly infinite-dimensional) feature space:
$$k(\mathbf{x}, \mathbf{z}) = \phi(\mathbf{x})^\top\phi(\mathbf{z})$$
where $\phi: \mathbb{R}^d \to \mathcal{H}$ maps to a higher-dimensional (or infinite) space.
Replace $\mathbf{x}_i^\top\mathbf{x}_j$ with $k(\mathbf{x}_i, \mathbf{x}_j)$ everywhere:
Dual with kernels: $$\max_\alpha \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j k(\mathbf{x}_i, \mathbf{x}_j)$$
Prediction with kernels: $$f(\mathbf{x}) = \sum_{i \in SV} \alpha_i y_i k(\mathbf{x}_i, \mathbf{x}) + b$$
We operate implicitly in a high-dimensional space without computing $\phi(\mathbf{x})$!
Common Kernels:
| Kernel | Formula | Feature Space |
|---|---|---|
| Linear | $\mathbf{x}^\top\mathbf{z}$ | Original space |
| Polynomial | $(\mathbf{x}^\top\mathbf{z} + c)^p$ | Polynomial features |
| Gaussian RBF | $\exp(-\gamma\|\mathbf{x} - \mathbf{z}\|^2)$ | Infinite-dimensional |
| Sigmoid | $\tanh(\gamma \mathbf{x}^\top\mathbf{z} + c)$ | Neural network-like |
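For concreteness, here are minimal implementations of the first three kernels in the table; the hyperparameter values gamma, c, and degree are illustrative defaults, not prescribed by the text:

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, c=1.0, degree=3):
    return (x @ z + c) ** degree

def rbf_kernel(x, z, gamma=0.5):
    diff = x - z
    return np.exp(-gamma * (diff @ diff))

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(linear_kernel(x, z), polynomial_kernel(x, z), rbf_kernel(x, z))
```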
Why the Primal Can't Do This:
In the primal, we minimize $\frac{1}{2}\|\mathbf{w}\|^2$ where $\mathbf{w}$ lives in the feature space. If that space is infinite-dimensional (e.g., with the Gaussian kernel), we can't represent $\mathbf{w}$ explicitly! The dual, via the representer theorem, gives:
$$\mathbf{w}^* = \sum_i \alpha_i^* y_i \phi(\mathbf{x}_i)$$
We never need $\mathbf{w}^*$ explicitly; we only need $\mathbf{w}^{*\top}\phi(\mathbf{x}) = \sum_i \alpha_i^* y_i k(\mathbf{x}_i, \mathbf{x})$.
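A sketch of the resulting decision function; the "trained model" values below are illustrative, taken from the two-point toy problem used earlier, and any kernel callable can be substituted for the default inner product:

```python
import numpy as np

def decision_function(x, sv_X, sv_y, sv_alpha, b, kernel=np.dot):
    """f(x) = sum_{i in SV} alpha_i y_i k(x_i, x) + b."""
    return sum(a * yi * kernel(xi, x)
               for a, yi, xi in zip(sv_alpha, sv_y, sv_X)) + b

# Toy trained model (illustrative values, not from a real fit):
sv_X     = np.array([[1.0, 0.0], [-1.0, 0.0]])
sv_y     = np.array([1.0, -1.0])
sv_alpha = np.array([0.5, 0.5])
b        = 0.0

print(decision_function(np.array([2.0, 3.0]), sv_X, sv_y, sv_alpha, b))  # +2.0
```

With the default inner product this reproduces the linear prediction rule; passing an RBF kernel makes the same code operate implicitly in an infinite-dimensional feature space.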
The kernel trick is the primary reason the dual formulation is so important in practice. Without the dual, we'd be restricted to linear SVMs in the original feature space. The dual unlocks powerful nonlinear classifiers that work in implicitly high-dimensional (even infinite) spaces.
The dual QP has specific structure that enables specialized solution methods.
Standard QP Form:
Converting to minimization: $$\begin{aligned} \min_{\boldsymbol{\alpha}} \quad & \frac{1}{2}\boldsymbol{\alpha}^\top\mathbf{Q}\boldsymbol{\alpha} - \mathbf{1}^\top\boldsymbol{\alpha} \\ \text{s.t.} \quad & \boldsymbol{\alpha} \geq \mathbf{0} \\ & \mathbf{y}^\top\boldsymbol{\alpha} = 0 \end{aligned}$$
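As an illustration, this sketch hands the standard-form dual QP to a general-purpose interior-point solver. It assumes cvxopt is installed, uses toy data, and is not how production SVM solvers work (they use the decomposition methods discussed next):

```python
import numpy as np
from cvxopt import matrix, solvers   # assumes cvxopt is installed

# Toy separable data (same convention as above).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

Q = np.outer(y, y) * (X @ X.T)

# cvxopt solves: min 1/2 a^T P a + q^T a  s.t.  G a <= h,  A a = b.
P = matrix(Q)
q = matrix(-np.ones(n))
G = matrix(-np.eye(n))          # -alpha <= 0  <=>  alpha >= 0
h = matrix(np.zeros(n))
A = matrix(y.reshape(1, -1))    # equality constraint y^T alpha = 0
b = matrix(0.0)

solvers.options['show_progress'] = False
sol = solvers.qp(P, q, G, h, A, b)
alpha = np.ravel(sol['x'])
print(alpha)                    # nonzero entries correspond to support vectors
```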
Approaches:
1. General QP Solvers (Interior Point): off-the-shelf solvers (as in the sketch above) work for moderate $n$, but they form and factor the dense $n \times n$ matrix $\mathbf{Q}$ and therefore scale poorly.
2. Decomposition Methods
Sequential Minimal Optimization (SMO) is the most famous SVM solver. It's a decomposition method that optimizes pairs of variables at a time:
SMO is detailed in the SVM Optimization module; it's the method behind LIBSVM.
3. Coordinate Descent: update one $\alpha_i$ (or a small block) at a time; dual coordinate descent is particularly effective for linear SVMs.
4. Gradient Methods: projected gradient ascent on the dual; simple to implement but usually slower to converge than decomposition methods.
Memory Management:
| Challenge | Solution |
|---|---|
| $\mathbf{Q}$ is $n \times n$ | Compute $Q_{ij}$ on-demand |
| Kernel matrix storage | Cache frequently used entries |
| Many $Q_{ij}$ needed | LRU cache, shrinking heuristics |
Practical Libraries: LIBSVM (SMO-based, supports kernels), LIBLINEAR (linear SVMs at scale), and scikit-learn's SVC / LinearSVC, which wrap them.
When working with the dual formulation in practice, several considerations arise.
Computing the Kernel Matrix: forming $\mathbf{Q}$ requires $O(n^2)$ kernel evaluations and $O(n^2)$ memory, which is usually the dominant cost for large $n$.
Caching Strategies:
For large $n$, the full $\mathbf{Q}$ cannot be stored: compute entries on demand, keep an LRU cache of recently used kernel rows, and apply shrinking heuristics so that entries for variables that appear fixed are not recomputed.
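A sketch of on-demand kernel row computation with a bounded LRU cache; the data, gamma, and cache size are illustrative, and real implementations such as LIBSVM use more elaborate caching and shrinking:

```python
import numpy as np
from functools import lru_cache

# Toy setup: n = 1000 points, RBF kernel with an illustrative gamma.
X = np.random.default_rng(0).normal(size=(1000, 20))
gamma = 0.1

@lru_cache(maxsize=256)                   # keep at most 256 kernel rows in memory
def kernel_row(i):
    """Row i of the kernel matrix: K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq_dists = ((X - X[i]) ** 2).sum(axis=1)
    return np.exp(-gamma * sq_dists)

row = kernel_row(42)                       # computed once, then served from cache
print(row.shape)                           # (1000,)
```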
Numerical Issues:
Computing $b^*$:
After solving for $\boldsymbol{\alpha}^*$, compute $b^*$ using SVs:
$$b^* = \frac{1}{|SV|} \sum_{k \in SV} \left( y_k - \sum_{i \in SV} \alpha_i^* y_i k(\mathbf{x}_i, \mathbf{x}_k) \right)$$
Averaging over SVs improves numerical stability.
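A sketch of this averaging computation, reusing the two-point toy problem from earlier (the inputs $\boldsymbol{\alpha}^*$, X, and y are illustrative):

```python
import numpy as np

def intercept(alpha, X, y, kernel=np.dot, tol=1e-6):
    """Average over SVs: b = mean_k ( y_k - sum_i alpha_i y_i k(x_i, x_k) )."""
    sv = np.flatnonzero(alpha > tol)
    b_vals = [y[k] - sum(alpha[i] * y[i] * kernel(X[i], X[k]) for i in sv)
              for k in sv]
    return np.mean(b_vals)

# Two-point toy problem with alpha* = (0.5, 0.5) and a linear kernel:
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
print(intercept(np.array([0.5, 0.5]), X, y))   # -> 0.0
```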
Validation:
After solving, verify that all $\alpha_i^* \geq 0$, that $\sum_i \alpha_i^* y_i \approx 0$, that the complementary slackness conditions hold within tolerance, and that the duality gap is numerically zero.
The dual formulation is central to SVM theory and practice. To consolidate: the dual is a concave QP in $n$ multipliers; strong duality guarantees it attains the same optimal value as the primal; complementary slackness identifies the support vectors; and because the data enter only through inner products, the dual is the natural home for kernels.
The final page of this module covers the KKT conditions in detail—the complete set of necessary and sufficient conditions for optimality. We'll see how KKT encapsulates everything we've learned: stationarity, feasibility, complementary slackness, and dual feasibility. Understanding KKT completes the theoretical foundation and provides the basis for analyzing optimization algorithms.
The Complete Picture:
$$\text{Primal} \xrightarrow{\text{Lagrangian}} \text{Derive Dual} \xrightarrow{\text{Structure}} \text{Analyze Properties} \xrightarrow{\text{Kernel Trick}} \text{Nonlinear SVM}$$
The dual is not just an alternative—it's the natural home for kernel methods and the perspective that reveals SVM's sparsity. Every modern SVM implementation is built on the dual formulation.