The primal formulation of soft margin SVM—minimizing $\frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i\xi_i$ subject to margin constraints—is perfectly valid and can be solved directly. So why introduce an alternative dual formulation?
The dual offers three profound advantages:
Kernel trick enabling: The dual depends on training points only through their inner products $\mathbf{x}_i^\top\mathbf{x}_j$. This allows replacing inner products with kernel functions, enabling nonlinear decision boundaries in high (even infinite) dimensional spaces.
Sparse representation: The optimal solution depends only on support vectors—points with $\alpha_i > 0$. Predictions require computing inner products with only these points, not all training data.
Optimization efficiency: For many problems, particularly with kernels, the dual is computationally preferable. The number of variables is $n$ (samples) regardless of dimensionality.
The dual formulation transforms SVM from a geometric margin maximizer into an elegant optimization over point importances, revealing deep mathematical structure.
This page derives the dual formulation of soft margin SVM from first principles. You will understand Lagrangian duality, KKT conditions, the dual optimization problem, and how the solution relates back to the primal. This mathematical framework is essential for kernel SVMs and understanding SVM optimization algorithms.
Before diving into SVM specifically, let's review the general framework of Lagrangian duality for constrained optimization.
Primal problem (general form):
$$\min_{\mathbf{x}} f(\mathbf{x})$$ $$\text{s.t. } g_i(\mathbf{x}) \leq 0, \quad i = 1, \ldots, m$$ $$\quad\quad h_j(\mathbf{x}) = 0, \quad j = 1, \ldots, p$$
The Lagrangian:
$$L(\mathbf{x}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = f(\mathbf{x}) + \sum_{i=1}^m \alpha_i g_i(\mathbf{x}) + \sum_{j=1}^p \beta_j h_j(\mathbf{x})$$
where $\alpha_i \geq 0$ (for inequality constraints) and $\beta_j$ (unrestricted, for equality constraints) are Lagrange multipliers.
Each Lagrange multiplier measures the 'price' of its constraint. αᵢ indicates how much the optimal value would improve if constraint gᵢ were relaxed slightly. At inactive constraints (gᵢ < 0 at optimum), αᵢ = 0—relaxing an already-satisfied constraint doesn't help.
The Lagrange dual function:
$$g(\boldsymbol{\alpha}, \boldsymbol{\beta}) = \inf_{\mathbf{x}} L(\mathbf{x}, \boldsymbol{\alpha}, \boldsymbol{\beta})$$
For each fixed $(\boldsymbol{\alpha}, \boldsymbol{\beta})$, we minimize the Lagrangian over $\mathbf{x}$.
The dual problem:
$$\max_{\boldsymbol{\alpha}, \boldsymbol{\beta}} g(\boldsymbol{\alpha}, \boldsymbol{\beta})$$ $$\text{s.t. } \alpha_i \geq 0$$
Key properties:
The dual provides a lower bound on the primal.
For SVM, strong duality holds because the problem is convex and constraints are linear.
| Aspect | Primal | Dual |
|---|---|---|
| Optimization | min | max |
| Variables | Primal x | Multipliers α, β |
| Constraints | gᵢ(x) ≤ 0, hⱼ(x) = 0 | αᵢ ≥ 0 |
| Objective | f(x) | g(α, β) = inf_x L |
| Bound | Upper bound | Lower bound |
| At optimum | Minimizes f | Maximizes g |
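The bound relationship in the table can be checked numerically. Below is a minimal sketch on a made-up one-dimensional problem (not the SVM itself): minimize $x^2$ subject to $x \geq 1$. Minimizing the Lagrangian $L(x,\alpha) = x^2 + \alpha(1-x)$ over $x$ gives $x = \alpha/2$, so the dual function is $g(\alpha) = \alpha - \alpha^2/4$, maximized at $\alpha = 2$ with value 1, matching the primal optimum $f(1) = 1$.

```python
import numpy as np

# Toy primal: min x^2  s.t.  g(x) = 1 - x <= 0   (optimum x* = 1, p* = 1)
# Dual function derived above: g(a) = a - a^2/4
def dual(a):
    return a - a ** 2 / 4.0

alphas = np.linspace(0.0, 4.0, 401)
dual_values = dual(alphas)

p_star = 1.0                  # primal optimum f(1) = 1
d_star = dual_values.max()    # dual optimum, attained at a = 2

# Weak duality: every dual value lower-bounds p*.
assert np.all(dual_values <= p_star + 1e-12)
# Strong duality holds for this convex problem: the bound is tight.
assert abs(d_star - p_star) < 1e-12
```

This is the general pattern SVM follows: the dual gives valid lower bounds everywhere, and convexity closes the gap at the optimum.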
Now let's apply this framework to soft margin SVM.
The primal problem:
$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^n \xi_i$$ $$\text{s.t. } y_i(\mathbf{w}^\top\mathbf{x}_i + b) \geq 1 - \xi_i, \quad i = 1, \ldots, n$$ $$\quad\quad \xi_i \geq 0, \quad i = 1, \ldots, n$$
Rewriting constraints in standard form ($g_i \leq 0$): $$1 - \xi_i - y_i(\mathbf{w}^\top\mathbf{x}_i + b) \leq 0$$ $$-\xi_i \leq 0$$
The Lagrangian:
Introduce Lagrange multipliers:
$$L(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\mu}) = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i + \sum_i \alpha_i[1 - \xi_i - y_i(\mathbf{w}^\top\mathbf{x}_i + b)] + \sum_i \mu_i(-\xi_i)$$
We have two constraints per point: margin constraint (αᵢ) and slack non-negativity (μᵢ). The μᵢ multipliers will be eliminated during the derivation, but they are crucial for obtaining the box constraint 0 ≤ αᵢ ≤ C.
Simplified Lagrangian:
$$L = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i + \sum_i \alpha_i - \sum_i \alpha_i\xi_i - \sum_i \alpha_i y_i(\mathbf{w}^\top\mathbf{x}_i + b) - \sum_i \mu_i \xi_i$$
Rearranging by variable:
$$L = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_i \alpha_i y_i \mathbf{w}^\top\mathbf{x}_i - b\sum_i \alpha_i y_i + \sum_i(C - \alpha_i - \mu_i)\xi_i + \sum_i \alpha_i$$
To find the dual function, we minimize $L$ over the primal variables $(\mathbf{w}, b, \boldsymbol{\xi})$ for fixed $(\boldsymbol{\alpha}, \boldsymbol{\mu})$, setting the derivative of $L$ with respect to each primal variable to zero.
Step 1: Minimize over w
$$\frac{\partial L}{\partial \mathbf{w}} = \mathbf{w} - \sum_i \alpha_i y_i \mathbf{x}_i = 0$$
$$\Rightarrow \mathbf{w}^* = \sum_i \alpha_i y_i \mathbf{x}_i$$
This is a fundamental result: the optimal weight vector is a linear combination of training points, weighted by the Lagrange multipliers.
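As a quick illustration with hypothetical $\alpha$ values (not a solved SVM), the expansion of $\mathbf{w}^*$ is a single matrix-vector product in NumPy:

```python
import numpy as np

# Hypothetical toy data and dual values, for illustration only
X = np.array([[2.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0])
alpha = np.array([0.3, 0.0, 0.3])  # assumed multipliers; alpha_2 = 0

# w* = sum_i alpha_i y_i x_i, vectorized as X^T (alpha * y)
w = X.T @ (alpha * y)
# Only points with alpha_i > 0 contribute: the middle point drops out.
```

Points with $\alpha_i = 0$ contribute nothing, which is exactly where the sparsity of the SVM solution comes from.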
Step 2: Minimize over b
$$\frac{\partial L}{\partial b} = -\sum_i \alpha_i y_i = 0$$
$$\Rightarrow \sum_i \alpha_i y_i = 0$$
This is a constraint on the dual variables: the weighted sum of labels must be zero.
Step 3: Minimize over ξ
$$\frac{\partial L}{\partial \xi_i} = C - \alpha_i - \mu_i = 0$$
$$\Rightarrow \mu_i = C - \alpha_i$$
Since $\mu_i \geq 0$, we need $\alpha_i \leq C$. Combined with $\alpha_i \geq 0$, this gives the box constraint:
$$0 \leq \alpha_i \leq C$$
The upper bound C on αᵢ arises from the slack variable constraints. In hard margin SVM, there are no slack variables, so μᵢ doesn't exist and αᵢ has no upper bound. The C parameter 'boxes in' the Lagrange multipliers, limiting how much any single point can influence the solution.
Step 4: Substitute back
Substitute $\mathbf{w}^* = \sum_j \alpha_j y_j \mathbf{x}_j$ into the Lagrangian:
$$L = \frac{1}{2}\left\|\sum_j \alpha_j y_j \mathbf{x}_j\right\|^2 - \sum_i \alpha_i y_i \left(\sum_j \alpha_j y_j \mathbf{x}_j\right)^\top\mathbf{x}_i + \sum_i \alpha_i$$
The $b$ and $\xi$ terms vanish due to the constraints from Steps 2 and 3.
Expanding the norm: $$\frac{1}{2}\left\|\sum_j \alpha_j y_j \mathbf{x}_j\right\|^2 = \frac{1}{2}\sum_i\sum_j \alpha_i \alpha_j y_i y_j \mathbf{x}_i^\top\mathbf{x}_j$$
Simplifying: $$L = \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^\top\mathbf{x}_j - \sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^\top\mathbf{x}_j + \sum_i \alpha_i$$
$$L = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^\top\mathbf{x}_j$$
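The simplified objective can be evaluated directly as a quadratic form in $\boldsymbol{\alpha}$. A minimal sketch (the function name is my own):

```python
import numpy as np

def dual_objective(alpha, X, y):
    """W(alpha) = sum_i alpha_i - 0.5 * sum_ij alpha_i alpha_j y_i y_j <x_i, x_j>."""
    Q = np.outer(y, y) * (X @ X.T)   # Q_ij = y_i y_j x_i' x_j
    return alpha.sum() - 0.5 * alpha @ Q @ alpha
```

Note that the data enters only through the Gram matrix `X @ X.T`, so swapping in a kernel matrix later requires changing a single line.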
The dual problem is to maximize the dual function subject to the dual constraints:
$$\max_{\boldsymbol{\alpha}} W(\boldsymbol{\alpha}) = \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i \alpha_j y_i y_j \mathbf{x}_i^\top\mathbf{x}_j$$
$$\text{subject to: } 0 \leq \alpha_i \leq C, \quad \forall i$$ $$\quad\quad\quad \sum_{i=1}^n \alpha_i y_i = 0$$
This is a quadratic program (QP) in standard form:
In matrix notation: max α'1 - ½α'Qα, where Qᵢⱼ = yᵢyⱼxᵢ'xⱼ (the label-scaled Gram matrix). This is a convex QP (Q is positive semi-definite), so any local optimum is global; when Q is only semi-definite the optimal α need not be unique, but the recovered w is.
```python
import numpy as np
import cvxpy as cp

def soft_margin_svm_dual(X, y, C):
    """
    Solve soft margin SVM in dual form.

    Dual problem:
        max  sum(alpha) - 0.5 * alpha' Q alpha
        s.t. 0 <= alpha <= C
             sum(alpha * y) = 0
    where Q_ij = y_i * y_j * x_i' * x_j

    Parameters
    ----------
    X : ndarray of shape (n_samples, n_features)
    y : ndarray of shape (n_samples,) with values in {-1, +1}
    C : float, regularization parameter

    Returns
    -------
    alpha : Lagrange multipliers
    w : weight vector (recovered from alpha)
    b : bias (recovered using support vectors)
    support_vectors : indices of support vectors
    """
    n_samples = X.shape[0]

    # Compute Q matrix: Q_ij = y_i * y_j * (x_i' x_j)
    # Using outer product of y and Gram matrix
    K = X @ X.T                # Gram matrix (linear kernel)
    Q = np.outer(y, y) * K     # Element-wise product

    # Ensure Q is positive semi-definite (numerical stability)
    Q = Q + 1e-8 * np.eye(n_samples)

    # Define dual variables
    alpha = cp.Variable(n_samples)

    # Dual objective: maximize sum(alpha) - 0.5 * alpha' Q alpha
    # CVXPY uses minimization, so we minimize the negative
    objective = cp.Minimize(0.5 * cp.quad_form(alpha, Q) - cp.sum(alpha))

    # Constraints
    constraints = [
        alpha >= 0,       # Lower bound
        alpha <= C,       # Upper bound (box constraint)
        y @ alpha == 0    # Equality constraint
    ]

    # Solve
    problem = cp.Problem(objective, constraints)
    problem.solve()
    alpha_opt = alpha.value

    # Recover w from alpha: w = sum(alpha_i * y_i * x_i)
    w = np.sum((alpha_opt * y)[:, np.newaxis] * X, axis=0)

    # Find support vectors (alpha > epsilon)
    epsilon = 1e-6
    sv_mask = alpha_opt > epsilon
    support_vectors = np.where(sv_mask)[0]

    # Recover b using free support vectors (0 < alpha < C)
    # For these: y_i(w'x_i + b) = 1, so b = y_i - w'x_i
    free_sv_mask = (alpha_opt > epsilon) & (alpha_opt < C - epsilon)
    if np.any(free_sv_mask):
        b_values = y[free_sv_mask] - X[free_sv_mask] @ w
        b = np.mean(b_values)
    else:
        # If no free SVs, use bounded SVs with averaging
        # For bounded SVs: y_i(w'x_i + b) = 1 - xi_i
        # This is less precise
        b_values = y[sv_mask] - X[sv_mask] @ w
        b = np.mean(b_values)

    return alpha_opt, w, b, support_vectors


def analyze_dual_solution(X, y, alpha, w, b, C):
    """Analyze and categorize the dual solution."""
    epsilon = 1e-6
    n_samples = len(y)

    # Categorize points
    inactive = alpha < epsilon
    free_sv = (alpha > epsilon) & (alpha < C - epsilon)
    bounded_sv = alpha > C - epsilon

    print("Dual Solution Analysis")
    print("=" * 50)
    print(f"Total samples: {n_samples}")
    print(f"Inactive (α≈0): {np.sum(inactive)} ({100*np.sum(inactive)/n_samples:.1f}%)")
    print(f"Free SVs (0<α<C): {np.sum(free_sv)} ({100*np.sum(free_sv)/n_samples:.1f}%)")
    print(f"Bounded SVs (α≈C): {np.sum(bounded_sv)} ({100*np.sum(bounded_sv)/n_samples:.1f}%)")
    print()

    # Verify constraints
    print("Constraint Verification:")
    print(f"  sum(α_i y_i) = {np.sum(alpha * y):.6f} (should be ≈ 0)")
    print(f"  min(α) = {np.min(alpha):.6f} (should be ≥ 0)")
    print(f"  max(α) = {np.max(alpha):.6f} (should be ≤ C={C})")
    print()

    # Compute margins
    margins = y * (X @ w + b)
    print("Margin Analysis:")
    print(f"  Correctly classified (margin > 0): {np.sum(margins > 0)}")
    print(f"  On margin (|margin - 1| < eps): {np.sum(np.abs(margins - 1) < 0.01)}")
    print(f"  Margin violations (0 < margin < 1): {np.sum((margins > 0) & (margins < 1))}")
    print(f"  Misclassified (margin < 0): {np.sum(margins < 0)}")

    return {
        'inactive': np.where(inactive)[0],
        'free_sv': np.where(free_sv)[0],
        'bounded_sv': np.where(bounded_sv)[0],
        'margins': margins
    }


# Example
if __name__ == "__main__":
    np.random.seed(42)

    # Non-separable dataset
    n = 100
    X_pos = np.random.randn(n//2, 2) + [1, 1]
    X_neg = np.random.randn(n//2, 2) + [-1, -1]
    X = np.vstack([X_pos, X_neg])
    y = np.array([1]*(n//2) + [-1]*(n//2))

    # Solve dual
    alpha, w, b, sv_idx = soft_margin_svm_dual(X, y, C=1.0)

    print(f"w = [{w[0]:.4f}, {w[1]:.4f}]")
    print(f"b = {b:.4f}")
    print(f"Number of support vectors: {len(sv_idx)}")
    print()

    # Analyze
    analysis = analyze_dual_solution(X, y, alpha, w, b, C=1.0)
```

Key observations about the dual:
Data appears only through inner products: The term $\mathbf{x}_i^\top\mathbf{x}_j$ is the only way training data enters the dual. This enables the kernel trick.
Number of variables = number of training points: Regardless of feature dimensionality, we have $n$ dual variables. For high-dimensional data with few samples, this is advantageous.
Convexity is preserved: The dual is concave in $\boldsymbol{\alpha}$ (we maximize), with a unique global maximum.
Sparse solution: Many $\alpha_i$ will be exactly zero at optimum—only support vectors have $\alpha_i > 0$.
The Karush-Kuhn-Tucker (KKT) conditions are necessary and sufficient conditions for optimality in convex problems. For soft margin SVM, they provide crucial insight into the structure of the solution.
The KKT conditions:
1. Stationarity: $$\nabla_{\mathbf{w}} L = 0 \Rightarrow \mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$$ $$\nabla_b L = 0 \Rightarrow \sum_i \alpha_i y_i = 0$$ $$\nabla_{\xi_i} L = 0 \Rightarrow C - \alpha_i - \mu_i = 0$$
2. Primal feasibility: $$y_i(\mathbf{w}^\top\mathbf{x}_i + b) \geq 1 - \xi_i$$ $$\xi_i \geq 0$$
3. Dual feasibility: $$\alpha_i \geq 0, \quad \mu_i \geq 0$$
4. Complementary slackness: $$\alpha_i[y_i(\mathbf{w}^\top\mathbf{x}_i + b) - 1 + \xi_i] = 0$$ $$\mu_i \xi_i = 0$$
Complementary slackness says: 'either the constraint is active (equality holds) or its multiplier is zero.' If a constraint is inactive (strict inequality), we don't need to 'push' on it, so the multiplier is zero. If the multiplier is positive, we're pushing on the constraint, so it must be active.
Implications of KKT conditions:
From $\mu_i = C - \alpha_i$ and $\mu_i \xi_i = 0$: if $\alpha_i < C$, then $\mu_i > 0$, which forces $\xi_i = 0$ (no slack); conversely, any point with $\xi_i > 0$ must have $\alpha_i = C$.
From $\alpha_i[y_i(\mathbf{w}^\top\mathbf{x}_i + b) - 1 + \xi_i] = 0$: if $\alpha_i > 0$, the margin constraint is active, i.e. $y_i(\mathbf{w}^\top\mathbf{x}_i + b) = 1 - \xi_i$; if the margin constraint holds strictly, then $\alpha_i = 0$.
Three categories of training points:
| Condition | $\alpha_i$ | $\xi_i$ | Margin | Interpretation |
|---|---|---|---|---|
| $\alpha_i = 0$ | 0 | 0 | $> 1$ | Correctly outside margin |
| $0 < \alpha_i < C$ | $(0, C)$ | 0 | $= 1$ | On margin (free SV) |
| $\alpha_i = C$ | $C$ | $\geq 0$ | $\leq 1$ | Margin violator (bounded SV) |
For free support vectors (0 < αᵢ < C), we have ξᵢ = 0 and yᵢ(w'xᵢ + b) = 1. This gives b = yᵢ - w'xᵢ. In practice, we average over all free SVs for numerical stability: b = mean_i{yᵢ - w'xᵢ : 0 < αᵢ < C}.
| Condition Type | Equations | Interpretation |
|---|---|---|
| Stationarity | ∇L = 0 | Gradient vanishes at optimum |
| Primal feasibility | gᵢ(x) ≤ 0 | Primal constraints satisfied |
| Dual feasibility | αᵢ ≥ 0 | Multipliers non-negative |
| Complementary slackness | αᵢgᵢ(x) = 0 | Inactive constraints have zero multiplier |
After solving the dual, we need to recover the primal variables $(\mathbf{w}, b)$ for making predictions.
Recovering w:
From the stationarity condition: $$\mathbf{w}^* = \sum_{i=1}^n \alpha_i^* y_i \mathbf{x}_i = \sum_{i \in SV} \alpha_i^* y_i \mathbf{x}_i$$
where $SV = \{i : \alpha_i^* > 0\}$ is the set of support vectors. The weight vector is a weighted sum of support vectors.
Recovering b:
For any free support vector with $0 < \alpha_i^* < C$, KKT gives: $$y_i(\mathbf{w}^{*\top}\mathbf{x}_i + b^*) = 1$$ $$\Rightarrow b^* = y_i - \mathbf{w}^{*\top}\mathbf{x}_i$$
Since $y_i \in \{-1, +1\}$, we can also write: $$b^* = \frac{1}{y_i} - \mathbf{w}^{*\top}\mathbf{x}_i = y_i - \mathbf{w}^{*\top}\mathbf{x}_i$$
For numerical stability, average over all free SVs: $$b^* = \frac{1}{|SV_{\text{free}}|}\sum_{i \in SV_{\text{free}}} (y_i - \mathbf{w}^{*\top}\mathbf{x}_i)$$
```python
import numpy as np

def recover_primal_from_dual(X, y, alpha, C, tolerance=1e-6):
    """
    Recover primal solution (w, b) from dual solution alpha.

    Parameters
    ----------
    X : ndarray of shape (n_samples, n_features)
    y : ndarray of shape (n_samples,)
    alpha : ndarray of shape (n_samples,), dual solution
    C : float, regularization parameter
    tolerance : float, threshold for identifying support vectors

    Returns
    -------
    w : weight vector
    b : bias
    sv_info : dict with support vector information
    """
    n_samples = len(y)

    # Identify support vector types
    is_sv = alpha > tolerance
    is_free_sv = is_sv & (alpha < C - tolerance)
    is_bounded_sv = alpha > C - tolerance

    # Recover w: w = sum(alpha_i * y_i * x_i)
    # Only support vectors contribute (alpha_i > 0)
    w = np.zeros(X.shape[1])
    for i in range(n_samples):
        if alpha[i] > tolerance:
            w += alpha[i] * y[i] * X[i]

    # Alternative (vectorized):
    # w = ((alpha * y)[:, np.newaxis] * X).sum(axis=0)

    # Recover b using free support vectors
    free_sv_indices = np.where(is_free_sv)[0]
    if len(free_sv_indices) > 0:
        # For free SVs: y_i(w'x_i + b) = 1, so b = y_i - w'x_i
        b_values = []
        for i in free_sv_indices:
            b_i = y[i] - np.dot(w, X[i])
            b_values.append(b_i)
        b = np.mean(b_values)
    else:
        # No free SVs (C too small?) - use all SVs
        # Less accurate because bounded SVs have margin ≠ 1
        sv_indices = np.where(is_sv)[0]
        if len(sv_indices) > 0:
            b_values = [y[i] - np.dot(w, X[i]) for i in sv_indices]
            b = np.median(b_values)  # Median more robust
        else:
            b = 0.0  # No support vectors - degenerate case

    # Compile support vector information
    sv_info = {
        'sv_indices': np.where(is_sv)[0],
        'free_sv_indices': free_sv_indices,
        'bounded_sv_indices': np.where(is_bounded_sv)[0],
        'n_sv': np.sum(is_sv),
        'n_free_sv': np.sum(is_free_sv),
        'n_bounded_sv': np.sum(is_bounded_sv),
    }

    return w, b, sv_info


def verify_kkt_conditions(X, y, alpha, w, b, C, tolerance=1e-4):
    """Verify KKT conditions are satisfied."""
    n_samples = len(y)
    violations = []

    # Compute margins
    margins = y * (X @ w + b)
    # Compute implied slack
    slack = np.maximum(0, 1 - margins)

    print("KKT Verification")
    print("=" * 50)

    # 1. Stationarity for w: w = sum(alpha_i * y_i * x_i)
    w_reconstructed = np.sum((alpha * y)[:, np.newaxis] * X, axis=0)
    w_error = np.linalg.norm(w - w_reconstructed)
    print(f"1. Stationarity (w): ||w - Σαyx|| = {w_error:.2e}")
    if w_error > tolerance:
        violations.append("w stationarity")

    # 2. Stationarity for b: sum(alpha_i * y_i) = 0
    balance = np.sum(alpha * y)
    print(f"2. Stationarity (b): Σαy = {balance:.2e}")
    if abs(balance) > tolerance:
        violations.append("b stationarity")

    # 3. Primal feasibility: y(w'x + b) >= 1 - xi, xi >= 0
    margin_violations = margins < 1 - slack - tolerance
    print(f"3. Primal feasibility: {np.sum(margin_violations)} violations")
    if np.any(margin_violations):
        violations.append("primal feasibility")

    # 4. Dual feasibility: 0 <= alpha <= C
    alpha_violations = (alpha < -tolerance) | (alpha > C + tolerance)
    print(f"4. Dual feasibility: {np.sum(alpha_violations)} violations")
    if np.any(alpha_violations):
        violations.append("dual feasibility")

    # 5. Complementary slackness
    # alpha_i * (y_i(w'x_i + b) - 1 + xi_i) = 0
    # For 0 < alpha_i < C: should have xi_i = 0, margin = 1
    # For alpha_i = C: can have xi_i > 0
    # For alpha_i = 0: constraint can be inactive
    cs_violations = 0
    for i in range(n_samples):
        if alpha[i] > tolerance and alpha[i] < C - tolerance:
            # Free SV: should have margin = 1
            if abs(margins[i] - 1) > tolerance:
                cs_violations += 1
        elif alpha[i] < tolerance:
            # Inactive: margin should be >= 1
            if margins[i] < 1 - tolerance:
                cs_violations += 1
    print(f"5. Complementary slackness: {cs_violations} violations")
    if cs_violations > 0:
        violations.append("complementary slackness")

    print()
    if violations:
        print(f"FAILED: {violations}")
    else:
        print("PASSED: All KKT conditions satisfied")

    return len(violations) == 0
```

Making predictions:
With $(\mathbf{w}^*, b^*)$ recovered, prediction for a new point $\mathbf{x}$ is:
$$f(\mathbf{x}) = \mathbf{w}^{*\top}\mathbf{x} + b^* = \sum_{i \in SV} \alpha_i^* y_i (\mathbf{x}_i^\top\mathbf{x}) + b^*$$
$$\hat{y} = \text{sign}(f(\mathbf{x}))$$
Note: predictions depend only on inner products between the new point and support vectors. This is crucial for the kernel trick.
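A sketch of this support-vector-only prediction (function and argument names are my own; assumes the SV arrays have already been extracted from a solved model):

```python
import numpy as np

def predict(X_new, X_sv, y_sv, alpha_sv, b):
    """f(x) = sum_{i in SV} alpha_i y_i <x_i, x> + b, then take the sign.

    Only inner products between new points and support vectors appear,
    which is exactly what the kernel trick will later exploit.
    """
    decision = (X_new @ X_sv.T) @ (alpha_sv * y_sv) + b
    return np.sign(decision)
```

Replacing `X_new @ X_sv.T` with a kernel evaluation `K(X_new, X_sv)` is the only change needed for nonlinear SVMs.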
For soft margin SVM, strong duality holds: the optimal primal and dual objective values are equal.
Primal optimal value: $$p^* = \frac{1}{2}\|\mathbf{w}^*\|^2 + C\sum_i \xi_i^*$$
Dual optimal value: $$d^* = \sum_i \alpha_i^* - \frac{1}{2}\sum_{i,j} \alpha_i^* \alpha_j^* y_i y_j \mathbf{x}_i^\top\mathbf{x}_j$$
Strong duality theorem: $$p^* = d^*$$
Why does strong duality hold?
The primal is a convex optimization problem (quadratic objective, linear constraints), and Slater's condition is satisfied—there exist points strictly satisfying all inequality constraints. This guarantees strong duality.
Slater's condition for SVM: For any $\mathbf{w}$, we can choose $\xi_i$ large enough that $\xi_i > 0$ strictly and $y_i(\mathbf{w}^\top\mathbf{x}_i + b) > 1 - \xi_i$ strictly. So feasible interior points exist.
The duality gap p* - d* = 0 at optimum. During optimization, tracking the gap provides a convergence certificate: if gap < ε, you're within ε of optimal. Many SVM solvers use this as a stopping criterion.
Practical implications:
Algorithm validation: After solving, compute both primal and dual objectives. They should match (within numerical tolerance).
Stopping criterion: Optimization algorithms can monitor the duality gap $p - d$. When it falls below a threshold, we're near optimal.
Bound computation: Before convergence, the dual value provides a lower bound on the optimal primal value: $d \leq p^*$.
Computing the duality gap:
Given current primal $(\mathbf{w}, b, \boldsymbol{\xi})$ and dual $\boldsymbol{\alpha}$:
$$\text{gap} = P(\mathbf{w}, b, \boldsymbol{\xi}) - D(\boldsymbol{\alpha})$$
where $P$ is the primal objective and $D$ is the dual objective. At optimum, gap = 0.
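A minimal sketch of the gap computation (function names are my own; the slack is taken to be the smallest feasible value implied by the margins rather than stored separately):

```python
import numpy as np

def primal_objective(w, b, X, y, C):
    """P(w, b, xi) with xi_i = max(0, 1 - y_i(w'x_i + b)), the smallest feasible slack."""
    xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * (w @ w) + C * xi.sum()

def dual_objective(alpha, X, y):
    """D(alpha) = sum(alpha) - 0.5 * alpha' Q alpha with Q_ij = y_i y_j x_i' x_j."""
    Q = np.outer(y, y) * (X @ X.T)
    return alpha.sum() - 0.5 * alpha @ Q @ alpha

def duality_gap(w, b, alpha, X, y, C):
    # Nonnegative for dual-feasible alpha; zero exactly at the optimum.
    return primal_objective(w, b, X, y, C) - dual_objective(alpha, X, y)
```

A solver can stop once `duality_gap(...)` drops below a chosen tolerance, since the gap upper-bounds the suboptimality of the current primal iterate.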
| Property | Implication |
|---|---|
| p* = d* | Primal and dual have same optimal value |
| Slater's satisfied | Strong duality holds for SVM |
| Gap = 0 at optimum | Convergence certificate |
| d ≤ p* | Dual provides lower bound during optimization |
| KKT ⟺ Optimality | KKT conditions are necessary and sufficient |
The dual formulation of soft margin SVM reveals deep mathematical structure and enables powerful extensions: data enters only through inner products (the door to kernels), the solution is sparse in the support vectors, and optimality is fully characterized by the KKT conditions.
What's next:
With the dual formulation in hand, we're ready to complete our understanding of soft margin SVM by exploring the geometric and intuitive interpretation of the soft margin. We'll see how the margin, support vectors, and violations combine to form a coherent picture of the SVM decision boundary.
You now understand the dual formulation of soft margin SVM—the mathematical heart of the SVM framework. This knowledge unlocks kernel SVMs, optimization algorithms like SMO, and provides deep insight into SVM behavior. The dual perspective will be essential for the kernel SVM module that follows.