The primal formulation of soft margin SVM—minimizing $\frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i\xi_i$ subject to margin constraints—is perfectly valid and can be solved directly. So why introduce an alternative dual formulation?
The dual offers three profound advantages:
Kernel trick enabling: The dual depends on training points only through their inner products $\mathbf{x}_i^\top\mathbf{x}_j$. This allows replacing inner products with kernel functions, enabling nonlinear decision boundaries in high (even infinite) dimensional spaces.
Sparse representation: The optimal solution depends only on support vectors—points with $\alpha_i > 0$. Predictions require computing inner products with only these points, not all training data.
Optimization efficiency: For many problems, particularly with kernels, the dual is computationally preferable. The number of variables is $n$ (samples) regardless of dimensionality.
The dual formulation transforms SVM from a geometric margin maximizer into an elegant optimization over point importances, revealing deep mathematical structure.
This page derives the dual formulation of soft margin SVM from first principles. You will understand Lagrangian duality, KKT conditions, the dual optimization problem, and how the solution relates back to the primal. This mathematical framework is essential for kernel SVMs and understanding SVM optimization algorithms.
Before diving into SVM specifically, let's review the general framework of Lagrangian duality for constrained optimization.
Primal problem (general form):
$$\min_{\mathbf{x}} f(\mathbf{x})$$ $$\text{s.t. } g_i(\mathbf{x}) \leq 0, \quad i = 1, \ldots, m$$ $$\quad\quad h_j(\mathbf{x}) = 0, \quad j = 1, \ldots, p$$
The Lagrangian:
$$L(\mathbf{x}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = f(\mathbf{x}) + \sum_{i=1}^m \alpha_i g_i(\mathbf{x}) + \sum_{j=1}^p \beta_j h_j(\mathbf{x})$$
where $\alpha_i \geq 0$ (for inequality constraints) and $\beta_j$ (unrestricted, for equality constraints) are Lagrange multipliers.
Each Lagrange multiplier measures the 'price' of its constraint. αᵢ indicates how much the optimal value would improve if constraint gᵢ were relaxed slightly. At inactive constraints (gᵢ < 0 at optimum), αᵢ = 0—relaxing an already-satisfied constraint doesn't help.
The Lagrange dual function:
$$g(\boldsymbol{\alpha}, \boldsymbol{\beta}) = \inf_{\mathbf{x}} L(\mathbf{x}, \boldsymbol{\alpha}, \boldsymbol{\beta})$$
For each fixed $(\boldsymbol{\alpha}, \boldsymbol{\beta})$, we minimize the Lagrangian over $\mathbf{x}$.
The dual problem:
$$\max_{\boldsymbol{\alpha}, \boldsymbol{\beta}} g(\boldsymbol{\alpha}, \boldsymbol{\beta})$$ $$\text{s.t. } \alpha_i \geq 0$$
Key properties:
The dual provides a lower bound on the primal.
For SVM, strong duality holds because the problem is convex and constraints are linear.
| Aspect | Primal | Dual |
|---|---|---|
| Optimization | min | max |
| Variables | Primal x | Multipliers α, β |
| Constraints | gᵢ(x) ≤ 0, hⱼ(x) = 0 | αᵢ ≥ 0 |
| Objective | f(x) | g(α, β) = inf_x L |
| Bound | Upper bound | Lower bound |
| At optimum | Minimizes f | Maximizes g |
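The bound relationship in the table can be checked numerically. Below is a minimal sketch on a made-up one-dimensional problem (not the SVM itself): minimize $x^2$ subject to $x \geq 1$. Minimizing the Lagrangian $L(x,\alpha) = x^2 + \alpha(1-x)$ over $x$ gives $x = \alpha/2$, so the dual function is $g(\alpha) = \alpha - \alpha^2/4$, maximized at $\alpha = 2$ with value 1, matching the primal optimum $f(1) = 1$.

```python
import numpy as np

# Toy primal: min x^2  s.t.  g(x) = 1 - x <= 0   (optimum x* = 1, p* = 1)
# Dual function derived above: g(a) = a - a^2/4
def dual(a):
    return a - a ** 2 / 4.0

alphas = np.linspace(0.0, 4.0, 401)
dual_values = dual(alphas)

p_star = 1.0                  # primal optimum f(1) = 1
d_star = dual_values.max()    # dual optimum, attained at a = 2

# Weak duality: every dual value lower-bounds p*.
assert np.all(dual_values <= p_star + 1e-12)
# Strong duality holds for this convex problem: the bound is tight.
assert abs(d_star - p_star) < 1e-12
```

This is the general pattern SVM follows: the dual gives valid lower bounds everywhere, and convexity closes the gap at the optimum.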
Now let's apply this framework to soft margin SVM.
The primal problem:
$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^n \xi_i$$ $$\text{s.t. } y_i(\mathbf{w}^\top\mathbf{x}_i + b) \geq 1 - \xi_i, \quad i = 1, \ldots, n$$ $$\quad\quad \xi_i \geq 0, \quad i = 1, \ldots, n$$
Rewriting constraints in standard form ($g_i \leq 0$): $$1 - \xi_i - y_i(\mathbf{w}^\top\mathbf{x}_i + b) \leq 0$$ $$-\xi_i \leq 0$$
The Lagrangian:
Introduce Lagrange multipliers:
$$L(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\mu}) = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i + \sum_i \alpha_i[1 - \xi_i - y_i(\mathbf{w}^\top\mathbf{x}_i + b)] + \sum_i \mu_i(-\xi_i)$$
We have two constraints per point: margin constraint (αᵢ) and slack non-negativity (μᵢ). The μᵢ multipliers will be eliminated during the derivation, but they are crucial for obtaining the box constraint 0 ≤ αᵢ ≤ C.
Simplified Lagrangian:
$$L = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i + \sum_i \alpha_i - \sum_i \alpha_i\xi_i - \sum_i \alpha_i y_i(\mathbf{w}^\top\mathbf{x}_i + b) - \sum_i \mu_i \xi_i$$
Rearranging by variable:
$$L = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_i \alpha_i y_i \mathbf{w}^\top\mathbf{x}_i - b\sum_i \alpha_i y_i + \sum_i(C - \alpha_i - \mu_i)\xi_i + \sum_i \alpha_i$$
To find the dual function, we minimize $L$ over the primal variables $(\mathbf{w}, b, \boldsymbol{\xi})$ for fixed $(\boldsymbol{\alpha}, \boldsymbol{\mu})$, setting the derivative of $L$ with respect to each primal variable to zero.
Step 1: Minimize over w
$$\frac{\partial L}{\partial \mathbf{w}} = \mathbf{w} - \sum_i \alpha_i y_i \mathbf{x}_i = 0$$
$$\Rightarrow \mathbf{w}^* = \sum_i \alpha_i y_i \mathbf{x}_i$$
This is a fundamental result: the optimal weight vector is a linear combination of training points, weighted by the Lagrange multipliers.
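As a quick illustration with hypothetical $\alpha$ values (not a solved SVM), the expansion of $\mathbf{w}^*$ is a single matrix-vector product in NumPy:

```python
import numpy as np

# Hypothetical toy data and dual values, for illustration only
X = np.array([[2.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0])
alpha = np.array([0.3, 0.0, 0.3])  # assumed multipliers; alpha_2 = 0

# w* = sum_i alpha_i y_i x_i, vectorized as X^T (alpha * y)
w = X.T @ (alpha * y)
# Only points with alpha_i > 0 contribute: the middle point drops out.
```

Points with $\alpha_i = 0$ contribute nothing, which is exactly where the sparsity of the SVM solution comes from.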
Step 2: Minimize over b
$$\frac{\partial L}{\partial b} = -\sum_i \alpha_i y_i = 0$$
$$\Rightarrow \sum_i \alpha_i y_i = 0$$
This is a constraint on the dual variables: the weighted sum of labels must be zero.
Step 3: Minimize over ξ
$$\frac{\partial L}{\partial \xi_i} = C - \alpha_i - \mu_i = 0$$
$$\Rightarrow \mu_i = C - \alpha_i$$
Since $\mu_i \geq 0$, we need $\alpha_i \leq C$. Combined with $\alpha_i \geq 0$, this gives the box constraint:
$$0 \leq \alpha_i \leq C$$
The upper bound C on αᵢ arises from the slack variable constraints. In hard margin SVM, there are no slack variables, so μᵢ doesn't exist and αᵢ has no upper bound. The C parameter 'boxes in' the Lagrange multipliers, limiting how much any single point can influence the solution.
Step 4: Substitute back
Substitute $\mathbf{w}^* = \sum_j \alpha_j y_j \mathbf{x}_j$ into the Lagrangian:
$$L = \frac{1}{2}\left\|\sum_j \alpha_j y_j \mathbf{x}_j\right\|^2 - \sum_i \alpha_i y_i \left(\sum_j \alpha_j y_j \mathbf{x}_j\right)^\top\mathbf{x}_i + \sum_i \alpha_i$$
The $b$ and $\xi$ terms vanish due to the constraints from Steps 2 and 3.
Expanding the norm: $$\frac{1}{2}\left\|\sum_j \alpha_j y_j \mathbf{x}_j\right\|^2 = \frac{1}{2}\sum_i\sum_j \alpha_i \alpha_j y_i y_j \mathbf{x}_i^\top\mathbf{x}_j$$
Simplifying: $$L = \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^\top\mathbf{x}_j - \sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^\top\mathbf{x}_j + \sum_i \alpha_i$$
$$L = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^\top\mathbf{x}_j$$
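The simplified objective can be evaluated directly as a quadratic form in $\boldsymbol{\alpha}$. A minimal sketch (the function name is my own):

```python
import numpy as np

def dual_objective(alpha, X, y):
    """W(alpha) = sum_i alpha_i - 0.5 * sum_ij alpha_i alpha_j y_i y_j <x_i, x_j>."""
    Q = np.outer(y, y) * (X @ X.T)   # Q_ij = y_i y_j x_i' x_j
    return alpha.sum() - 0.5 * alpha @ Q @ alpha
```

Note that the data enters only through the Gram matrix `X @ X.T`, so swapping in a kernel matrix later requires changing a single line.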
The dual problem is to maximize the dual function subject to the dual constraints:
$$\max_{\boldsymbol{\alpha}} W(\boldsymbol{\alpha}) = \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i \alpha_j y_i y_j \mathbf{x}_i^\top\mathbf{x}_j$$
$$\text{subject to: } 0 \leq \alpha_i \leq C, \quad \forall i$$ $$\quad\quad\quad \sum_{i=1}^n \alpha_i y_i = 0$$
This is a quadratic program (QP) in standard form:
In matrix notation: max α'1 - ½α'Qα, where Qᵢⱼ = yᵢyⱼxᵢ'xⱼ (the label-scaled Gram matrix). This is a convex QP (Q is positive semi-definite), so any local optimum is global; when Q is only semi-definite the optimal α need not be unique, but the recovered w is.
```python
import numpy as np
import cvxpy as cp

def soft_margin_svm_dual(X, y, C):
    """
    Solve soft margin SVM in dual form.

    Dual problem:
        max  sum(alpha) - 0.5 * alpha' Q alpha
        s.t. 0 <= alpha <= C
             sum(alpha * y) = 0
    where Q_ij = y_i * y_j * x_i' * x_j

    Parameters
    ----------
    X : ndarray of shape (n_samples, n_features)
    y : ndarray of shape (n_samples,) with values in {-1, +1}
    C : float, regularization parameter

    Returns
    -------
    alpha : Lagrange multipliers
    w : weight vector (recovered from alpha)
    b : bias (recovered using support vectors)
    support_vectors : indices of support vectors
    """
    n_samples = X.shape[0]

    # Compute Q matrix: Q_ij = y_i * y_j * (x_i' x_j)
    # Using outer product of y and Gram matrix
    K = X @ X.T                # Gram matrix (linear kernel)
    Q = np.outer(y, y) * K     # Element-wise product

    # Ensure Q is positive semi-definite (numerical stability)
    Q = Q + 1e-8 * np.eye(n_samples)

    # Define dual variables
    alpha = cp.Variable(n_samples)

    # Dual objective: maximize sum(alpha) - 0.5 * alpha' Q alpha
    # CVXPY uses minimization, so we minimize the negative
    objective = cp.Minimize(0.5 * cp.quad_form(alpha, Q) - cp.sum(alpha))

    # Constraints
    constraints = [
        alpha >= 0,       # Lower bound
        alpha <= C,       # Upper bound (box constraint)
        y @ alpha == 0    # Equality constraint
    ]

    # Solve
    problem = cp.Problem(objective, constraints)
    problem.solve()
    alpha_opt = alpha.value

    # Recover w from alpha: w = sum(alpha_i * y_i * x_i)
    w = np.sum((alpha_opt * y)[:, np.newaxis] * X, axis=0)

    # Find support vectors (alpha > epsilon)
    epsilon = 1e-6
    sv_mask = alpha_opt > epsilon
    support_vectors = np.where(sv_mask)[0]

    # Recover b using free support vectors (0 < alpha < C)
    # For these: y_i(w'x_i + b) = 1, so b = y_i - w'x_i
    free_sv_mask = (alpha_opt > epsilon) & (alpha_opt < C - epsilon)
    if np.any(free_sv_mask):
        b_values = y[free_sv_mask] - X[free_sv_mask] @ w
        b = np.mean(b_values)
    else:
        # If no free SVs, use bounded SVs with averaging
        # For bounded SVs: y_i(w'x_i + b) = 1 - xi_i
        # This is less precise
        b_values = y[sv_mask] - X[sv_mask] @ w
        b = np.mean(b_values)

    return alpha_opt, w, b, support_vectors


def analyze_dual_solution(X, y, alpha, w, b, C):
    """Analyze and categorize the dual solution."""
    epsilon = 1e-6
    n_samples = len(y)

    # Categorize points
    inactive = alpha < epsilon
    free_sv = (alpha > epsilon) & (alpha < C - epsilon)
    bounded_sv = alpha > C - epsilon

    print("Dual Solution Analysis")
    print("=" * 50)
    print(f"Total samples: {n_samples}")
    print(f"Inactive (α≈0): {np.sum(inactive)} ({100*np.sum(inactive)/n_samples:.1f}%)")
    print(f"Free SVs (0<α<C): {np.sum(free_sv)} ({100*np.sum(free_sv)/n_samples:.1f}%)")
    print(f"Bounded SVs (α≈C): {np.sum(bounded_sv)} ({100*np.sum(bounded_sv)/n_samples:.1f}%)")
    print()

    # Verify constraints
    print("Constraint Verification:")
    print(f"  sum(α_i y_i) = {np.sum(alpha * y):.6f} (should be ≈ 0)")
    print(f"  min(α) = {np.min(alpha):.6f} (should be ≥ 0)")
    print(f"  max(α) = {np.max(alpha):.6f} (should be ≤ C={C})")
    print()

    # Compute margins
    margins = y * (X @ w + b)
    print("Margin Analysis:")
    print(f"  Correctly classified (margin > 0): {np.sum(margins > 0)}")
    print(f"  On margin (|margin - 1| < eps): {np.sum(np.abs(margins - 1) < 0.01)}")
    print(f"  Margin violations (0 < margin < 1): {np.sum((margins > 0) & (margins < 1))}")
    print(f"  Misclassified (margin < 0): {np.sum(margins < 0)}")

    return {
        'inactive': np.where(inactive)[0],
        'free_sv': np.where(free_sv)[0],
        'bounded_sv': np.where(bounded_sv)[0],
        'margins': margins
    }


# Example
if __name__ == "__main__":
    np.random.seed(42)

    # Non-separable dataset
    n = 100
    X_pos = np.random.randn(n//2, 2) + [1, 1]
    X_neg = np.random.randn(n//2, 2) + [-1, -1]
    X = np.vstack([X_pos, X_neg])
    y = np.array([1]*(n//2) + [-1]*(n//2))

    # Solve dual
    alpha, w, b, sv_idx = soft_margin_svm_dual(X, y, C=1.0)

    print(f"w = [{w[0]:.4f}, {w[1]:.4f}]")
    print(f"b = {b:.4f}")
    print(f"Number of support vectors: {len(sv_idx)}")
    print()

    # Analyze
    analysis = analyze_dual_solution(X, y, alpha, w, b, C=1.0)
```

Key observations about the dual:
Data appears only through inner products: The term $\mathbf{x}_i^\top\mathbf{x}_j$ is the only way training data enters the dual. This enables the kernel trick.
Number of variables = number of training points: Regardless of feature dimensionality, we have $n$ dual variables. For high-dimensional data with few samples, this is advantageous.
Convexity is preserved: The dual is concave in $\boldsymbol{\alpha}$ (we maximize), with a unique global maximum.
Sparse solution: Many $\alpha_i$ will be exactly zero at optimum—only support vectors have $\alpha_i > 0$.
The Karush-Kuhn-Tucker (KKT) conditions are necessary and sufficient conditions for optimality in convex problems. For soft margin SVM, they provide crucial insight into the structure of the solution.
The KKT conditions:
1. Stationarity: $$\nabla_{\mathbf{w}} L = 0 \Rightarrow \mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$$ $$\nabla_b L = 0 \Rightarrow \sum_i \alpha_i y_i = 0$$ $$\nabla_{\xi_i} L = 0 \Rightarrow C - \alpha_i - \mu_i = 0$$
2. Primal feasibility: $$y_i(\mathbf{w}^\top\mathbf{x}_i + b) \geq 1 - \xi_i$$ $$\xi_i \geq 0$$
3. Dual feasibility: $$\alpha_i \geq 0, \quad \mu_i \geq 0$$
4. Complementary slackness: $$\alpha_i[y_i(\mathbf{w}^\top\mathbf{x}_i + b) - 1 + \xi_i] = 0$$ $$\mu_i \xi_i = 0$$
Complementary slackness says: 'either the constraint is active (equality holds) or its multiplier is zero.' If a constraint is inactive (strict inequality), we don't need to 'push' on it, so the multiplier is zero. If the multiplier is positive, we're pushing on the constraint, so it must be active.
Implications of KKT conditions:
From $\mu_i = C - \alpha_i$ and $\mu_i \xi_i = 0$: if $\alpha_i < C$, then $\mu_i > 0$, which forces $\xi_i = 0$ (no slack); conversely, any point with $\xi_i > 0$ must have $\alpha_i = C$.
From $\alpha_i[y_i(\mathbf{w}^\top\mathbf{x}_i + b) - 1 + \xi_i] = 0$: if $\alpha_i > 0$, the margin constraint is active, i.e. $y_i(\mathbf{w}^\top\mathbf{x}_i + b) = 1 - \xi_i$; if the margin constraint holds strictly, then $\alpha_i = 0$.
Three categories of training points:
| Condition | $\alpha_i$ | $\xi_i$ | Margin | Interpretation |
|---|---|---|---|---|
| $\alpha_i = 0$ | 0 | 0 | $> 1$ | Correctly outside margin |
| $0 < \alpha_i < C$ | $(0, C)$ | 0 | $= 1$ | On margin (free SV) |
| $\alpha_i = C$ | $C$ | $\geq 0$ | $\leq 1$ | Margin violator (bounded SV) |
For free support vectors (0 < αᵢ < C), we have ξᵢ = 0 and yᵢ(w'xᵢ + b) = 1. This gives b = yᵢ - w'xᵢ. In practice, we average over all free SVs for numerical stability: b = mean_i{yᵢ - w'xᵢ : 0 < αᵢ < C}.
| Condition Type | Equations | Interpretation |
|---|---|---|
| Stationarity | ∇L = 0 | Gradient vanishes at optimum |
| Primal feasibility | gᵢ(x) ≤ 0 | Primal constraints satisfied |
| Dual feasibility | αᵢ ≥ 0 | Multipliers non-negative |
| Complementary slackness | αᵢgᵢ(x) = 0 | Inactive constraints have zero multiplier |
After solving the dual, we need to recover the primal variables $(\mathbf{w}, b)$ for making predictions.
Recovering w:
From the stationarity condition: $$\mathbf{w}^* = \sum_{i=1}^n \alpha_i^* y_i \mathbf{x}_i = \sum_{i \in SV} \alpha_i^* y_i \mathbf{x}_i$$
where $SV = \{i : \alpha_i^* > 0\}$ is the set of support vectors. The weight vector is a weighted sum of support vectors.
Recovering b:
For any free support vector with $0 < \alpha_i^* < C$, KKT gives: $$y_i(\mathbf{w}^{*\top}\mathbf{x}_i + b^*) = 1$$ $$\Rightarrow b^* = y_i - \mathbf{w}^{*\top}\mathbf{x}_i$$
Since $y_i \in \{-1, +1\}$, we can also write: $$b^* = \frac{1}{y_i} - \mathbf{w}^{*\top}\mathbf{x}_i = y_i - \mathbf{w}^{*\top}\mathbf{x}_i$$
For numerical stability, average over all free SVs: $$b^* = \frac{1}{|SV_{\text{free}}|}\sum_{i \in SV_{\text{free}}} (y_i - \mathbf{w}^{*\top}\mathbf{x}_i)$$
```python
import numpy as np

def recover_primal_from_dual(X, y, alpha, C, tolerance=1e-6):
    """
    Recover primal solution (w, b) from dual solution alpha.

    Parameters
    ----------
    X : ndarray of shape (n_samples, n_features)
    y : ndarray of shape (n_samples,)
    alpha : ndarray of shape (n_samples,), dual solution
    C : float, regularization parameter
    tolerance : float, threshold for identifying support vectors

    Returns
    -------
    w : weight vector
    b : bias
    sv_info : dict with support vector information
    """
    n_samples = len(y)

    # Identify support vector types
    is_sv = alpha > tolerance
    is_free_sv = is_sv & (alpha < C - tolerance)
    is_bounded_sv = alpha > C - tolerance

    # Recover w: w = sum(alpha_i * y_i * x_i)
    # Only support vectors contribute (alpha_i > 0)
    w = np.zeros(X.shape[1])
    for i in range(n_samples):
        if alpha[i] > tolerance:
            w += alpha[i] * y[i] * X[i]

    # Alternative (vectorized):
    # w = ((alpha * y)[:, np.newaxis] * X).sum(axis=0)

    # Recover b using free support vectors
    free_sv_indices = np.where(is_free_sv)[0]
    if len(free_sv_indices) > 0:
        # For free SVs: y_i(w'x_i + b) = 1, so b = y_i - w'x_i
        b_values = []
        for i in free_sv_indices:
            b_i = y[i] - np.dot(w, X[i])
            b_values.append(b_i)
        b = np.mean(b_values)
    else:
        # No free SVs (C too small?) - use all SVs
        # Less accurate because bounded SVs have margin ≠ 1
        sv_indices = np.where(is_sv)[0]
        if len(sv_indices) > 0:
            b_values = [y[i] - np.dot(w, X[i]) for i in sv_indices]
            b = np.median(b_values)  # Median more robust
        else:
            b = 0.0  # No support vectors - degenerate case

    # Compile support vector information
    sv_info = {
        'sv_indices': np.where(is_sv)[0],
        'free_sv_indices': free_sv_indices,
        'bounded_sv_indices': np.where(is_bounded_sv)[0],
        'n_sv': np.sum(is_sv),
        'n_free_sv': np.sum(is_free_sv),
        'n_bounded_sv': np.sum(is_bounded_sv),
    }

    return w, b, sv_info


def verify_kkt_conditions(X, y, alpha, w, b, C, tolerance=1e-4):
    """Verify KKT conditions are satisfied."""
    n_samples = len(y)
    violations = []

    # Compute margins
    margins = y * (X @ w + b)
    # Compute implied slack
    slack = np.maximum(0, 1 - margins)

    print("KKT Verification")
    print("=" * 50)

    # 1. Stationarity for w: w = sum(alpha_i * y_i * x_i)
    w_reconstructed = np.sum((alpha * y)[:, np.newaxis] * X, axis=0)
    w_error = np.linalg.norm(w - w_reconstructed)
    print(f"1. Stationarity (w): ||w - Σαyx|| = {w_error:.2e}")
    if w_error > tolerance:
        violations.append("w stationarity")

    # 2. Stationarity for b: sum(alpha_i * y_i) = 0
    balance = np.sum(alpha * y)
    print(f"2. Stationarity (b): Σαy = {balance:.2e}")
    if abs(balance) > tolerance:
        violations.append("b stationarity")

    # 3. Primal feasibility: y(w'x + b) >= 1 - xi, xi >= 0
    margin_violations = margins < 1 - slack - tolerance
    print(f"3. Primal feasibility: {np.sum(margin_violations)} violations")
    if np.any(margin_violations):
        violations.append("primal feasibility")

    # 4. Dual feasibility: 0 <= alpha <= C
    alpha_violations = (alpha < -tolerance) | (alpha > C + tolerance)
    print(f"4. Dual feasibility: {np.sum(alpha_violations)} violations")
    if np.any(alpha_violations):
        violations.append("dual feasibility")

    # 5. Complementary slackness
    # alpha_i * (y_i(w'x_i + b) - 1 + xi_i) = 0
    # For 0 < alpha_i < C: should have xi_i = 0, margin = 1
    # For alpha_i = C: can have xi_i > 0
    # For alpha_i = 0: constraint can be inactive
    cs_violations = 0
    for i in range(n_samples):
        if alpha[i] > tolerance and alpha[i] < C - tolerance:
            # Free SV: should have margin = 1
            if abs(margins[i] - 1) > tolerance:
                cs_violations += 1
        elif alpha[i] < tolerance:
            # Inactive: margin should be >= 1
            if margins[i] < 1 - tolerance:
                cs_violations += 1
    print(f"5. Complementary slackness: {cs_violations} violations")
    if cs_violations > 0:
        violations.append("complementary slackness")

    print()
    if violations:
        print(f"FAILED: {violations}")
    else:
        print("PASSED: All KKT conditions satisfied")

    return len(violations) == 0
```

Making predictions:
With $(\mathbf{w}^*, b^*)$ recovered, prediction for a new point $\mathbf{x}$ is:
$$f(\mathbf{x}) = \mathbf{w}^{*\top}\mathbf{x} + b^* = \sum_{i \in SV} \alpha_i^* y_i (\mathbf{x}_i^\top\mathbf{x}) + b^*$$
$$\hat{y} = \text{sign}(f(\mathbf{x}))$$
Note: predictions depend only on inner products between the new point and support vectors. This is crucial for the kernel trick.
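A sketch of this support-vector-only prediction (function and argument names are my own; assumes the SV arrays have already been extracted from a solved model):

```python
import numpy as np

def predict(X_new, X_sv, y_sv, alpha_sv, b):
    """f(x) = sum_{i in SV} alpha_i y_i <x_i, x> + b, then take the sign.

    Only inner products between new points and support vectors appear,
    which is exactly what the kernel trick will later exploit.
    """
    decision = (X_new @ X_sv.T) @ (alpha_sv * y_sv) + b
    return np.sign(decision)
```

Replacing `X_new @ X_sv.T` with a kernel evaluation `K(X_new, X_sv)` is the only change needed for nonlinear SVMs.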
For soft margin SVM, strong duality holds: the optimal primal and dual objective values are equal.
Primal optimal value: $$p^* = \frac{1}{2}\|\mathbf{w}^*\|^2 + C\sum_i \xi_i^*$$
Dual optimal value: $$d^* = \sum_i \alpha_i^* - \frac{1}{2}\sum_{i,j} \alpha_i^* \alpha_j^* y_i y_j \mathbf{x}_i^\top\mathbf{x}_j$$
Strong duality theorem: $$p^* = d^*$$
Why does strong duality hold?
The primal is a convex optimization problem (quadratic objective, linear constraints), and Slater's condition is satisfied—there exist points strictly satisfying all inequality constraints. This guarantees strong duality.
Slater's condition for SVM: For any $\mathbf{w}$, we can choose $\xi_i$ large enough that $\xi_i > 0$ strictly and $y_i(\mathbf{w}^\top\mathbf{x}_i + b) > 1 - \xi_i$ strictly. So feasible interior points exist.
The duality gap p* - d* = 0 at optimum. During optimization, tracking the gap provides a convergence certificate: if gap < ε, you're within ε of optimal. Many SVM solvers use this as a stopping criterion.
Practical implications:
Algorithm validation: After solving, compute both primal and dual objectives. They should match (within numerical tolerance).
Stopping criterion: Optimization algorithms can monitor the duality gap $p - d$. When it falls below a threshold, we're near optimal.
Bound computation: Before convergence, the dual value provides a lower bound on the optimal primal value: $d \leq p^*$.
Computing the duality gap:
Given current primal $(\mathbf{w}, b, \boldsymbol{\xi})$ and dual $\boldsymbol{\alpha}$:
$$\text{gap} = P(\mathbf{w}, b, \boldsymbol{\xi}) - D(\boldsymbol{\alpha})$$
where $P$ is the primal objective and $D$ is the dual objective. At optimum, gap = 0.
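A minimal sketch of the gap computation (function names are my own; the slack is taken to be the smallest feasible value implied by the margins rather than stored separately):

```python
import numpy as np

def primal_objective(w, b, X, y, C):
    """P(w, b, xi) with xi_i = max(0, 1 - y_i(w'x_i + b)), the smallest feasible slack."""
    xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * (w @ w) + C * xi.sum()

def dual_objective(alpha, X, y):
    """D(alpha) = sum(alpha) - 0.5 * alpha' Q alpha with Q_ij = y_i y_j x_i' x_j."""
    Q = np.outer(y, y) * (X @ X.T)
    return alpha.sum() - 0.5 * alpha @ Q @ alpha

def duality_gap(w, b, alpha, X, y, C):
    # Nonnegative for dual-feasible alpha; zero exactly at the optimum.
    return primal_objective(w, b, X, y, C) - dual_objective(alpha, X, y)
```

A solver can stop once `duality_gap(...)` drops below a chosen tolerance, since the gap upper-bounds the suboptimality of the current primal iterate.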
| Property | Implication |
|---|---|
| p* = d* | Primal and dual have same optimal value |
| Slater's satisfied | Strong duality holds for SVM |
| Gap = 0 at optimum | Convergence certificate |
| d ≤ p* | Dual provides lower bound during optimization |
| KKT ⟺ Optimality | KKT conditions are necessary and sufficient |
The dual formulation of soft margin SVM reveals deep mathematical structure and enables powerful extensions: data enters only through inner products (the door to kernels), the solution is sparse in the support vectors, and optimality is fully characterized by the KKT conditions.
What's next:
With the dual formulation in hand, we're ready to complete our understanding of soft margin SVM by exploring the geometric and intuitive interpretation of the soft margin. We'll see how the margin, support vectors, and violations combine to form a coherent picture of the SVM decision boundary.
You now understand the dual formulation of soft margin SVM—the mathematical heart of the SVM framework. This knowledge unlocks kernel SVMs, optimization algorithms like SMO, and provides deep insight into SVM behavior. The dual perspective will be essential for the kernel SVM module that follows.