Throughout this module, we have encountered various conditions: stationarity conditions from the Lagrangian, complementary slackness, dual feasibility, and primal feasibility. The Karush-Kuhn-Tucker (KKT) conditions unify all of these into a single, coherent optimality framework.
The KKT conditions are the fundamental optimality conditions of constrained optimization. For convex problems like SVM, a point is optimal if and only if it satisfies the KKT conditions. This makes KKT the theoretical workhorse for verifying candidate solutions, identifying support vectors, computing the bias term, analyzing sensitivity to perturbations, and designing optimization algorithms such as SMO.
By the end of this page, you will: (1) understand the complete KKT conditions for general constrained optimization, (2) derive the specific KKT conditions for hard margin SVM, (3) interpret each condition geometrically and computationally, (4) use KKT for solution verification, and (5) understand why KKT conditions characterize optimality.
Before specializing to SVM, let's establish the general KKT framework for constrained optimization.
The General Problem:
$$\begin{aligned} \min_{\mathbf{x} \in \mathbb{R}^n} \quad & f(\mathbf{x}) \\ \text{s.t.} \quad & g_i(\mathbf{x}) \leq 0, \quad i = 1,\ldots,m \\ & h_j(\mathbf{x}) = 0, \quad j = 1,\ldots,p \end{aligned}$$
The KKT Conditions:
A point $\mathbf{x}^*$ is a KKT point if there exist multipliers $\boldsymbol{\lambda}^* \in \mathbb{R}^m$ and $\boldsymbol{\nu}^* \in \mathbb{R}^p$ such that:
1. Stationarity: $$\nabla f(\mathbf{x}^*) + \sum_{i=1}^m \lambda_i^* \nabla g_i(\mathbf{x}^*) + \sum_{j=1}^p \nu_j^* \nabla h_j(\mathbf{x}^*) = \mathbf{0}$$
2. Primal Feasibility: $$g_i(\mathbf{x}^*) \leq 0 \quad \forall i, \qquad h_j(\mathbf{x}^*) = 0 \quad \forall j$$
3. Dual Feasibility: $$\lambda_i^* \geq 0 \quad \forall i$$
4. Complementary Slackness: $$\lambda_i^* g_i(\mathbf{x}^*) = 0 \quad \forall i$$
For convex optimization problems (convex $f$ and $g_i$, affine $h_j$) satisfying a constraint qualification (e.g., Slater's condition):
$$\mathbf{x}^* \text{ is optimal} \iff \mathbf{x}^* \text{ satisfies KKT conditions}$$
KKT conditions are both necessary AND sufficient for optimality.
Interpretation of Each Condition:
| Condition | Interpretation |
|---|---|
| Stationarity | At optimum, gradient of Lagrangian vanishes; no improving direction within constraints |
| Primal Feasibility | Solution satisfies all original constraints |
| Dual Feasibility | Multipliers for inequalities are non-negative (correct 'sign' for penalization) |
| Complementary Slackness | Multiplier is zero for inactive constraints; positive only for tight constraints |
The Lagrangian Connection:
KKT conditions are optimality conditions for the Lagrangian:
$$\mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\nu}) = f(\mathbf{x}) + \sum_i \lambda_i g_i(\mathbf{x}) + \sum_j \nu_j h_j(\mathbf{x})$$
Stationarity says $\nabla_\mathbf{x} \mathcal{L} = \mathbf{0}$ at the optimum. The other conditions ensure the multipliers and constraints are consistent: the point is feasible, the inequality multipliers have the correct sign, and nonzero multipliers are attached only to active constraints.
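To make the four conditions concrete, here is a minimal sketch on a hypothetical one-dimensional problem chosen for illustration: minimize $(x-2)^2$ subject to $x \leq 1$, whose optimum $x^* = 1$, $\lambda^* = 2$ can be verified by hand.

import numpy as np

# Toy problem (illustrative): min (x - 2)^2  s.t.  g(x) = x - 1 <= 0
# Known optimum: x* = 1 (constraint active), lambda* = 2
f_grad = lambda x: 2 * (x - 2)   # gradient of the objective
g = lambda x: x - 1              # inequality constraint, g(x) <= 0
g_grad = lambda x: 1.0           # gradient of the constraint

x_star, lam_star = 1.0, 2.0
stationarity = np.isclose(f_grad(x_star) + lam_star * g_grad(x_star), 0.0)
primal_feasibility = g(x_star) <= 1e-12
dual_feasibility = lam_star >= 0
comp_slackness = np.isclose(lam_star * g(x_star), 0.0)
print(stationarity, primal_feasibility, dual_feasibility, comp_slackness)  # True True True True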
Let's map the hard margin SVM problem to the general KKT framework.
The SVM Primal:
$$\begin{aligned} \min_{\mathbf{w}, b} \quad & f(\mathbf{w}, b) = \frac{1}{2}\|\mathbf{w}\|^2 \\ \text{s.t.} \quad & g_i(\mathbf{w}, b) = 1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b) \leq 0, \quad i = 1,\ldots,n \end{aligned}$$
Mapping to General Form:
| General | SVM Specific |
|---|---|
| Decision variable $\mathbf{x}$ | $(\mathbf{w}, b) \in \mathbb{R}^{d+1}$ |
| Objective $f$ | $\frac{1}{2}\|\mathbf{w}\|^2$ |
| Inequality constraints $g_i$ | $1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b)$ |
| Equality constraints $h_j$ | None |
| Multipliers $\lambda_i$ | $\alpha_i$ (using SVM notation) |
Gradients Needed:
$$\nabla_\mathbf{w} f = \mathbf{w}$$ $$\frac{\partial f}{\partial b} = 0$$
$$\nabla_\mathbf{w} g_i = -y_i \mathbf{x}_i$$ $$\frac{\partial g_i}{\partial b} = -y_i$$
SVM satisfies the convexity requirements: the objective $\frac{1}{2}\|\mathbf{w}\|^2$ is convex (a quadratic with positive semidefinite Hessian), each constraint $g_i$ is affine in $(\mathbf{w}, b)$ and hence convex, and there are no equality constraints.
Slater's condition: There exists $(\mathbf{w}, b)$ with $g_i(\mathbf{w}, b) < 0$ for all $i$ (strict feasibility). This holds when data is linearly separable with margin.
The SVM Lagrangian (Restated):
$$\mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 + \sum_{i=1}^n \alpha_i \left[ 1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b) \right]$$
$$= \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^n \alpha_i y_i (\mathbf{w}^\top\mathbf{x}_i + b) + \sum_{i=1}^n \alpha_i$$
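As a quick numerical sanity check (a sketch on a hypothetical two-point dataset; the values of $\boldsymbol{\alpha}$, $\mathbf{w}$, and $b$ are arbitrary), the analytic gradients of this Lagrangian can be compared against finite differences:

import numpy as np

# Hypothetical toy data and multipliers, purely to sanity-check the gradients
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.3, 0.7])
w, b = np.array([0.5, -0.2]), 0.1

def lagrangian(w, b):
    margins = y * (X @ w + b)
    return 0.5 * w @ w + np.sum(alpha * (1.0 - margins))

# Analytic gradients from the derivation above
grad_w = w - np.sum((alpha * y)[:, None] * X, axis=0)   # w - sum_i alpha_i y_i x_i
grad_b = -np.sum(alpha * y)                             # -sum_i alpha_i y_i

# Central finite-difference check
eps = 1e-6
fd_w = np.array([(lagrangian(w + eps * e, b) - lagrangian(w - eps * e, b)) / (2 * eps)
                 for e in np.eye(2)])
fd_b = (lagrangian(w, b + eps) - lagrangian(w, b - eps)) / (2 * eps)
print(np.allclose(grad_w, fd_w), np.isclose(grad_b, fd_b))  # True True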
We've already computed the stationarity conditions from this Lagrangian. Now we'll state the complete KKT system.
Applying the general KKT framework to SVM, we obtain the complete set of optimality conditions.
The point $(\mathbf{w}^*, b^*, \boldsymbol{\alpha}^*)$ is optimal if and only if:
1. Stationarity: $$\mathbf{w}^* = \sum_{i=1}^n \alpha_i^* y_i \mathbf{x}_i$$ $$\sum_{i=1}^n \alpha_i^* y_i = 0$$
2. Primal Feasibility: $$y_i(\mathbf{w}^{*\top}\mathbf{x}_i + b^*) \geq 1 \quad \forall i = 1,\ldots,n$$
3. Dual Feasibility: $$\alpha_i^* \geq 0 \quad \forall i = 1,\ldots,n$$
4. Complementary Slackness: $$\alpha_i^* \left[ 1 - y_i(\mathbf{w}^{*\top}\mathbf{x}_i + b^*) \right] = 0 \quad \forall i = 1,\ldots,n$$
Detailed Derivation of Stationarity:
From $\nabla_{\mathbf{w}} \mathcal{L} = \mathbf{0}$:
$$\mathbf{w} - \sum_{i=1}^n \alpha_i y_i \mathbf{x}_i = \mathbf{0}$$ $$\Rightarrow \mathbf{w}^* = \sum_{i=1}^n \alpha_i^* y_i \mathbf{x}_i$$
From $\frac{\partial \mathcal{L}}{\partial b} = 0$:
$$-\sum_{i=1}^n \alpha_i y_i = 0$$ $$\Rightarrow \sum_{i=1}^n \alpha_i^* y_i = 0$$
Complementary Slackness Explained:
$$\alpha_i^* [1 - y_i(\mathbf{w}^{*\top}\mathbf{x}_i + b^*)] = 0$$
This product being zero means that, for each $i$, at least one factor vanishes: either $\alpha_i^* = 0$ or the constraint is tight, $y_i(\mathbf{w}^{*\top}\mathbf{x}_i + b^*) = 1$ (or both). The full case analysis appears below; the table summarizes all of the SVM KKT conditions:
| Condition | Mathematical Statement | Intuitive Meaning |
|---|---|---|
| Stationarity (w) | $\mathbf{w}^* = \sum \alpha_i^* y_i \mathbf{x}_i$ | Weight is a linear combination of data |
| Stationarity (b) | $\sum \alpha_i^* y_i = 0$ | Class balance in support vectors |
| Primal Feasibility | $y_i(\mathbf{w}^{*\top}\mathbf{x}_i + b^*) \geq 1$ | All points correctly classified with margin |
| Dual Feasibility | $\alpha_i^* \geq 0$ | Multipliers are non-negative |
| Complementary Slackness | $\alpha_i^*(1 - y_i(\cdot)) = 0$ | Only margin points have $\alpha > 0$ |
Complementary slackness is the condition that links primal and dual solutions, revealing the support vector structure.
The Condition:
$$\alpha_i^* [1 - y_i(\mathbf{w}^{*\top}\mathbf{x}_i + b^*)] = 0 \quad \forall i$$
Equivalently, using the functional margin $\hat{\gamma}_i = y_i(\mathbf{w}^{*\top}\mathbf{x}_i + b^*)$:
$$\alpha_i^* (1 - \hat{\gamma}_i) = 0$$
Case Analysis:
For each training point $i$, exactly one of the following holds:
Case 1: $\alpha_i^* = 0$ and $\hat{\gamma}_i > 1$ — Non-support-vector (strictly beyond margin)
Case 2: $\alpha_i^* > 0$ and $\hat{\gamma}_i = 1$ — Support vector (on the margin)
Case 3: $\alpha_i^* = 0$ and $\hat{\gamma}_i = 1$ — Boundary case (on margin but not SV, rare in practice)
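The case analysis translates directly into code. Below is a small sketch (the helper name and tolerance are illustrative) that assigns each point to one of the three cases given its multiplier and functional margin:

def classify_points(alpha, margins, tol=1e-6):
    # margins[i] = y_i * (w^T x_i + b), the functional margin of point i
    labels = []
    for a, m in zip(alpha, margins):
        if a > tol and abs(m - 1.0) <= tol:
            labels.append("Case 2: support vector (on margin)")
        elif a <= tol and m > 1.0 + tol:
            labels.append("Case 1: non-support vector (beyond margin)")
        elif a <= tol and abs(m - 1.0) <= tol:
            labels.append("Case 3: on margin but alpha = 0 (degenerate)")
        else:
            labels.append("KKT violated (e.g., alpha > 0 off the margin)")
    return labels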
Why Cases 1 and 2 Dominate:
In general position (data points in a "typical," non-degenerate arrangement), Case 3 essentially never occurs: it would require a point to lie exactly on the margin hyperplane without being needed to define it. For generic data, every point falls into Case 1 (strictly beyond the margin, $\alpha_i^* = 0$) or Case 2 (on the margin with $\alpha_i^* > 0$).
Geometric Picture:
Imagine the margin band expanding symmetrically from width 0 around the decision boundary: the first points it touches on each side are exactly the points that stop it from growing further. Those contact points become the support vectors (positive $\alpha_i^*$), while every point the band never reaches keeps $\alpha_i^* = 0$.
Economic Interpretation:
$\alpha_i^*$ can be viewed as the "shadow price" of constraint $i$: it measures how much the optimal objective would change if constraint $i$ were tightened or relaxed slightly. A zero price means the constraint is not binding; a large price means the constraint actively limits the solution.
The objective $\frac{1}{2}\|\mathbf{w}\|^2$ is inversely related to the margin. Relaxing a binding constraint (moving a support vector outward) allows a larger margin.
One practical use of KKT is verifying that a computed solution is truly optimal.
Verification Procedure:
Given a candidate solution $(\mathbf{w}^*, b^*, \boldsymbol{\alpha}^*)$, check:
1. Stationarity:
- Compute w_check = sum(alpha[i] * y[i] * x[i])
- Verify ||w* - w_check|| < tolerance
- Verify |sum(alpha[i] * y[i])| < tolerance
2. Primal Feasibility:
- For each i: verify y[i] * (w*.T @ x[i] + b*) >= 1 - tolerance
3. Dual Feasibility:
- For each i: verify alpha[i] >= -tolerance
4. Complementary Slackness:
- For each i: compute slack = 1 - y[i] * (w*.T @ x[i] + b*)
- Verify |alpha[i] * slack| < tolerance
Typical Tolerances:
| Condition | Tolerance | Reason |
|---|---|---|
| Stationarity | $10^{-6}$ | Numerical precision of gradients |
| Primal Feasibility | $10^{-6}$ | Slightly below 1 is acceptable |
| Dual Feasibility | $10^{-8}$ | Very small negative okay |
| Complementary Slackness | $10^{-6}$ | Product of two small numbers |
If stationarity fails: Solution didn't converge; increase iterations or check solver tolerance.
If primal feasibility fails: Data may not be linearly separable; use soft margin SVM.
If dual feasibility fails: Negative $\alpha$ indicates solver bug or numerical issues.
If complementary slackness fails: Usually numerical precision issue; check magnitude.
Python Verification Code:
import numpy as np

def verify_kkt(X, y, w, b, alpha, tol=1e-6):
    """Check the hard margin KKT conditions for a candidate solution."""
    n = len(y)
    violations = []
    # Stationarity for w: w should equal sum_i alpha_i y_i x_i
    w_check = sum(alpha[i] * y[i] * X[i] for i in range(n))
    if np.linalg.norm(w - w_check) > tol:
        violations.append(f"w stationarity: ||w - w_check|| = {np.linalg.norm(w - w_check)}")
    # Stationarity for b: sum_i alpha_i y_i should vanish
    alpha_y_sum = sum(alpha[i] * y[i] for i in range(n))
    if abs(alpha_y_sum) > tol:
        violations.append(f"b stationarity: sum(alpha*y) = {alpha_y_sum}")
    # Primal and dual feasibility, complementary slackness (per point)
    for i in range(n):
        margin = y[i] * (np.dot(w, X[i]) + b)
        if margin < 1 - tol:
            violations.append(f"Primal feasibility i={i}: margin = {margin}")
        if alpha[i] < -tol:
            violations.append(f"Dual feasibility i={i}: alpha = {alpha[i]}")
        slack = 1 - margin
        if abs(alpha[i] * slack) > tol:
            violations.append(f"Comp slackness i={i}: alpha*slack = {alpha[i]*slack}")
    return len(violations) == 0, violations
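As a usage example, consider a hypothetical two-point dataset whose exact solution is known in closed form ($\mathbf{w}^* = (1, 0)$, $b^* = 0$, $\alpha^* = (0.5, 0.5)$); the checker should report no violations:

# Uses numpy and verify_kkt as defined above
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
w_star = np.array([1.0, 0.0])
b_star = 0.0
alpha_star = np.array([0.5, 0.5])

ok, violations = verify_kkt(X, y, w_star, b_star, alpha_star)
print(ok, violations)   # True []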
The bias term $b^*$ is not directly given by stationarity (which gives $\nabla_b \mathcal{L} = 0 \Rightarrow \sum \alpha_i y_i = 0$). We compute it using complementary slackness.
From Complementary Slackness:
For any support vector $k$ (where $\alpha_k^* > 0$), complementary slackness gives:
$$y_k(\mathbf{w}^{*\top}\mathbf{x}_k + b^*) = 1$$
Solving for $b^*$:
$$b^* = y_k - \mathbf{w}^{*\top}\mathbf{x}_k$$
Substituting $\mathbf{w}^* = \sum_i \alpha_i^* y_i \mathbf{x}_i$:
$$b^* = y_k - \sum_{i=1}^n \alpha_i^* y_i (\mathbf{x}_i^\top\mathbf{x}_k)$$
Any single SV gives a valid $b^*$, but numerical precision varies. For stability, average over all SVs:
$$b^* = \frac{1}{|SV|} \sum_{k \in SV} \left( y_k - \sum_{i \in SV} \alpha_i^* y_i (\mathbf{x}_i^\top\mathbf{x}_k) \right)$$
This reduces the impact of any single SV's numerical error.
Why Multiple SV Formulas Are Consistent:
Suppose we compute $b^*$ from two different SVs $k$ and $\ell$. Both should give the same value because:
$$y_k(\mathbf{w}^{*\top}\mathbf{x}_k + b^*) = 1 \quad \text{and} \quad y_\ell(\mathbf{w}^{*\top}\mathbf{x}_\ell + b^*) = 1$$
Both points lie exactly on the margin hyperplane for their respective classes; with $\mathbf{w}^*$ fixed, each equation determines $b^*$ uniquely, and both determinations agree at the true optimum.
Numerically, small differences arise due to floating-point round-off, finite solver tolerances, and multipliers $\alpha_i^*$ that satisfy the KKT conditions only approximately.
Practical Code:
def compute_b(X, y, alpha, tol=1e-6):
    """Compute the bias from complementary slackness, averaged over all SVs."""
    # Find support vectors (alpha strictly positive, up to tolerance)
    sv_indices = np.where(alpha > tol)[0]
    # Compute b from each SV
    b_values = []
    for k in sv_indices:
        w_dot_xk = sum(alpha[i] * y[i] * np.dot(X[i], X[k])
                       for i in sv_indices)
        b_k = y[k] - w_dot_xk
        b_values.append(b_k)
    # Return the average for numerical stability
    return np.mean(b_values)
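Continuing the hypothetical two-point example from the verification code above, this recovers $b^* = 0$:

X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])
print(compute_b(X, y, alpha))   # 0.0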
KKT conditions also enable sensitivity analysis—understanding how the optimal solution changes when the problem data changes.
The Lagrange Multiplier Interpretation:
Consider perturbing constraint $i$ to $y_i(\mathbf{w}^\top\mathbf{x}_i + b) \geq 1 + \epsilon$. The optimal objective value changes approximately as:
$$p^*(\epsilon) \approx p^*(0) + \alpha_i^* \cdot \epsilon$$
This shows that tightening an inactive constraint ($\alpha_i^* = 0$) leaves the optimal objective unchanged to first order, while tightening an active constraint ($\alpha_i^* > 0$) increases it in proportion to the multiplier: larger $\alpha_i^*$ means a more influential constraint.
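This relation can be checked numerically. The sketch below (assuming scipy is available; the two-point dataset, the SLSQP solver choice, and the known multiplier value are illustrative, not from the text) tightens the first constraint by $\epsilon$ and compares the objective change to $\alpha_1^* \epsilon$:

import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])

def primal_value(eps):
    # Variables z = (w1, w2, b); constraint 0 is tightened to margin >= 1 + eps
    cons = [{'type': 'ineq',
             'fun': lambda z, i=i: y[i] * (X[i] @ z[:2] + z[2]) - (1.0 + (eps if i == 0 else 0.0))}
            for i in range(len(y))]
    res = minimize(lambda z: 0.5 * (z[0]**2 + z[1]**2), x0=np.zeros(3),
                   method='SLSQP', constraints=cons)
    return res.fun

eps = 1e-3
alpha_1 = 0.5  # known multiplier for constraint 0 in this toy problem
print(primal_value(eps) - primal_value(0.0), alpha_1 * eps)  # both approximately 5e-4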
Practical Implications:
Effect of Data Perturbation:
If a training point $\mathbf{x}_i$ is perturbed by a small $\delta\mathbf{x}_i$: when $\alpha_i^* = 0$ and the perturbed point stays strictly outside the margin band, the solution $(\mathbf{w}^*, b^*)$ does not change at all; when $\alpha_i^* > 0$ (a support vector), the hyperplane generally shifts, since $\mathbf{w}^*$ depends directly on $\mathbf{x}_i$ through the stationarity condition.
Stability Analysis:
The SVM solution is stable in the sense that it depends only on the support vectors: non-support vectors can be removed, or moved anywhere that keeps them outside the margin band, without changing $(\mathbf{w}^*, b^*)$.
Outlier Sensitivity:
Hard margin SVM is sensitive to outliers that are mislabeled or extreme: every point, including an outlier, must be classified correctly with functional margin at least 1, so a single bad point can force a drastically smaller margin or even make the problem infeasible.
This motivates soft margin SVM, which handles outliers gracefully.
Many SVM optimization algorithms directly work to satisfy KKT conditions.
Newton's Method for KKT:
The KKT system can be viewed as finding zeros of:
$$F(\mathbf{w}, b, \boldsymbol{\alpha}) = \begin{pmatrix} \mathbf{w} - \sum_i \alpha_i y_i \mathbf{x}_i \\ \sum_i \alpha_i y_i \\ \boldsymbol{\alpha} \odot (\mathbf{1} - \mathbf{y} \odot (\mathbf{X}\mathbf{w} + b\mathbf{1})) \end{pmatrix} = \mathbf{0}$$
Newton's method solves this by iterating: $$\mathbf{z}^{k+1} = \mathbf{z}^k - [\nabla F(\mathbf{z}^k)]^{-1} F(\mathbf{z}^k)$$
The challenge: complementary slackness is non-smooth (product of two terms).
Interior Point and KKT:
Interior point methods relax complementary slackness to $\alpha_i s_i = \mu$ (barrier parameter), then drive $\mu \to 0$. At each step, they solve a smoothed version of KKT.
The SMO algorithm uses KKT violations to select which variables to optimize: at each iteration it chooses a pair of multipliers whose KKT conditions are violated, solves the dual analytically over just that pair while holding all other multipliers fixed, and repeats until no violation exceeds the tolerance.
Convergence is guaranteed because every step taken on a violating pair strictly increases the dual objective, which is bounded above.
Convergence Criteria:
Algorithms typically terminate when:
$$\max_i |\text{KKT violation}_i| < \epsilon$$
where the KKT violation for constraint $i$ measures how far the pair $(\alpha_i, \hat{\gamma}_i)$ is from jointly satisfying primal feasibility, dual feasibility, and complementary slackness.
Specific SMO Violation Measure:
For variable $\alpha_i$, the conditions require $\hat{\gamma}_i \geq 1$ when $\alpha_i = 0$ and $\hat{\gamma}_i = 1$ when $\alpha_i > 0$; the violation is the amount by which the relevant requirement fails, e.g. $\max(0,\, 1 - \hat{\gamma}_i)$ for a point with $\alpha_i = 0$ and $|\hat{\gamma}_i - 1|$ for a point with $\alpha_i > 0$.
The variable with highest violation is prioritized for optimization.
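A small illustrative sketch of such a per-point violation measure for the hard margin case (a simplified stand-in for the exact working-set heuristics used inside SMO implementations):

import numpy as np

def kkt_violation(X, y, w, b, alpha, tol=1e-8):
    # Per-point violation of the hard margin KKT conditions (illustrative)
    margins = y * (X @ w + b)
    # alpha == 0: only primal feasibility (margin >= 1) is required
    viol_zero = np.maximum(0.0, 1.0 - margins)
    # alpha > 0: the point must lie exactly on the margin
    viol_pos = np.abs(1.0 - margins)
    return np.where(alpha > tol, viol_pos, viol_zero)

# The point with the largest violation would be selected first:
# i_next = np.argmax(kkt_violation(X, y, w, b, alpha))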
The KKT framework naturally extends to soft margin SVM, which we'll study in the next module.
Soft Margin Primal:
$$\begin{aligned} \min_{\mathbf{w}, b, \boldsymbol{\xi}} \quad & \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^n \xi_i \\ \text{s.t.} \quad & y_i(\mathbf{w}^\top\mathbf{x}_i + b) \geq 1 - \xi_i, \quad i = 1,\ldots,n \\ & \xi_i \geq 0, \quad i = 1,\ldots,n \end{aligned}$$
The slack variables $\xi_i$ allow margin violations, with penalty $C$.
Modified KKT Conditions:
The key difference is an upper bound on each $\alpha_i$. With multipliers $\alpha_i \geq 0$ for the margin constraints and $\mu_i \geq 0$ for $\xi_i \geq 0$, stationarity in $\xi_i$ gives $C - \alpha_i - \mu_i = 0$, hence $0 \leq \alpha_i \leq C$. Complementary slackness now applies to both constraint sets: $\alpha_i\left[1 - \xi_i - y_i(\mathbf{w}^\top\mathbf{x}_i + b)\right] = 0$ and $(C - \alpha_i)\,\xi_i = 0$. Stationarity in $\mathbf{w}$ and $b$ is unchanged.
In soft margin SVM, $\alpha_i \in [0, C]$ instead of just $\alpha_i \geq 0$. This bounded region is called a "box constraint." Points with $\alpha_i = C$ are margin violators—they pay the maximum penalty.
Classification of Points (Soft Margin):
| $\alpha_i$ Value | Condition | Classification |
|---|---|---|
| $\alpha_i = 0$ | $\hat{\gamma}_i > 1$ | Correctly classified, beyond margin |
| $0 < \alpha_i < C$ | $\hat{\gamma}_i = 1$ | Support vector on margin |
| $\alpha_i = C$ | $\hat{\gamma}_i \leq 1$ | Margin violator (possibly misclassified) |
This three-way classification (vs. two-way for hard margin) reflects the additional flexibility of soft margin.
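Looking ahead, a sketch of that three-way classification in code (variable names and tolerance are illustrative):

import numpy as np

def classify_soft_margin(alpha, C, tol=1e-8):
    # Classify points by where their multiplier sits in the box [0, C]
    labels = np.empty(len(alpha), dtype=object)
    labels[alpha <= tol] = "non-SV (beyond margin)"
    labels[(alpha > tol) & (alpha < C - tol)] = "free SV (on margin)"
    labels[alpha >= C - tol] = "bound SV (margin violator)"
    return labels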
The Regularization-Margin Tradeoff:
$C$ controls the tradeoff: a large $C$ penalizes slack heavily, tolerating few margin violations and behaving much like hard margin; a small $C$ accepts more violations in exchange for a larger margin and stronger regularization.
KKT conditions with $C$ show how this tradeoff is encoded mathematically.
The KKT conditions complete our theoretical treatment of hard margin SVM. They encapsulate everything we've learned into a single, unified framework.
You have now completed the theoretical foundations of Hard Margin SVM! You understand:
✓ The optimization formulation and why it's tractable
✓ Quadratic programming and solution methods
✓ Lagrangian duality and the dual problem derivation
✓ Strong duality and support vector identification
✓ Complete KKT conditions for optimality
The next module covers Soft Margin SVM, extending these ideas to handle non-separable data through slack variables and the regularization parameter $C$.
The Complete Hard Margin SVM Theory:
$$\boxed{\begin{aligned} &\text{Primal: } \min \frac{1}{2}\|\mathbf{w}\|^2 \text{ s.t. } y_i(\mathbf{w}^\top\mathbf{x}_i + b) \geq 1 \\ &\text{Dual: } \max \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i^\top\mathbf{x}_j \text{ s.t. } \alpha_i \geq 0, \ \sum_i \alpha_i y_i = 0 \\ &\text{KKT: Stationarity, Feasibility, Complementary Slackness} \\ &\text{Solution: } \mathbf{w}^* = \sum_{i \in SV} \alpha_i^* y_i \mathbf{x}_i, \quad b^* = y_k - \mathbf{w}^{*\top}\mathbf{x}_k \text{ for } k \in SV \end{aligned}}$$
This elegant mathematical structure—primal, dual, KKT—is the foundation upon which all practical SVM implementations are built.