Throughout this module, we have encountered various conditions: stationarity conditions from the Lagrangian, complementary slackness, dual feasibility, and primal feasibility. The Karush-Kuhn-Tucker (KKT) conditions unify all of these into a single, coherent optimality framework.
The KKT conditions are the fundamental optimality conditions of constrained optimization. For convex problems like SVM, a point is optimal if and only if it satisfies the KKT conditions. This makes KKT the theoretical workhorse for verifying candidate solutions, identifying support vectors, computing the bias term, analyzing sensitivity to perturbations, and designing optimization algorithms such as SMO.
By the end of this page, you will: (1) understand the complete KKT conditions for general constrained optimization, (2) derive the specific KKT conditions for hard margin SVM, (3) interpret each condition geometrically and computationally, (4) use KKT for solution verification, and (5) understand why KKT conditions characterize optimality.
Before specializing to SVM, let's establish the general KKT framework for constrained optimization.
The General Problem:
$$\begin{aligned} \min_{\mathbf{x} \in \mathbb{R}^n} \quad & f(\mathbf{x}) \\ \text{s.t.} \quad & g_i(\mathbf{x}) \leq 0, \quad i = 1,\ldots,m \\ & h_j(\mathbf{x}) = 0, \quad j = 1,\ldots,p \end{aligned}$$
The KKT Conditions:
A point $\mathbf{x}^*$ is a KKT point if there exist multipliers $\boldsymbol{\lambda}^* \in \mathbb{R}^m$ and $\boldsymbol{\nu}^* \in \mathbb{R}^p$ such that:
1. Stationarity: $$\nabla f(\mathbf{x}^*) + \sum_{i=1}^m \lambda_i^* \nabla g_i(\mathbf{x}^*) + \sum_{j=1}^p \nu_j^* \nabla h_j(\mathbf{x}^*) = \mathbf{0}$$
2. Primal Feasibility: $$g_i(\mathbf{x}^*) \leq 0 \quad \forall i, \qquad h_j(\mathbf{x}^*) = 0 \quad \forall j$$
3. Dual Feasibility: $$\lambda_i^* \geq 0 \quad \forall i$$
4. Complementary Slackness: $$\lambda_i^* g_i(\mathbf{x}^*) = 0 \quad \forall i$$
For convex optimization problems (convex $f$ and $g_i$, affine $h_j$) satisfying a constraint qualification (e.g., Slater's condition):
$$\mathbf{x}^* \text{ is optimal} \iff \mathbf{x}^* \text{ satisfies KKT conditions}$$
KKT conditions are both necessary AND sufficient for optimality.
Interpretation of Each Condition:
| Condition | Interpretation |
|---|---|
| Stationarity | At optimum, gradient of Lagrangian vanishes; no improving direction within constraints |
| Primal Feasibility | Solution satisfies all original constraints |
| Dual Feasibility | Multipliers for inequalities are non-negative (correct 'sign' for penalization) |
| Complementary Slackness | Multiplier is zero for inactive constraints; positive only for tight constraints |
The Lagrangian Connection:
KKT conditions are optimality conditions for the Lagrangian:
$$\mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\nu}) = f(\mathbf{x}) + \sum_i \lambda_i g_i(\mathbf{x}) + \sum_j \nu_j h_j(\mathbf{x})$$
Stationarity says $\nabla_\mathbf{x} \mathcal{L} = \mathbf{0}$ at the optimum. The other conditions ensure the multipliers and constraints are consistent: the point is feasible, the inequality multipliers have the correct sign, and nonzero multipliers are attached only to active constraints.
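To make the four conditions concrete, here is a minimal sketch on a hypothetical one-dimensional problem chosen for illustration: minimize $(x-2)^2$ subject to $x \leq 1$, whose optimum $x^* = 1$, $\lambda^* = 2$ can be verified by hand.

import numpy as np

# Toy problem (illustrative): min (x - 2)^2  s.t.  g(x) = x - 1 <= 0
# Known optimum: x* = 1 (constraint active), lambda* = 2
f_grad = lambda x: 2 * (x - 2)   # gradient of the objective
g = lambda x: x - 1              # inequality constraint, g(x) <= 0
g_grad = lambda x: 1.0           # gradient of the constraint

x_star, lam_star = 1.0, 2.0
stationarity = np.isclose(f_grad(x_star) + lam_star * g_grad(x_star), 0.0)
primal_feasibility = g(x_star) <= 1e-12
dual_feasibility = lam_star >= 0
comp_slackness = np.isclose(lam_star * g(x_star), 0.0)
print(stationarity, primal_feasibility, dual_feasibility, comp_slackness)  # True True True True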
Let's map the hard margin SVM problem to the general KKT framework.
The SVM Primal:
$$\begin{aligned} \min_{\mathbf{w}, b} \quad & f(\mathbf{w}, b) = \frac{1}{2}\|\mathbf{w}\|^2 \\ \text{s.t.} \quad & g_i(\mathbf{w}, b) = 1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b) \leq 0, \quad i = 1,\ldots,n \end{aligned}$$
Mapping to General Form:
| General | SVM Specific |
|---|---|
| Decision variable $\mathbf{x}$ | $(\mathbf{w}, b) \in \mathbb{R}^{d+1}$ |
| Objective $f$ | $\frac{1}{2}\|\mathbf{w}\|^2$ |
| Inequality constraints $g_i$ | $1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b)$ |
| Equality constraints $h_j$ | None |
| Multipliers $\lambda_i$ | $\alpha_i$ (using SVM notation) |
Gradients Needed:
$$\nabla_\mathbf{w} f = \mathbf{w}$$ $$\frac{\partial f}{\partial b} = 0$$
$$\nabla_\mathbf{w} g_i = -y_i \mathbf{x}_i$$ $$\frac{\partial g_i}{\partial b} = -y_i$$
SVM satisfies the convexity requirements: the objective $\frac{1}{2}\|\mathbf{w}\|^2$ is convex (a quadratic with positive semidefinite Hessian), each constraint $g_i$ is affine in $(\mathbf{w}, b)$ and hence convex, and there are no equality constraints.
Slater's condition: There exists $(\mathbf{w}, b)$ with $g_i(\mathbf{w}, b) < 0$ for all $i$ (strict feasibility). This holds when data is linearly separable with margin.
The SVM Lagrangian (Restated):
$$\mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 + \sum_{i=1}^n \alpha_i \left[ 1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b) \right]$$
$$= \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^n \alpha_i y_i (\mathbf{w}^\top\mathbf{x}_i + b) + \sum_{i=1}^n \alpha_i$$
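As a quick numerical sanity check (a sketch on a hypothetical two-point dataset; the values of $\boldsymbol{\alpha}$, $\mathbf{w}$, and $b$ are arbitrary), the analytic gradients of this Lagrangian can be compared against finite differences:

import numpy as np

# Hypothetical toy data and multipliers, purely to sanity-check the gradients
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.3, 0.7])
w, b = np.array([0.5, -0.2]), 0.1

def lagrangian(w, b):
    margins = y * (X @ w + b)
    return 0.5 * w @ w + np.sum(alpha * (1.0 - margins))

# Analytic gradients from the derivation above
grad_w = w - np.sum((alpha * y)[:, None] * X, axis=0)   # w - sum_i alpha_i y_i x_i
grad_b = -np.sum(alpha * y)                             # -sum_i alpha_i y_i

# Central finite-difference check
eps = 1e-6
fd_w = np.array([(lagrangian(w + eps * e, b) - lagrangian(w - eps * e, b)) / (2 * eps)
                 for e in np.eye(2)])
fd_b = (lagrangian(w, b + eps) - lagrangian(w, b - eps)) / (2 * eps)
print(np.allclose(grad_w, fd_w), np.isclose(grad_b, fd_b))  # True True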
We've already computed the stationarity conditions from this Lagrangian. Now we'll state the complete KKT system.
Applying the general KKT framework to SVM, we obtain the complete set of optimality conditions.
The point $(\mathbf{w}^*, b^*, \boldsymbol{\alpha}^*)$ is optimal if and only if:
1. Stationarity: $$\mathbf{w}^* = \sum_{i=1}^n \alpha_i^* y_i \mathbf{x}_i$$ $$\sum_{i=1}^n \alpha_i^* y_i = 0$$
2. Primal Feasibility: $$y_i(\mathbf{w}^{*\top}\mathbf{x}_i + b^*) \geq 1 \quad \forall i = 1,\ldots,n$$
3. Dual Feasibility: $$\alpha_i^* \geq 0 \quad \forall i = 1,\ldots,n$$
4. Complementary Slackness: $$\alpha_i^* \left[ 1 - y_i(\mathbf{w}^{*\top}\mathbf{x}_i + b^*) \right] = 0 \quad \forall i = 1,\ldots,n$$
Detailed Derivation of Stationarity:
From $\nabla_{\mathbf{w}} \mathcal{L} = \mathbf{0}$:
$$\mathbf{w} - \sum_{i=1}^n \alpha_i y_i \mathbf{x}_i = \mathbf{0}$$ $$\Rightarrow \mathbf{w}^* = \sum_{i=1}^n \alpha_i^* y_i \mathbf{x}_i$$
From $\frac{\partial \mathcal{L}}{\partial b} = 0$:
$$-\sum_{i=1}^n \alpha_i y_i = 0$$ $$\Rightarrow \sum_{i=1}^n \alpha_i^* y_i = 0$$
Complementary Slackness Explained:
$$\alpha_i^* [1 - y_i(\mathbf{w}^{*\top}\mathbf{x}_i + b^*)] = 0$$
This product being zero means that, for each $i$, at least one factor vanishes: either $\alpha_i^* = 0$ or the constraint is tight, $y_i(\mathbf{w}^{*\top}\mathbf{x}_i + b^*) = 1$ (or both). The full case analysis appears below; the table summarizes all of the SVM KKT conditions:
| Condition | Mathematical Statement | Intuitive Meaning |
|---|---|---|
| Stationarity (w) | $\mathbf{w}^* = \sum \alpha_i^* y_i \mathbf{x}_i$ | Weight is a linear combination of data |
| Stationarity (b) | $\sum \alpha_i^* y_i = 0$ | Class balance in support vectors |
| Primal Feasibility | $y_i(\mathbf{w}^{*\top}\mathbf{x}_i + b^*) \geq 1$ | All points correctly classified with margin |
| Dual Feasibility | $\alpha_i^* \geq 0$ | Multipliers are non-negative |
| Complementary Slackness | $\alpha_i^*(1 - y_i(\cdot)) = 0$ | Only margin points have $\alpha > 0$ |
Complementary slackness is the condition that links primal and dual solutions, revealing the support vector structure.
The Condition:
$$\alpha_i^* [1 - y_i(\mathbf{w}^{*\top}\mathbf{x}_i + b^*)] = 0 \quad \forall i$$
Equivalently, using the functional margin $\hat{\gamma}_i = y_i(\mathbf{w}^{*\top}\mathbf{x}_i + b^*)$:
$$\alpha_i^* (1 - \hat{\gamma}_i) = 0$$
Case Analysis:
For each training point $i$, exactly one of the following holds:
Case 1: $\alpha_i^* = 0$ and $\hat{\gamma}_i > 1$ — Non-support-vector (strictly beyond margin)
Case 2: $\alpha_i^* > 0$ and $\hat{\gamma}_i = 1$ — Support vector (on the margin)
Case 3: $\alpha_i^* = 0$ and $\hat{\gamma}_i = 1$ — Boundary case (on margin but not SV, rare in practice)
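The case analysis translates directly into code. Below is a small sketch (the helper name and tolerance are illustrative) that assigns each point to one of the three cases given its multiplier and functional margin:

def classify_points(alpha, margins, tol=1e-6):
    # margins[i] = y_i * (w^T x_i + b), the functional margin of point i
    labels = []
    for a, m in zip(alpha, margins):
        if a > tol and abs(m - 1.0) <= tol:
            labels.append("Case 2: support vector (on margin)")
        elif a <= tol and m > 1.0 + tol:
            labels.append("Case 1: non-support vector (beyond margin)")
        elif a <= tol and abs(m - 1.0) <= tol:
            labels.append("Case 3: on margin but alpha = 0 (degenerate)")
        else:
            labels.append("KKT violated (e.g., alpha > 0 off the margin)")
    return labels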
Why Cases 1 and 2 Dominate:
In general position (data points in a "typical," non-degenerate arrangement), Case 3 essentially never occurs: it would require a point to lie exactly on the margin hyperplane without being needed to define it. For generic data, every point falls into Case 1 (strictly beyond the margin, $\alpha_i^* = 0$) or Case 2 (on the margin with $\alpha_i^* > 0$).
Geometric Picture:
Imagine the margin band expanding symmetrically from width 0 around the decision boundary: the first points it touches on each side are exactly the points that stop it from growing further. Those contact points become the support vectors (positive $\alpha_i^*$), while every point the band never reaches keeps $\alpha_i^* = 0$.
Economic Interpretation:
$\alpha_i^*$ can be viewed as the "shadow price" of constraint $i$: it measures how much the optimal objective would change if constraint $i$ were tightened or relaxed slightly. A zero price means the constraint is not binding; a large price means the constraint actively limits the solution.
The objective $\frac{1}{2}\|\mathbf{w}\|^2$ is inversely related to the margin. Relaxing a binding constraint (moving a support vector outward) allows a larger margin.
One practical use of KKT is verifying that a computed solution is truly optimal.
Verification Procedure:
Given a candidate solution $(\mathbf{w}^*, b^*, \boldsymbol{\alpha}^*)$, check:
1. Stationarity:
- Compute w_check = sum(alpha[i] * y[i] * x[i])
- Verify ||w* - w_check|| < tolerance
- Verify |sum(alpha[i] * y[i])| < tolerance
2. Primal Feasibility:
- For each i: verify y[i] * (w*.T @ x[i] + b*) >= 1 - tolerance
3. Dual Feasibility:
- For each i: verify alpha[i] >= -tolerance
4. Complementary Slackness:
- For each i: compute slack = 1 - y[i] * (w*.T @ x[i] + b*)
- Verify |alpha[i] * slack| < tolerance
Typical Tolerances:
| Condition | Tolerance | Reason |
|---|---|---|
| Stationarity | $10^{-6}$ | Numerical precision of gradients |
| Primal Feasibility | $10^{-6}$ | Slightly below 1 is acceptable |
| Dual Feasibility | $10^{-8}$ | Very small negative okay |
| Complementary Slackness | $10^{-6}$ | Product of two small numbers |
If stationarity fails: Solution didn't converge; increase iterations or check solver tolerance.
If primal feasibility fails: Data may not be linearly separable; use soft margin SVM.
If dual feasibility fails: Negative $\alpha$ indicates solver bug or numerical issues.
If complementary slackness fails: Usually numerical precision issue; check magnitude.
Python Verification Code:
import numpy as np

def verify_kkt(X, y, w, b, alpha, tol=1e-6):
    """Check the hard margin KKT conditions for a candidate solution."""
    n = len(y)
    violations = []
    # Stationarity for w: w should equal sum_i alpha_i y_i x_i
    w_check = sum(alpha[i] * y[i] * X[i] for i in range(n))
    if np.linalg.norm(w - w_check) > tol:
        violations.append(f"w stationarity: ||w - w_check|| = {np.linalg.norm(w - w_check)}")
    # Stationarity for b: sum_i alpha_i y_i should vanish
    alpha_y_sum = sum(alpha[i] * y[i] for i in range(n))
    if abs(alpha_y_sum) > tol:
        violations.append(f"b stationarity: sum(alpha*y) = {alpha_y_sum}")
    # Primal and dual feasibility, complementary slackness (per point)
    for i in range(n):
        margin = y[i] * (np.dot(w, X[i]) + b)
        if margin < 1 - tol:
            violations.append(f"Primal feasibility i={i}: margin = {margin}")
        if alpha[i] < -tol:
            violations.append(f"Dual feasibility i={i}: alpha = {alpha[i]}")
        slack = 1 - margin
        if abs(alpha[i] * slack) > tol:
            violations.append(f"Comp slackness i={i}: alpha*slack = {alpha[i]*slack}")
    return len(violations) == 0, violations
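As a usage example, consider a hypothetical two-point dataset whose exact solution is known in closed form ($\mathbf{w}^* = (1, 0)$, $b^* = 0$, $\alpha^* = (0.5, 0.5)$); the checker should report no violations:

# Uses numpy and verify_kkt as defined above
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
w_star = np.array([1.0, 0.0])
b_star = 0.0
alpha_star = np.array([0.5, 0.5])

ok, violations = verify_kkt(X, y, w_star, b_star, alpha_star)
print(ok, violations)   # True []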
The bias term $b^*$ is not directly given by stationarity (which gives $\nabla_b \mathcal{L} = 0 \Rightarrow \sum \alpha_i y_i = 0$). We compute it using complementary slackness.
From Complementary Slackness:
For any support vector $k$ (where $\alpha_k^* > 0$), complementary slackness gives:
$$y_k(\mathbf{w}^{*\top}\mathbf{x}_k + b^*) = 1$$
Solving for $b^*$:
$$b^* = y_k - \mathbf{w}^{*\top}\mathbf{x}_k$$
Substituting $\mathbf{w}^* = \sum_i \alpha_i^* y_i \mathbf{x}_i$:
$$b^* = y_k - \sum_{i=1}^n \alpha_i^* y_i (\mathbf{x}_i^\top\mathbf{x}_k)$$
Any single SV gives a valid $b^*$, but numerical precision varies. For stability, average over all SVs:
$$b^* = \frac{1}{|SV|} \sum_{k \in SV} \left( y_k - \sum_{i \in SV} \alpha_i^* y_i (\mathbf{x}_i^\top\mathbf{x}_k) \right)$$
This reduces the impact of any single SV's numerical error.
Why Multiple SV Formulas Are Consistent:
Suppose we compute $b^*$ from two different SVs $k$ and $\ell$. Both should give the same value because:
$$y_k(\mathbf{w}^{*\top}\mathbf{x}_k + b^*) = 1 \quad \text{and} \quad y_\ell(\mathbf{w}^{*\top}\mathbf{x}_\ell + b^*) = 1$$
Both points lie exactly on the margin hyperplane for their respective classes; with $\mathbf{w}^*$ fixed, each equation determines $b^*$ uniquely, and both determinations agree at the true optimum.
Numerically, small differences arise due to floating-point round-off, finite solver tolerances, and multipliers $\alpha_i^*$ that satisfy the KKT conditions only approximately.
Practical Code:
def compute_b(X, y, alpha, tol=1e-6):
    """Compute the bias from complementary slackness, averaged over all SVs."""
    # Find support vectors (alpha strictly positive, up to tolerance)
    sv_indices = np.where(alpha > tol)[0]
    # Compute b from each SV
    b_values = []
    for k in sv_indices:
        w_dot_xk = sum(alpha[i] * y[i] * np.dot(X[i], X[k])
                       for i in sv_indices)
        b_k = y[k] - w_dot_xk
        b_values.append(b_k)
    # Return the average for numerical stability
    return np.mean(b_values)
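Continuing the hypothetical two-point example from the verification code above, this recovers $b^* = 0$:

X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])
print(compute_b(X, y, alpha))   # 0.0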
KKT conditions also enable sensitivity analysis—understanding how the optimal solution changes when the problem data changes.
The Lagrange Multiplier Interpretation:
Consider perturbing constraint $i$ to $y_i(\mathbf{w}^\top\mathbf{x}_i + b) \geq 1 + \epsilon$. The optimal objective value changes approximately as:
$$p^*(\epsilon) \approx p^*(0) + \alpha_i^* \cdot \epsilon$$
This shows that tightening an inactive constraint ($\alpha_i^* = 0$) leaves the optimal objective unchanged to first order, while tightening an active constraint ($\alpha_i^* > 0$) increases it in proportion to the multiplier: larger $\alpha_i^*$ means a more influential constraint.
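This relation can be checked numerically. The sketch below (assuming scipy is available; the two-point dataset, the SLSQP solver choice, and the known multiplier value are illustrative, not from the text) tightens the first constraint by $\epsilon$ and compares the objective change to $\alpha_1^* \epsilon$:

import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])

def primal_value(eps):
    # Variables z = (w1, w2, b); constraint 0 is tightened to margin >= 1 + eps
    cons = [{'type': 'ineq',
             'fun': lambda z, i=i: y[i] * (X[i] @ z[:2] + z[2]) - (1.0 + (eps if i == 0 else 0.0))}
            for i in range(len(y))]
    res = minimize(lambda z: 0.5 * (z[0]**2 + z[1]**2), x0=np.zeros(3),
                   method='SLSQP', constraints=cons)
    return res.fun

eps = 1e-3
alpha_1 = 0.5  # known multiplier for constraint 0 in this toy problem
print(primal_value(eps) - primal_value(0.0), alpha_1 * eps)  # both approximately 5e-4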
Practical Implications:
Effect of Data Perturbation:
If a training point $\mathbf{x}_i$ is perturbed by a small $\delta\mathbf{x}_i$: when $\alpha_i^* = 0$ and the perturbed point stays strictly outside the margin band, the solution $(\mathbf{w}^*, b^*)$ does not change at all; when $\alpha_i^* > 0$ (a support vector), the hyperplane generally shifts, since $\mathbf{w}^*$ depends directly on $\mathbf{x}_i$ through the stationarity condition.
Stability Analysis:
The SVM solution is stable in the sense that it depends only on the support vectors: non-support vectors can be removed, or moved anywhere that keeps them outside the margin band, without changing $(\mathbf{w}^*, b^*)$.
Outlier Sensitivity:
Hard margin SVM is sensitive to outliers that are mislabeled or extreme: every point, including an outlier, must be classified correctly with functional margin at least 1, so a single bad point can force a drastically smaller margin or even make the problem infeasible.
This motivates soft margin SVM, which handles outliers gracefully.
Many SVM optimization algorithms directly work to satisfy KKT conditions.
Newton's Method for KKT:
The KKT system can be viewed as finding zeros of:
$$F(\mathbf{w}, b, \boldsymbol{\alpha}) = \begin{pmatrix} \mathbf{w} - \sum_i \alpha_i y_i \mathbf{x}_i \\ \sum_i \alpha_i y_i \\ \boldsymbol{\alpha} \odot (\mathbf{1} - \mathbf{y} \odot (\mathbf{X}\mathbf{w} + b\mathbf{1})) \end{pmatrix} = \mathbf{0}$$
Newton's method solves this by iterating: $$\mathbf{z}^{k+1} = \mathbf{z}^k - [\nabla F(\mathbf{z}^k)]^{-1} F(\mathbf{z}^k)$$
The challenge: complementary slackness is non-smooth (product of two terms).
Interior Point and KKT:
Interior point methods relax complementary slackness to $\alpha_i s_i = \mu$ (barrier parameter), then drive $\mu \to 0$. At each step, they solve a smoothed version of KKT.
The SMO algorithm uses KKT violations to select which variables to optimize: at each iteration it chooses a pair of multipliers whose KKT conditions are violated, solves the dual analytically over just that pair while holding all other multipliers fixed, and repeats until no violation exceeds the tolerance.
Convergence is guaranteed because every step taken on a violating pair strictly increases the dual objective, which is bounded above.
Convergence Criteria:
Algorithms typically terminate when:
$$\max_i |\text{KKT violation}_i| < \epsilon$$
where the KKT violation for constraint $i$ measures how far the pair $(\alpha_i, \hat{\gamma}_i)$ is from jointly satisfying primal feasibility, dual feasibility, and complementary slackness.
Specific SMO Violation Measure:
For variable $\alpha_i$, the conditions require $\hat{\gamma}_i \geq 1$ when $\alpha_i = 0$ and $\hat{\gamma}_i = 1$ when $\alpha_i > 0$; the violation is the amount by which the relevant requirement fails, e.g. $\max(0,\, 1 - \hat{\gamma}_i)$ for a point with $\alpha_i = 0$ and $|\hat{\gamma}_i - 1|$ for a point with $\alpha_i > 0$.
The variable with highest violation is prioritized for optimization.
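A small illustrative sketch of such a per-point violation measure for the hard margin case (a simplified stand-in for the exact working-set heuristics used inside SMO implementations):

import numpy as np

def kkt_violation(X, y, w, b, alpha, tol=1e-8):
    # Per-point violation of the hard margin KKT conditions (illustrative)
    margins = y * (X @ w + b)
    # alpha == 0: only primal feasibility (margin >= 1) is required
    viol_zero = np.maximum(0.0, 1.0 - margins)
    # alpha > 0: the point must lie exactly on the margin
    viol_pos = np.abs(1.0 - margins)
    return np.where(alpha > tol, viol_pos, viol_zero)

# The point with the largest violation would be selected first:
# i_next = np.argmax(kkt_violation(X, y, w, b, alpha))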
The KKT framework naturally extends to soft margin SVM, which we'll study in the next module.
Soft Margin Primal:
$$\begin{aligned} \min_{\mathbf{w}, b, \boldsymbol{\xi}} \quad & \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^n \xi_i \\ \text{s.t.} \quad & y_i(\mathbf{w}^\top\mathbf{x}_i + b) \geq 1 - \xi_i, \quad i = 1,\ldots,n \\ & \xi_i \geq 0, \quad i = 1,\ldots,n \end{aligned}$$
The slack variables $\xi_i$ allow margin violations, with penalty $C$.
Modified KKT Conditions:
The key difference is an upper bound on each $\alpha_i$. With multipliers $\alpha_i \geq 0$ for the margin constraints and $\mu_i \geq 0$ for $\xi_i \geq 0$, stationarity in $\xi_i$ gives $C - \alpha_i - \mu_i = 0$, hence $0 \leq \alpha_i \leq C$. Complementary slackness now applies to both constraint sets: $\alpha_i\left[1 - \xi_i - y_i(\mathbf{w}^\top\mathbf{x}_i + b)\right] = 0$ and $(C - \alpha_i)\,\xi_i = 0$. Stationarity in $\mathbf{w}$ and $b$ is unchanged.
In soft margin SVM, $\alpha_i \in [0, C]$ instead of just $\alpha_i \geq 0$. This bounded region is called a "box constraint." Points with $\alpha_i = C$ are margin violators—they pay the maximum penalty.
Classification of Points (Soft Margin):
| $\alpha_i$ Value | Condition | Classification |
|---|---|---|
| $\alpha_i = 0$ | $\hat{\gamma}_i > 1$ | Correctly classified, beyond margin |
| $0 < \alpha_i < C$ | $\hat{\gamma}_i = 1$ | Support vector on margin |
| $\alpha_i = C$ | $\hat{\gamma}_i \leq 1$ | Margin violator (possibly misclassified) |
This three-way classification (vs. two-way for hard margin) reflects the additional flexibility of soft margin.
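Looking ahead, a sketch of that three-way classification in code (variable names and tolerance are illustrative):

import numpy as np

def classify_soft_margin(alpha, C, tol=1e-8):
    # Classify points by where their multiplier sits in the box [0, C]
    labels = np.empty(len(alpha), dtype=object)
    labels[alpha <= tol] = "non-SV (beyond margin)"
    labels[(alpha > tol) & (alpha < C - tol)] = "free SV (on margin)"
    labels[alpha >= C - tol] = "bound SV (margin violator)"
    return labels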
The Regularization-Margin Tradeoff:
$C$ controls the tradeoff: a large $C$ penalizes slack heavily, tolerating few margin violations and behaving much like hard margin; a small $C$ accepts more violations in exchange for a larger margin and stronger regularization.
KKT conditions with $C$ show how this tradeoff is encoded mathematically.
The KKT conditions complete our theoretical treatment of hard margin SVM. They encapsulate everything we've learned into a single, unified framework.
You have now completed the theoretical foundations of Hard Margin SVM! You understand:
✓ The optimization formulation and why it's tractable
✓ Quadratic programming and solution methods
✓ Lagrangian duality and the dual problem derivation
✓ Strong duality and support vector identification
✓ Complete KKT conditions for optimality
The next module covers Soft Margin SVM, extending these ideas to handle non-separable data through slack variables and the regularization parameter $C$.
The Complete Hard Margin SVM Theory:
$$\boxed{\begin{aligned} &\text{Primal: } \min \frac{1}{2}\|\mathbf{w}\|^2 \text{ s.t. } y_i(\mathbf{w}^\top\mathbf{x}_i + b) \geq 1 \\ &\text{Dual: } \max \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i^\top\mathbf{x}_j \text{ s.t. } \alpha_i \geq 0, \ \sum_i \alpha_i y_i = 0 \\ &\text{KKT: Stationarity, Feasibility, Complementary Slackness} \\ &\text{Solution: } \mathbf{w}^* = \sum_{i \in SV} \alpha_i^* y_i \mathbf{x}_i, \quad b^* = y_k - \mathbf{w}^{*\top}\mathbf{x}_k \text{ for } k \in SV \end{aligned}}$$
This elegant mathematical structure—primal, dual, KKT—is the foundation upon which all practical SVM implementations are built.