Lagrange multipliers elegantly handle equality constraints like $g(\mathbf{x}) = 0$. But most real-world optimization problems involve inequalities: budgets that cannot be exceeded ($\text{cost} \leq \text{budget}$), physical limits that must be respected ($\text{pressure} \leq \text{max}$), or mathematical constraints that define feasibility ($y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1$ in SVMs).
The Karush-Kuhn-Tucker (KKT) conditions generalize Lagrange multipliers to this broader setting. Named after Harold Kuhn and Albert Tucker (with earlier work by William Karush), these conditions form the theoretical backbone of modern constrained optimization—from interior point methods to the mathematical foundations of Support Vector Machines.
By the end of this page, you will understand KKT conditions completely: what they are, why they're necessary for optimality, how to apply them, and what complementary slackness means. You'll see how these conditions directly connect to the sparsity of support vectors in SVMs and the geometry of constrained optimization.
The general nonlinear programming problem can be written as:
$$\min_{\mathbf{x}} f(\mathbf{x})$$ $$\text{subject to:}$$ $$g_i(\mathbf{x}) \leq 0, \quad i = 1, \ldots, m \quad \text{(inequality constraints)}$$ $$h_j(\mathbf{x}) = 0, \quad j = 1, \ldots, p \quad \text{(equality constraints)}$$
where:
- $f(\mathbf{x})$ is the objective function to minimize
- $g_i(\mathbf{x}) \leq 0$ are the $m$ inequality constraints
- $h_j(\mathbf{x}) = 0$ are the $p$ equality constraints
Why $g_i(\mathbf{x}) \leq 0$ Convention?
This form is standard. Any constraint can be rewritten to fit:
- $g(\mathbf{x}) \geq 0$ becomes $-g(\mathbf{x}) \leq 0$
- $g(\mathbf{x}) \leq c$ becomes $g(\mathbf{x}) - c \leq 0$
- $a \leq g(\mathbf{x}) \leq b$ becomes the pair $a - g(\mathbf{x}) \leq 0$ and $g(\mathbf{x}) - b \leq 0$
An inequality constraint $g_i(\mathbf{x}) \leq 0$ is active (or binding) at a point $\mathbf{x}^*$ if $g_i(\mathbf{x}^*) = 0$. It's inactive (or slack) if $g_i(\mathbf{x}^*) < 0$. This distinction is crucial: active constraints restrict the solution; inactive constraints don't affect it at all.
Example with two constraints:
- Constraint 1: $x^2 + y^2 - 1 \leq 0$ (unit disk)
- Constraint 2: $-x \leq 0$ (right half-plane, i.e., $x \geq 0$)

Without the half-plane constraint, the optimum would be $(-1/\sqrt{2}, -1/\sqrt{2})$; with $x \geq 0$, the optimum moves to $(0, -1)$. At $(0, -1)$: the disk constraint has $g_1 = 0 + 1 - 1 = 0$ (active), and the $x \geq 0$ constraint has $g_2 = -0 = 0$ (also active). Both constraints are binding at the solution.
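The example above can be checked numerically. The objective was not preserved in the text, so the sketch below assumes $f(x, y) = x + y$, which is consistent with both optima quoted above:

```python
import numpy as np
from scipy.optimize import minimize

# Objective assumed for illustration: f(x, y) = x + y
# (consistent with the unconstrained optimum (-1/sqrt(2), -1/sqrt(2))
#  and the constrained optimum (0, -1) quoted above)
objective = lambda xy: xy[0] + xy[1]

# scipy's 'ineq' convention is fun(x) >= 0, so we negate our g_i <= 0 forms
constraints = [
    {'type': 'ineq', 'fun': lambda xy: 1 - xy[0]**2 - xy[1]**2},  # x^2 + y^2 - 1 <= 0
    {'type': 'ineq', 'fun': lambda xy: xy[0]},                    # -x <= 0
]

res = minimize(objective, x0=[0.5, 0.0], constraints=constraints, method='SLSQP')
x, y = res.x
g1 = x**2 + y**2 - 1   # ~0 means active
g2 = -x                # ~0 means active
print(f"optimum: ({x:.4f}, {y:.4f}),  g1 = {g1:.2e},  g2 = {g2:.2e}")
```

Both constraint values come out numerically zero at the solution, confirming that both are active.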
For the general problem with both inequality and equality constraints, we define the generalized Lagrangian:
$$\mathcal{L}(\mathbf{x}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = f(\mathbf{x}) + \sum_{i=1}^{m} \alpha_i g_i(\mathbf{x}) + \sum_{j=1}^{p} \beta_j h_j(\mathbf{x})$$
where:
- $\alpha_i \geq 0$ are the multipliers for the inequality constraints $g_i(\mathbf{x}) \leq 0$
- $\beta_j$ (unrestricted in sign) are the multipliers for the equality constraints $h_j(\mathbf{x}) = 0$
Critical Difference from Pure Lagrangian:
For inequality constraints, the multipliers $\alpha_i$ must be non-negative. This sign restriction captures the one-sided nature of inequalities and leads to complementary slackness.
With an equality $h = 0$, the constraint 'pushes back' symmetrically in both directions, so β can be positive or negative. With an inequality $g ≤ 0$, the constraint only 'pushes' when you hit the boundary, and always pushes inward—hence α ≥ 0. If α were negative, the constraint would 'pull' you out of the feasible region, violating the inequality.
The Lagrangian Dual Function:
Define the dual function as:
$$d(\boldsymbol{\alpha}, \boldsymbol{\beta}) = \inf_{\mathbf{x}} \mathcal{L}(\mathbf{x}, \boldsymbol{\alpha}, \boldsymbol{\beta})$$
This function is obtained by minimizing the Lagrangian over $\mathbf{x}$ for fixed multipliers. The dual function provides lower bounds on the optimal value of the original (primal) problem when $\boldsymbol{\alpha} \geq 0$.
This duality perspective—optimizing over multipliers rather than original variables—is fundamental to understanding SVMs and will be developed fully in later pages.
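To make the lower-bound property concrete, here is a small sketch on a toy problem of our own choosing: minimize $x^2$ subject to $x \geq 1$, i.e. $g(x) = 1 - x \leq 0$. Minimizing the Lagrangian $\mathcal{L}(x, \alpha) = x^2 + \alpha(1 - x)$ over $x$ gives $x = \alpha/2$, so the dual function has the closed form $d(\alpha) = \alpha - \alpha^2/4$:

```python
# Toy problem: min x^2  s.t.  g(x) = 1 - x <= 0  (primal optimum: x* = 1, f* = 1)
# Lagrangian: L(x, a) = x^2 + a*(1 - x); argmin_x L is x = a/2.
def dual(a):
    x_min = a / 2.0                      # minimizer of L(x, a) over x
    return x_min**2 + a * (1 - x_min)    # d(a) = a - a^2/4

f_star = 1.0
for a in [0.0, 0.5, 1.0, 2.0, 3.0]:
    print(f"alpha = {a:3.1f}:  d(alpha) = {dual(a):6.4f}  <=  f* = {f_star}")
# Every d(alpha) with alpha >= 0 is a lower bound on f*; the bound is tight at alpha = 2.
```

Maximizing $d(\alpha)$ over $\alpha \geq 0$ recovers the primal optimal value exactly, a preview of the strong duality discussed later on this page.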
At a local minimum $\mathbf{x}^*$ (subject to constraint qualification), there exist multipliers $\boldsymbol{\alpha}^* \geq 0$ and $\boldsymbol{\beta}^*$ such that:
1. Stationarity: $$\nabla_\mathbf{x} \mathcal{L}(\mathbf{x}^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*) = 0$$ $$\Leftrightarrow \quad \nabla f(\mathbf{x}^*) + \sum_{i=1}^{m} \alpha_i^* \nabla g_i(\mathbf{x}^*) + \sum_{j=1}^{p} \beta_j^* \nabla h_j(\mathbf{x}^*) = 0$$
2. Primal Feasibility: $$g_i(\mathbf{x}^*) \leq 0 \quad \forall i = 1, \ldots, m$$ $$h_j(\mathbf{x}^*) = 0 \quad \forall j = 1, \ldots, p$$
3. Dual Feasibility: $$\alpha_i^* \geq 0 \quad \forall i = 1, \ldots, m$$
4. Complementary Slackness: $$\alpha_i^* g_i(\mathbf{x}^*) = 0 \quad \forall i = 1, \ldots, m$$
| Condition | Mathematical Form | Interpretation |
|---|---|---|
| Stationarity | $\nabla_{\mathbf{x}} \mathcal{L} = 0$ | Gradient of Lagrangian vanishes at optimum |
| Primal Feasibility | $g_i \leq 0, h_j = 0$ | Solution satisfies all constraints |
| Dual Feasibility | $\alpha_i \geq 0$ | Inequality multipliers are non-negative |
| Complementary Slackness | $\alpha_i g_i = 0$ | Either constraint is active OR multiplier is zero |
A point satisfying KKT is a candidate for optimality, not necessarily an actual optimum (could be a saddle point). For convex problems (convex f, convex g's, affine h's), KKT conditions become both necessary AND sufficient—this is why convexity is so powerful.
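The four conditions in the table can be checked mechanically for any candidate point. Below is a minimal sketch of such a checker (a hypothetical helper, restricted to inequality constraints for brevity), applied to the worked example solved later on this page, $\min x^2 + y^2$ subject to $1 - x - y \leq 0$:

```python
import numpy as np

def check_kkt(grad_f, gs, grad_gs, x, alphas, tol=1e-6):
    """Numerically check the four KKT conditions (inequality constraints only).

    grad_f:  callable returning the gradient of the objective
    gs:      list of callables g_i (constraints in the form g_i(x) <= 0)
    grad_gs: list of callables returning the gradients of the g_i
    """
    x = np.asarray(x, dtype=float)
    stat = grad_f(x) + sum(a * dg(x) for a, dg in zip(alphas, grad_gs))
    return {
        'stationarity':    np.linalg.norm(stat) < tol,
        'primal_feasible': all(g(x) <= tol for g in gs),
        'dual_feasible':   all(a >= -tol for a in alphas),
        'comp_slackness':  all(abs(a * g(x)) < tol for a, g in zip(alphas, gs)),
    }

# Candidate: x = y = 1/2 with alpha = 1 for min x^2 + y^2 s.t. 1 - x - y <= 0
report = check_kkt(
    grad_f=lambda x: 2 * x,
    gs=[lambda x: 1 - x[0] - x[1]],
    grad_gs=[lambda x: np.array([-1.0, -1.0])],
    x=[0.5, 0.5],
    alphas=[1.0],
)
print(report)
```

All four conditions come back `True` for this candidate; for a non-convex problem, this would still only certify it as a candidate, not a global optimum.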
Complementary slackness is perhaps the most subtle and important of the KKT conditions. It states:
$$\alpha_i^* g_i(\mathbf{x}^*) = 0 \quad \text{for each } i$$
Since $\alpha_i \geq 0$ and $g_i \leq 0$, the product $\alpha_i g_i$ is always non-positive; complementary slackness forces it to equal zero exactly. But how it's zero reveals crucial information:
Case 1: Constraint is inactive ($g_i(\mathbf{x}^*) < 0$). Then complementary slackness forces $\alpha_i^* = 0$: the constraint exerts no force on the solution and could be deleted without changing the optimum.

Case 2: Constraint is active ($g_i(\mathbf{x}^*) = 0$). Then $\alpha_i^*$ may be positive: the constraint is pushing against the objective, and the size of $\alpha_i^*$ measures how hard.
The name comes from the complementary relationship: either the constraint is active (g = 0) or the multiplier is zero (α = 0), but never both non-zero. They 'complement' each other like binary switches—when one is on, the other is off. We can also have both zero, but having both non-zero is impossible.
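The two branches can be seen side by side on a pair of toy problems of our own construction, sharing the constraint $x \geq 1$ (i.e. $g(x) = 1 - x \leq 0$) but with different objectives:

```python
import numpy as np
from scipy.optimize import minimize

# Same constraint for both problems; scipy's 'ineq' convention is fun(x) >= 0.
cons = {'type': 'ineq', 'fun': lambda x: x[0] - 1}   # x >= 1

# Case 1: min (x - 2)^2. The unconstrained optimum x = 2 already satisfies
# x >= 1, so the constraint is inactive (g < 0) and alpha must be 0.
r1 = minimize(lambda x: (x[0] - 2) ** 2, x0=[0.0], constraints=cons, method='SLSQP')
g_1 = 1 - r1.x[0]
alpha_1 = 2 * (r1.x[0] - 2)    # stationarity: 2(x - 2) - alpha = 0, since g'(x) = -1

# Case 2: min x^2. The unconstrained optimum x = 0 violates x >= 1, so the
# solution sits on the boundary: g = 0 is active and alpha > 0.
r2 = minimize(lambda x: x[0] ** 2, x0=[2.0], constraints=cons, method='SLSQP')
g_2 = 1 - r2.x[0]
alpha_2 = 2 * r2.x[0]          # stationarity: 2x - alpha = 0

print(f"Case 1: x = {r1.x[0]:.3f}, g = {g_1:+.3f} (inactive), alpha = {alpha_1:+.3f}")
print(f"Case 2: x = {r2.x[0]:.3f}, g = {g_2:+.3f} (active),   alpha = {alpha_2:+.3f}")
print(f"alpha * g: {alpha_1 * g_1:.2e} and {alpha_2 * g_2:.2e}")
```

In both cases the product $\alpha \cdot g$ is numerically zero, but for opposite reasons: in Case 1 the multiplier vanishes, in Case 2 the constraint does.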
SVM Connection:
In Support Vector Machines, complementary slackness explains support vectors:
- Points with $\alpha_i = 0$: the margin constraint is inactive; these points lie strictly outside the margin and have no influence on the decision boundary.
- Points with $\alpha_i > 0$: the constraint $y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1$ is active; these points lie exactly on the margin. They are the support vectors.
This sparsity isn't an algorithmic trick—it's a direct consequence of KKT conditions!
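You can observe this sparsity directly by training a linear SVM on well-separated data; the sketch below assumes scikit-learn is available (its `support_` attribute lists exactly the training points with $\alpha_i > 0$):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two well-separated Gaussian blobs, 50 points each
X = np.vstack([rng.normal(-2, 0.5, (50, 2)), rng.normal(2, 0.5, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel='linear', C=1.0).fit(X, y)
print(f"training points: {len(X)}")
print(f"support vectors (alpha_i > 0): {len(clf.support_)}")
# Most alpha_i are exactly zero -- complementary slackness in action.
```

Only a handful of the 100 points end up with nonzero multipliers; removing any of the others would leave the decision boundary unchanged.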
The stationarity condition has a beautiful geometric meaning. Let's visualize it.
At a Constrained Minimum:
The stationarity condition says:
$$-\nabla f(\mathbf{x}^*) = \sum_i \alpha_i^* \nabla g_i(\mathbf{x}^*) + \sum_j \beta_j^* \nabla h_j(\mathbf{x}^*)$$
Interpretation: the direction of steepest descent, $-\nabla f$, is a non-negative combination of the active inequality constraint gradients (plus an arbitrary combination of the equality constraint gradients). The constraints "push back" with exactly the force needed to cancel the objective's pull.
Why This Must Be True:
If $-\nabla f$ had any component along the constraint surface (tangent to active constraints), we could move in that direction, stay feasible, and decrease $f$—contradicting optimality.
So at the optimum, every direction that decreases $f$ must violate some constraint. The descent direction is "blocked" by the active constraints.
The active constraint gradients define a 'cone' of forbidden directions. At a constrained minimum, the negative gradient of the objective must lie inside this cone. If it pointed outside, there would be a feasible descent direction, and we wouldn't be at a minimum.
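This balance can be verified numerically. Using the problem $\min x^2 + y^2$ subject to $g(x, y) = 1 - x - y \leq 0$ (solved in full later on this page), with optimum $(1/2, 1/2)$ and $\alpha^* = 1$:

```python
import numpy as np

# Gradient balance at the optimum of: min x^2 + y^2  s.t.  g = 1 - x - y <= 0
x_star = np.array([0.5, 0.5])
alpha_star = 1.0

grad_f = 2 * x_star                  # gradient of x^2 + y^2: [1, 1]
grad_g = np.array([-1.0, -1.0])      # gradient of g = 1 - x - y

lhs = -grad_f                        # steepest descent direction
rhs = alpha_star * grad_g            # the constraint's "force"
print("-grad f =", lhs, "  alpha * grad g =", rhs)
print("balanced:", np.allclose(lhs, rhs))
```

The descent direction is exactly a non-negative multiple of the single active constraint gradient, so no feasible descent direction remains.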
Comparison with Unconstrained Optimization:
| Unconstrained Minimum | Constrained Minimum (KKT) |
|---|---|
| $\nabla f = 0$ | $\nabla f = -\sum \alpha_i \nabla g_i - \sum \beta_j \nabla h_j$ |
| Gradient vanishes | Gradient balanced by constraint forces |
| Stationary in all directions | Stationary only in feasible directions |
| No constraint "pushback" | Active constraints "push" against descent |
The key insight: at a constrained optimum, you're not at a critical point of $f$ — rather, the gradient of $f$ is exactly balanced by the "forces" from active constraints.
Let's develop a systematic approach to solving constrained optimization problems using KKT conditions.
General Strategy:
1. Write the Lagrangian with multipliers $\alpha_i \geq 0$ for inequalities and $\beta_j$ for equalities
2. Write all four KKT conditions: stationarity, primal feasibility, dual feasibility, and complementary slackness
3. Case analysis on complementary slackness: for each inequality constraint, consider both cases ($\alpha_i = 0$ or $g_i = 0$) and solve the resulting system
4. Check solutions: verify primal and dual feasibility, discard invalid cases
5. Compare objectives: evaluate $f$ at all valid KKT points to find the global optimum
Worked example: minimize $f(x, y) = x^2 + y^2$ subject to $x + y \geq 1$.

- Rewrite the constraint as $g(x, y) = 1 - x - y \leq 0$
- Optimal solution: $x = y = 1/2$ with $\alpha = 1$

We'll solve this step-by-step in the code below, showing how complementary slackness identifies whether the constraint is active.
```python
import numpy as np
from scipy.optimize import minimize


def kkt_worked_example():
    """
    Minimize f(x,y) = x² + y² subject to x + y ≥ 1

    Standard form: min x² + y²  s.t.  g(x,y) = 1 - x - y ≤ 0
    Lagrangian:    L(x,y,α) = x² + y² + α(1 - x - y)

    KKT Conditions:
    1. Stationarity:            ∂L/∂x = 2x - α = 0 → x = α/2
                                ∂L/∂y = 2y - α = 0 → y = α/2
    2. Primal Feasibility:      1 - x - y ≤ 0 → x + y ≥ 1
    3. Dual Feasibility:        α ≥ 0
    4. Complementary Slackness: α(1 - x - y) = 0

    Case Analysis:
    Case 1: α = 0
        Then x = y = 0
        Check primal: 0 + 0 = 0 ≥ 1? NO!
        This case violates primal feasibility. INVALID.
    Case 2: α > 0 (complementary slackness forces g = 0)
        Then 1 - x - y = 0 → x + y = 1
        From stationarity: x = y = α/2
        So: α/2 + α/2 = 1 → α = 1
        Thus: x = y = 1/2, α = 1
        Check all conditions:
        - Stationarity: 2(1/2) - 1 = 0 ✓
        - Primal:       1/2 + 1/2 = 1 ✓
        - Dual:         α = 1 ≥ 0 ✓
        - Comp. Slack.: 1 · 0 = 0 ✓
        ALL CONDITIONS SATISFIED!
    """
    print("KKT Worked Example")
    print("=" * 60)
    print("Problem: min x² + y² subject to x + y ≥ 1")
    print()

    # Analytical solution from KKT
    x_opt, y_opt = 0.5, 0.5
    alpha_opt = 1.0
    f_opt = x_opt**2 + y_opt**2

    print("Analytical Solution (KKT):")
    print(f"  Optimal point: ({x_opt}, {y_opt})")
    print(f"  Optimal value: f* = {f_opt}")
    print(f"  KKT multiplier: α* = {alpha_opt}")
    print()

    # Verify KKT conditions
    print("KKT Condition Verification:")
    print(f"  1. Stationarity: 2x - α = 2({x_opt}) - {alpha_opt} = {2*x_opt - alpha_opt}")
    print(f"                   2y - α = 2({y_opt}) - {alpha_opt} = {2*y_opt - alpha_opt}")
    print(f"  2. Primal Feas:  x + y = {x_opt + y_opt} ≥ 1 ✓")
    print(f"  3. Dual Feas:    α = {alpha_opt} ≥ 0 ✓")
    constraint_value = 1 - x_opt - y_opt
    print(f"  4. Comp. Slack:  α·g = {alpha_opt}·{constraint_value} = {alpha_opt * constraint_value} ✓")
    print()

    # Numerical verification
    def objective(xy):
        return xy[0]**2 + xy[1]**2

    def constraint(xy):
        return xy[0] + xy[1] - 1  # x + y >= 1

    result = minimize(
        objective,
        x0=[0, 0],
        constraints={'type': 'ineq', 'fun': constraint},
        method='SLSQP'
    )
    print("Numerical Verification (scipy):")
    print(f"  Optimal point: ({result.x[0]:.6f}, {result.x[1]:.6f})")
    print(f"  Optimal value: {result.fun:.6f}")

    return x_opt, y_opt, alpha_opt


kkt_worked_example()
```

For KKT conditions to be necessary at a local minimum, a constraint qualification must hold. Several constraint qualifications exist, each with different strengths:
Linear Independence Constraint Qualification (LICQ):
The gradients of all active constraints are linearly independent at $\mathbf{x}^*$.
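LICQ is a rank condition, so it can be checked mechanically. A minimal sketch at the point $(0, -1)$ from the earlier two-constraint example ($g_1 = x^2 + y^2 - 1 \leq 0$ and $g_2 = -x \leq 0$, both active there):

```python
import numpy as np

# LICQ check at (0, -1): stack the gradients of the active constraints
# and test whether they are linearly independent (full row rank).
x_star = np.array([0.0, -1.0])

grad_g1 = np.array([2 * x_star[0], 2 * x_star[1]])   # grad of x^2 + y^2 - 1: [0, -2]
grad_g2 = np.array([-1.0, 0.0])                      # grad of -x

A = np.vstack([grad_g1, grad_g2])
licq_holds = np.linalg.matrix_rank(A) == A.shape[0]
print(f"active-constraint gradients:\n{A}\nLICQ holds: {licq_holds}")
```

Here the two gradients point along different axes, so they are independent and LICQ holds; the KKT multipliers at this point are therefore unique.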
Slater's Condition (for convex problems):
There exists a strictly feasible point $\bar{\mathbf{x}}$ such that:
- $g_i(\bar{\mathbf{x}}) < 0$ for all $i = 1, \ldots, m$ (every inequality holds strictly)
- $h_j(\bar{\mathbf{x}}) = 0$ for all $j = 1, \ldots, p$
Slater's condition is particularly important because it's easy to verify and commonly satisfied in practice.
| CQ Name | Requirement | When Used |
|---|---|---|
| LICQ | Active constraint gradients linearly independent | Most general; ensures unique multipliers |
| MFCQ | Constraint gradients span feasible directions | Weaker than LICQ; allows non-unique multipliers |
| Slater | Strictly feasible interior point exists | Convex problems; key for strong duality |
| Linearity | All constraints are affine | LP and QP; always satisfied if feasible |
In machine learning, constraints are often linear (e.g., sum-to-one, budget constraints) or nice convex functions (e.g., norm bounds). These almost always satisfy Slater's condition—a strictly interior point exists. So for practical ML optimization, you can usually trust that KKT conditions are both necessary and identify the global optimum.
The power of KKT conditions is amplified dramatically for convex optimization problems.
Definition: Convex Optimization Problem
A problem is convex if:
- the objective $f$ is convex
- all inequality constraint functions $g_i$ are convex
- all equality constraint functions $h_j$ are affine
The Fundamental Theorem:
For convex problems satisfying Slater's condition:
KKT conditions are both necessary AND sufficient for global optimality.
This means:
- any point satisfying the KKT conditions is a global minimum
- conversely, every global minimum satisfies the KKT conditions
- there is no gap between "candidate" and "solution"
Why This Matters:
For convex problems, finding a KKT point = solving the optimization problem. There's no need to check second-order conditions or compare multiple local minima. This is what makes convex optimization so tractable.
Linear regression, logistic regression, SVMs (both primal and dual), regularized least squares, and many other core ML methods are convex optimization problems. This convexity guarantees that gradient-based methods find global optima, and KKT analysis reveals the structure of solutions.
Strong Duality and KKT:
For convex problems with Slater's condition:
$$f(\mathbf{x}^*) = \mathcal{L}(\mathbf{x}^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*) = d(\boldsymbol{\alpha}^*, \boldsymbol{\beta}^*)$$
where $d$ is the dual function. The primal optimal value equals the dual optimal value—there's no "duality gap." This is called strong duality, and it enables us to solve the dual problem instead of the primal, often with significant computational benefits.
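Strong duality can be checked by hand for the worked example on this page, $\min x^2 + y^2$ subject to $x + y \geq 1$. Minimizing the Lagrangian $\mathcal{L} = x^2 + y^2 + \alpha(1 - x - y)$ over $(x, y)$ gives $x = y = \alpha/2$, so the dual function is $d(\alpha) = \alpha - \alpha^2/2$ in closed form:

```python
import numpy as np

# Dual function of: min x^2 + y^2  s.t.  1 - x - y <= 0.
# With x = y = a/2 minimizing the Lagrangian, d(a) = a - a^2/2.
dual = lambda a: a - a**2 / 2

alphas = np.linspace(0, 3, 301)          # grid search over alpha >= 0
a_star = alphas[np.argmax(dual(alphas))]
d_star = dual(a_star)
f_star = 0.5**2 + 0.5**2                 # primal optimum at (1/2, 1/2)

print(f"dual optimum:   d* = {d_star:.4f} at alpha = {a_star:.2f}")
print(f"primal optimum: f* = {f_star:.4f}")
print(f"duality gap:    {abs(f_star - d_star):.2e}")
```

The dual maximum $d^* = 1/2$ at $\alpha^* = 1$ matches the primal optimum exactly: zero duality gap, as strong duality promises.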
We've developed the complete KKT framework for handling inequality constraints. Let's consolidate:
- The generalized Lagrangian adds multipliers $\alpha_i \geq 0$ for inequalities and $\beta_j$ for equalities
- The four KKT conditions: stationarity, primal feasibility, dual feasibility, and complementary slackness
- Complementary slackness: each inequality constraint is either active or has a zero multiplier, never both nonzero
- Constraint qualifications (LICQ, Slater's condition) guarantee the conditions are necessary
- For convex problems satisfying Slater's condition, KKT is both necessary and sufficient for global optimality
What's Next:
KKT conditions establish when a point is optimal. But they also hint at a profound perspective: instead of optimizing over $\mathbf{x}$, what if we optimize over the multipliers $\boldsymbol{\alpha}, \boldsymbol{\beta}$? This leads to duality theory—the subject of our next page. Duality reveals hidden structure in optimization problems and is the key to understanding why SVMs are so elegant.
You now understand the KKT conditions—the generalization of Lagrange multipliers to inequality-constrained problems. You can formulate KKT conditions, solve problems using case analysis on complementary slackness, and appreciate their special power for convex optimization. Next, we explore the dual perspective that makes this framework even more powerful.