We've established the theoretical foundations of duality: weak duality, strong duality, and the profound insight that dual problems are always convex. Now comes the practical question: how do we actually derive dual problems for specific cases?
This page develops the systematic techniques for dual formulation, focusing on problem classes central to machine learning: linear programs, quadratic programs, and the support vector machine.
Mastering these derivations isn't just an academic exercise: it reveals why SVMs work with kernels, why certain algorithms are efficient, and how to exploit problem structure for computational advantage.
By the end of this page, you will be able to derive dual problems for linear and quadratic programs, understand the general procedure, and see how the SVM dual emerges naturally—setting the stage for understanding kernel methods and support vector structure.
Deriving a dual problem follows a consistent pattern. Let's formalize it:
Step 1: Write the Primal in Standard Form
$$\min_{\mathbf{x}} f(\mathbf{x}) \quad \text{s.t.} \quad g_i(\mathbf{x}) \leq 0, \quad h_j(\mathbf{x}) = 0$$
Step 2: Form the Lagrangian
$$\mathcal{L}(\mathbf{x}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = f(\mathbf{x}) + \sum_i \alpha_i g_i(\mathbf{x}) + \sum_j \beta_j h_j(\mathbf{x})$$
with $\alpha_i \geq 0$.
Step 3: Compute the Dual Function
$$d(\boldsymbol{\alpha}, \boldsymbol{\beta}) = \inf_{\mathbf{x}} \mathcal{L}(\mathbf{x}, \boldsymbol{\alpha}, \boldsymbol{\beta})$$
This requires minimizing over $\mathbf{x}$ for fixed multipliers. For many important cases, this has a closed-form solution.
Step 4: Identify Dual Constraints
The set of $(\boldsymbol{\alpha}, \boldsymbol{\beta})$ where $d$ is finite defines implicit dual constraints, in addition to the explicit requirement $\boldsymbol{\alpha} \geq 0$.
The hardest step is Step 3: computing $\inf_{\mathbf{x}} \mathcal{L}$. Since this inner minimization is unconstrained in $\mathbf{x}$, set $\nabla_{\mathbf{x}} \mathcal{L} = 0$ and solve. The result expresses $\mathbf{x}$ in terms of $(\boldsymbol{\alpha}, \boldsymbol{\beta})$; substitute back into $\mathcal{L}$ to get $d(\boldsymbol{\alpha}, \boldsymbol{\beta})$. Watch for: (1) cases where $\mathcal{L}$ is unbounded below, giving $d = -\infty$ for those $(\boldsymbol{\alpha}, \boldsymbol{\beta})$, and (2) implicit constraints that emerge from requiring the infimum to be finite. A toy worked example follows the six steps below.
Step 5: State the Dual Problem
$$\max_{\boldsymbol{\alpha}, \boldsymbol{\beta}} d(\boldsymbol{\alpha}, \boldsymbol{\beta}) \quad \text{s.t.} \quad \boldsymbol{\alpha} \geq 0, \quad \text{(+ implicit constraints)}$$
Step 6: Simplify and Interpret
Often the dual can be written more elegantly by eliminating auxiliary multipliers or by collecting terms into matrix form, as the examples below illustrate.
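To make Step 3 concrete, here is a minimal sketch of the full procedure on a toy problem, $\min_x x^2$ subject to $x \geq 1$; the symbolic approach via sympy is just one convenient choice for this illustration.

```python
import sympy as sp

# Toy primal: min x^2  s.t.  1 - x <= 0   (i.e. x >= 1)
x, alpha = sp.symbols('x alpha', real=True)

# Step 2: Lagrangian L(x, alpha) = x^2 + alpha*(1 - x), with alpha >= 0
L = x**2 + alpha * (1 - x)

# Step 3: minimize over x by setting dL/dx = 0 (L is convex in x)
x_star = sp.solve(sp.diff(L, x), x)[0]          # x*(alpha) = alpha/2
d = sp.simplify(L.subs(x, x_star))              # d(alpha) = alpha - alpha^2/4

# Step 5: maximize d(alpha) over alpha >= 0
alpha_star = sp.solve(sp.diff(d, alpha), alpha)[0]   # alpha* = 2
print(f"x*(alpha) = {x_star},  d(alpha) = {d}")
print(f"alpha* = {alpha_star},  d(alpha*) = {d.subs(alpha, alpha_star)}")
# Primal optimum: x = 1 with value 1, so the duality gap is zero here
```

The dual function $d(\alpha) = \alpha - \alpha^2/4$ is maximized at $\alpha^* = 2$ with value $1$, matching the primal optimum $x^* = 1$.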
Linear programming is the simplest case and illustrates the procedure beautifully.
Primal LP (Standard Form):
$$\min_{\mathbf{x}} \mathbf{c}^T \mathbf{x} \quad \text{s.t.} \quad A\mathbf{x} = \mathbf{b}, \quad \mathbf{x} \geq 0$$
Step 1: Rewrite constraints
Write the equality constraint as $\mathbf{b} - A\mathbf{x} = \mathbf{0}$ with multiplier $\boldsymbol{\nu}$ (unrestricted in sign), and $\mathbf{x} \geq 0$ as $-\mathbf{x} \leq 0$ with multiplier $\boldsymbol{\lambda} \geq 0$. With $A \in \mathbb{R}^{m \times n}$, we have $\boldsymbol{\nu} \in \mathbb{R}^m$ and $\boldsymbol{\lambda} \in \mathbb{R}^n$.
Step 2: Lagrangian
$$\mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\nu}) = \mathbf{c}^T\mathbf{x} + \boldsymbol{\nu}^T(\mathbf{b} - A\mathbf{x}) - \boldsymbol{\lambda}^T \mathbf{x}$$ $$= \mathbf{c}^T\mathbf{x} + \boldsymbol{\nu}^T\mathbf{b} - \boldsymbol{\nu}^T A\mathbf{x} - \boldsymbol{\lambda}^T \mathbf{x}$$ $$= \boldsymbol{\nu}^T\mathbf{b} + (\mathbf{c} - A^T\boldsymbol{\nu} - \boldsymbol{\lambda})^T \mathbf{x}$$
Step 3: Minimize over $\mathbf{x}$
The Lagrangian is linear in $\mathbf{x}$: $$\mathcal{L} = \boldsymbol{\nu}^T\mathbf{b} + \underbrace{(\mathbf{c} - A^T\boldsymbol{\nu} - \boldsymbol{\lambda})^T}_{\text{coefficient}} \mathbf{x}$$
If the coefficient of $\mathbf{x}$ is nonzero, we can make $\mathcal{L} \to -\infty$ by taking $\mathbf{x}$ large in the right direction.
So $d(\boldsymbol{\lambda}, \boldsymbol{\nu})$ is finite only when: $$\mathbf{c} - A^T\boldsymbol{\nu} - \boldsymbol{\lambda} = 0 \quad \Rightarrow \quad A^T\boldsymbol{\nu} + \boldsymbol{\lambda} = \mathbf{c}$$
When this holds: $d(\boldsymbol{\lambda}, \boldsymbol{\nu}) = \boldsymbol{\nu}^T\mathbf{b}$
Step 4 & 5: Dual Problem
Maximize $\boldsymbol{\nu}^T\mathbf{b}$ subject to $A^T\boldsymbol{\nu} + \boldsymbol{\lambda} = \mathbf{c}$ and $\boldsymbol{\lambda} \geq 0$.
Eliminating $\boldsymbol{\lambda}$: $\boldsymbol{\lambda} = \mathbf{c} - A^T\boldsymbol{\nu} \geq 0$ means $A^T\boldsymbol{\nu} \leq \mathbf{c}$.
Dual LP:
$$\max_{\boldsymbol{\nu}} \mathbf{b}^T \boldsymbol{\nu} \quad \text{s.t.} \quad A^T \boldsymbol{\nu} \leq \mathbf{c}$$
This is the classic LP duality result!
| Primal | Dual |
|---|---|
| min $\mathbf{c}^T\mathbf{x}$ | max $\mathbf{b}^T\boldsymbol{\nu}$ |
| $A\mathbf{x} = \mathbf{b}$ | $\boldsymbol{\nu}$ unrestricted |
| $\mathbf{x} \geq 0$ | $A^T\boldsymbol{\nu} \leq \mathbf{c}$ |
| $n$ variables | $m$ variables |
| $m$ equality constraints | $n$ inequality constraints |
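To sanity-check this result numerically, here is a small sketch, assuming scipy is available and using arbitrary made-up data for $A$, $\mathbf{b}$, $\mathbf{c}$.

```python
import numpy as np
from scipy.optimize import linprog

# Small primal LP:  min c^T x  s.t.  A x = b,  x >= 0
# (data chosen arbitrarily so that both problems are feasible and bounded)
A = np.array([[1.0, 1.0, 1.0],
              [1.0, 2.0, 0.0]])
b = np.array([4.0, 5.0])
c = np.array([2.0, 3.0, 1.0])

# Primal: linprog minimizes by default
primal = linprog(c, A_eq=A, b_eq=b, bounds=[(0, None)] * 3, method='highs')

# Dual:  max b^T nu  s.t.  A^T nu <= c,  nu free
# -> minimize -b^T nu with inequality constraints A^T nu <= c
dual = linprog(-b, A_ub=A.T, b_ub=c, bounds=[(None, None)] * 2, method='highs')

print(f"Primal optimal value c^T x*  = {primal.fun:.6f}")
print(f"Dual optimal value   b^T nu* = {-dual.fun:.6f}")  # negate back
# For feasible, bounded LPs strong duality guarantees the two values match
```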
Quadratic programs (QP) are central to SVMs, ridge regression, and many ML applications.
Primal QP (Standard Form):
$$\min_{\mathbf{x}} \frac{1}{2}\mathbf{x}^T Q \mathbf{x} + \mathbf{c}^T\mathbf{x}$$ $$\text{s.t.} \quad A\mathbf{x} \leq \mathbf{b}$$
where $Q$ is symmetric positive semi-definite (for convexity).
Step 1 & 2: Lagrangian
$$\mathcal{L}(\mathbf{x}, \boldsymbol{\alpha}) = \frac{1}{2}\mathbf{x}^T Q \mathbf{x} + \mathbf{c}^T\mathbf{x} + \boldsymbol{\alpha}^T(A\mathbf{x} - \mathbf{b})$$
with $\boldsymbol{\alpha} \geq 0$.
Rearranging: $$\mathcal{L} = \frac{1}{2}\mathbf{x}^T Q \mathbf{x} + (\mathbf{c} + A^T\boldsymbol{\alpha})^T\mathbf{x} - \boldsymbol{\alpha}^T\mathbf{b}$$
Step 3: Minimize over $\mathbf{x}$
For positive definite $Q$, the Lagrangian is strictly convex in $\mathbf{x}$. Set:
$$\nabla_{\mathbf{x}} \mathcal{L} = Q\mathbf{x} + \mathbf{c} + A^T\boldsymbol{\alpha} = 0$$
Solving: $$\mathbf{x}^* = -Q^{-1}(\mathbf{c} + A^T\boldsymbol{\alpha})$$
(assuming $Q$ is invertible; for semi-definite $Q$, more care is needed)
Substitute back:
$$d(\boldsymbol{\alpha}) = \frac{1}{2}(-Q^{-1}(\mathbf{c} + A^T\boldsymbol{\alpha}))^T Q (-Q^{-1}(\mathbf{c} + A^T\boldsymbol{\alpha})) + (\mathbf{c} + A^T\boldsymbol{\alpha})^T(-Q^{-1}(\mathbf{c} + A^T\boldsymbol{\alpha})) - \boldsymbol{\alpha}^T\mathbf{b}$$
After simplification (using $Q$'s symmetry):
$$d(\boldsymbol{\alpha}) = -\frac{1}{2}(\mathbf{c} + A^T\boldsymbol{\alpha})^T Q^{-1}(\mathbf{c} + A^T\boldsymbol{\alpha}) - \boldsymbol{\alpha}^T\mathbf{b}$$
The dual function is quadratic (concave) in α. Maximizing a concave quadratic subject to α ≥ 0 is another QP. This means QP duality leads to another QP—just in different variables and often with different structure that may be computationally advantageous.
Dual QP:
$$\max_{\boldsymbol{\alpha} \geq 0} -\frac{1}{2}(\mathbf{c} + A^T\boldsymbol{\alpha})^T Q^{-1}(\mathbf{c} + A^T\boldsymbol{\alpha}) - \boldsymbol{\alpha}^T\mathbf{b}$$
Or equivalently (switching to minimization and negating):
$$\min_{\boldsymbol{\alpha} \geq 0} \frac{1}{2}\boldsymbol{\alpha}^T (A Q^{-1} A^T) \boldsymbol{\alpha} + (\mathbf{b} + A Q^{-1}\mathbf{c})^T\boldsymbol{\alpha} + \frac{1}{2}\mathbf{c}^T Q^{-1}\mathbf{c}$$
The constant term doesn't affect the optimal $\boldsymbol{\alpha}$.
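A brief sketch, with an invented two-variable example and scipy assumed, solves the primal QP directly and maximizes the dual in the closed form above to confirm the optimal values coincide.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny convex QP:  min (1/2) x^T Q x + c^T x   s.t.  A x <= b
Q = np.array([[2.0, 0.0], [0.0, 2.0]])      # positive definite
c = np.array([-2.0, -5.0])
A = np.array([[1.0, 1.0]])                  # single constraint x1 + x2 <= 1
b = np.array([1.0])
Qinv = np.linalg.inv(Q)

# Primal solved numerically ('ineq' means the constraint function must be >= 0)
primal = minimize(lambda x: 0.5 * x @ Q @ x + c @ x, x0=np.zeros(2),
                  constraints=[{'type': 'ineq', 'fun': lambda x: b - A @ x}],
                  method='SLSQP')

# Dual:  max_{alpha >= 0}  -(1/2)(c + A^T a)^T Q^{-1} (c + A^T a) - a^T b
def neg_dual(alpha):
    v = c + A.T @ alpha
    return 0.5 * v @ Qinv @ v + alpha @ b   # negated for minimization

dual = minimize(neg_dual, x0=np.zeros(1), bounds=[(0, None)], method='SLSQP')

print(f"Primal optimal value: {primal.fun:.6f}")
print(f"Dual optimal value:   {-dual.fun:.6f}")  # equal by strong duality
```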
Let's derive the dual of the hard-margin SVM—the most celebrated application of Lagrangian duality in machine learning.
Hard-Margin SVM Primal:
Given training data $(\mathbf{x}_i, y_i)$ for $i = 1, \ldots, n$ with $y_i \in \{-1, +1\}$:
$$\min_{\mathbf{w}, b} \frac{1}{2}\|\mathbf{w}\|^2$$ $$\text{s.t.} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1, \quad i = 1, \ldots, n$$
Interpretation: minimizing $\frac{1}{2}\|\mathbf{w}\|^2$ maximizes the geometric margin $2/\|\mathbf{w}\|$, while the constraints require every training point to be classified correctly with functional margin at least 1.
Convert to Standard Form:
$$g_i(\mathbf{w}, b) = 1 - y_i(\mathbf{w}^T\mathbf{x}_i + b) \leq 0$$
Minimizing ||w||² with margin constraints is equivalent to maximizing the geometric margin between classes. The hard-margin SVM finds the maximum-margin separating hyperplane—the classifier that's most 'confident' on all training points.
The Lagrangian:
$$\mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 + \sum_{i=1}^{n} \alpha_i \left(1 - y_i(\mathbf{w}^T\mathbf{x}_i + b)\right)$$
$$= \frac{1}{2}\|\mathbf{w}\|^2 + \sum_i \alpha_i - \sum_i \alpha_i y_i \mathbf{w}^T\mathbf{x}_i - b\sum_i \alpha_i y_i$$
with $\alpha_i \geq 0$ for all $i$.
Now we minimize the Lagrangian over $(\mathbf{w}, b)$.
Step 1: Minimize over $b$
$$\frac{\partial \mathcal{L}}{\partial b} = -\sum_i \alpha_i y_i = 0$$
This gives the constraint: $\sum_i \alpha_i y_i = 0$
If this doesn't hold, the Lagrangian is linear in $b$ and can go to $-\infty$.
Step 2: Minimize over $\mathbf{w}$
$$\nabla_{\mathbf{w}} \mathcal{L} = \mathbf{w} - \sum_i \alpha_i y_i \mathbf{x}_i = 0$$
Solving: $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$
This is beautiful: the optimal $\mathbf{w}$ is a linear combination of training points!
The formula w = Σᵢ αᵢyᵢxᵢ shows that w is built from training points weighted by their multipliers. Points with αᵢ = 0 don't contribute at all. The points with αᵢ > 0 are the 'support vectors'—they're the only data that defines the decision boundary.
Step 3: Substitute Back
With $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$ and $\sum_i \alpha_i y_i = 0$:
$$\mathcal{L} = \frac{1}{2}\left\|\sum_j \alpha_j y_j \mathbf{x}_j\right\|^2 + \sum_i \alpha_i - \sum_i \alpha_i y_i \left(\sum_j \alpha_j y_j \mathbf{x}_j\right)^T\mathbf{x}_i$$
Expanding the squared norm: $$\left\|\sum_j \alpha_j y_j \mathbf{x}_j\right\|^2 = \sum_i \sum_j \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j$$
The third term: $$\sum_i \alpha_i y_i \mathbf{w}^T\mathbf{x}_i = \sum_i \sum_j \alpha_i \alpha_j y_i y_j \mathbf{x}_j^T \mathbf{x}_i = \sum_i \sum_j \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j$$
Combining: $$d(\boldsymbol{\alpha}) = \frac{1}{2}\sum_i \sum_j \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j + \sum_i \alpha_i - \sum_i \sum_j \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j$$
$$= \sum_i \alpha_i - \frac{1}{2}\sum_i \sum_j \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j$$
The Hard-Margin SVM Dual:
$$\max_{\boldsymbol{\alpha}} \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j$$
$$\text{s.t.} \quad \alpha_i \geq 0, \quad \sum_i \alpha_i y_i = 0$$
Matrix Form:
Define the Gram matrix $G$ with $G_{ij} = y_i y_j \mathbf{x}_i^T \mathbf{x}_j$:
$$\max_{\boldsymbol{\alpha}} \mathbf{1}^T\boldsymbol{\alpha} - \frac{1}{2}\boldsymbol{\alpha}^T G \boldsymbol{\alpha}$$ $$\text{s.t.} \quad \boldsymbol{\alpha} \geq 0, \quad \mathbf{y}^T\boldsymbol{\alpha} = 0$$
Notice that data only appears through inner products xᵢᵀxⱼ! This is the key insight enabling the kernel trick: replace inner products with kernel evaluations k(xᵢ, xⱼ) to implicitly map data to high-dimensional feature spaces without computing the mapping explicitly.
```python
import numpy as np
from scipy.optimize import minimize

def svm_dual_demo():
    """
    Demonstrate solving the hard-margin SVM dual

    Dual: max Σᵢαᵢ - (1/2)Σᵢⱼ αᵢαⱼyᵢyⱼxᵢᵀxⱼ
          s.t. αᵢ ≥ 0, Σᵢαᵢyᵢ = 0

    Equivalently (for minimization):
          min (1/2)αᵀGα - 1ᵀα   where G_ij = yᵢyⱼxᵢᵀxⱼ
    """
    # Simple linearly separable dataset
    np.random.seed(42)

    # Positive class (y = +1)
    X_pos = np.array([[2, 2], [2, 3], [3, 2]])
    # Negative class (y = -1)
    X_neg = np.array([[-1, -1], [-2, -1], [-1, -2]])

    X = np.vstack([X_pos, X_neg])
    y = np.array([1, 1, 1, -1, -1, -1])
    n = len(y)

    print("Hard-Margin SVM Dual Demo")
    print("=" * 60)
    print(f"\nDataset: {n} points in 2D")
    print(f"Positive class: {len(X_pos)} points")
    print(f"Negative class: {len(X_neg)} points")

    # Compute Gram matrix G_ij = y_i y_j x_i^T x_j
    G = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            G[i, j] = y[i] * y[j] * np.dot(X[i], X[j])

    print(f"\nGram matrix G (shape {G.shape}):")
    print(G)

    # Dual objective: min (1/2)α'Gα - 1'α
    def dual_objective(alpha):
        return 0.5 * alpha @ G @ alpha - np.sum(alpha)

    # Gradient of dual objective
    def dual_gradient(alpha):
        return G @ alpha - np.ones(n)

    # Constraints
    constraints = [
        {'type': 'eq', 'fun': lambda a: np.dot(y, a)},  # Σᵢαᵢyᵢ = 0
    ]
    bounds = [(0, None) for _ in range(n)]  # αᵢ ≥ 0

    # Solve
    alpha0 = np.ones(n) * 0.1
    result = minimize(
        dual_objective, alpha0,
        method='SLSQP',
        jac=dual_gradient,
        bounds=bounds,
        constraints=constraints
    )

    alpha_opt = result.x
    print(f"\nOptimal α: {alpha_opt}")

    # Identify support vectors (α > threshold)
    threshold = 1e-5
    sv_indices = np.where(alpha_opt > threshold)[0]
    print(f"\nSupport vector indices: {sv_indices}")
    print("Support vectors:")
    for i in sv_indices:
        print(f"  x_{i} = {X[i]}, y_{i} = {y[i]}, α_{i} = {alpha_opt[i]:.6f}")

    # Recover w from dual solution
    w = np.sum(alpha_opt[:, np.newaxis] * y[:, np.newaxis] * X, axis=0)
    print(f"\nRecovered weight vector: w = {w}")

    # Recover b from support vector (any will do)
    # From: y_i(w'x_i + b) = 1 for support vectors
    # So: b = y_i - w'x_i
    i_sv = sv_indices[0]
    b = y[i_sv] - np.dot(w, X[i_sv])
    print(f"Recovered bias: b = {b:.6f}")

    # Verify: compute margin for all points
    print("\nMargin verification (should be ≥ 1):")
    for i in range(n):
        margin = y[i] * (np.dot(w, X[i]) + b)
        status = "✓" if margin >= 1 - 1e-6 else "✗"
        sv_marker = " [SV]" if i in sv_indices else ""
        print(f"  Point {i}: margin = {margin:.4f} {status}{sv_marker}")

    # Primal objective value
    primal_obj = 0.5 * np.dot(w, w)
    # Dual objective value (negated because we minimized)
    dual_obj = -result.fun

    print(f"\nPrimal objective (1/2)||w||² = {primal_obj:.6f}")
    print(f"Dual objective = {dual_obj:.6f}")
    print(f"Duality gap = {abs(primal_obj - dual_obj):.6e}")

    return w, b, alpha_opt

svm_dual_demo()
```

The SVM dual reveals deep structure about the solution:
1. Sparsity and Support Vectors
By complementary slackness: $$\alpha_i(1 - y_i(\mathbf{w}^T\mathbf{x}_i + b)) = 0$$
Only the support vectors (points on the margin) have $\alpha_i > 0$.
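As a quick numerical check of this sparsity pattern, the snippet below (reusing svm_dual_demo from the listing above and re-declaring its small dataset) verifies that each product $\alpha_i \left(1 - y_i(\mathbf{w}^T\mathbf{x}_i + b)\right)$ is approximately zero.

```python
import numpy as np

# Same dataset as in svm_dual_demo above, re-declared here for the check
X = np.array([[2, 2], [2, 3], [3, 2], [-1, -1], [-2, -1], [-1, -2]], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1])

w, b, alpha = svm_dual_demo()   # function defined in the listing above

# Complementary slackness: alpha_i * (1 - y_i (w^T x_i + b)) = 0 for every i
for i in range(len(y)):
    slack = 1 - y[i] * (X[i] @ w + b)
    print(f"point {i}: alpha = {alpha[i]:.4f}, slack = {slack:+.4f}, "
          f"product = {alpha[i] * slack:+.2e}")
```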
2. Predictions via Inner Products
Since $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$, predictions become:
$$\text{sign}(\mathbf{w}^T\mathbf{x} + b) = \text{sign}\left(\sum_i \alpha_i y_i \mathbf{x}_i^T\mathbf{x} + b\right)$$
Again, only inner products matter!
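Here is a minimal sketch of dual-form prediction; the helper name dual_predict and the RBF kernel are hypothetical, and the arrays X, y, alpha, b are assumed to come from the demo above.

```python
import numpy as np

def dual_predict(x_new, X, y, alpha, b, kernel=None):
    """Predict sign(Σᵢ alpha_i y_i k(x_i, x_new) + b).

    With kernel=None this is exactly the linear decision function
    sign(w^T x + b), since w = Σᵢ alpha_i y_i x_i.
    """
    if kernel is None:
        kernel = lambda a, c: a @ c           # plain inner product
    score = sum(alpha[i] * y[i] * kernel(X[i], x_new) for i in range(len(y)))
    return np.sign(score + b)

# Hypothetical kernel swap (RBF): only k(x_i, x) changes, nothing else.
# Note: to actually use it, the dual must also be trained with
# G_ij = y_i y_j k(x_i, x_j) instead of the linear Gram matrix.
rbf = lambda a, c, gamma=0.5: np.exp(-gamma * np.sum((a - c) ** 2))
```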
| Aspect | Primal | Dual |
|---|---|---|
| Variables | $\mathbf{w} \in \mathbb{R}^d$, $b$ | $\boldsymbol{\alpha} \in \mathbb{R}^n$ |
| Scales with | Feature dimension $d$ | Number of samples $n$ |
| Objective | $(1/2)\|\mathbf{w}\|^2$ | $\sum_i \alpha_i - (1/2)\boldsymbol{\alpha}^T G \boldsymbol{\alpha}$ |
| Constraints | $n$ inequalities | $n$ sign constraints ($\alpha_i \geq 0$) + 1 equality |
| Kernel trick | Requires explicit features | Uses only $\mathbf{x}_i^T\mathbf{x}_j$ |
| Complexity insight | Depends on $d$ | Reveals support vectors |
For linear SVMs with d << n (few features, many samples), the primal may be more efficient. For d >> n (many features, few samples) or when using kernels, the dual is essential. The kernel trick only works in the dual because it requires inner products between data points.
The hard-margin SVM requires perfectly linearly separable data. The soft-margin SVM allows misclassifications with a penalty:
Soft-Margin Primal:
$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i$$ $$\text{s.t.} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0$$
Here $\xi_i$ are slack variables that allow margin violations, and $C > 0$ trades off margin width vs. violation penalty.
Soft-Margin Dual:
The derivation is similar. The result:
$$\max_{\boldsymbol{\alpha}} \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j$$ $$\text{s.t.} \quad 0 \leq \alpha_i \leq C, \quad \sum_i \alpha_i y_i = 0$$
The soft-margin dual differs from hard-margin only in the upper bound on α: instead of α ≥ 0, we have 0 ≤ α ≤ C. This 'box constraint' elegantly limits how much any single point can influence the solution. Points with α = C are 'saturated' support vectors—typically misclassified or close to the boundary.
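In code, the only change relative to the hard-margin demo above is the bounds passed to the solver; a sketch, with a hypothetical value of $C$ and the demo's variables assumed in scope:

```python
# Soft-margin dual: identical objective, but 0 <= alpha_i <= C (box constraint)
C = 1.0                                # hypothetical regularization value
bounds = [(0, C) for _ in range(n)]    # was [(0, None)] in the hard-margin demo

# n, dual_objective, dual_gradient, alpha0, constraints as in svm_dual_demo:
# result = minimize(dual_objective, alpha0, method='SLSQP',
#                   jac=dual_gradient, bounds=bounds, constraints=constraints)
```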
Classification of Points by α:
| Condition | Interpretation |
|---|---|
| $\alpha_i = 0$ | Correctly classified, outside margin |
| $0 < \alpha_i < C$ | Support vector, exactly on margin |
| $\alpha_i = C$ | Support vector, inside margin or misclassified |
The soft-margin formulation gracefully handles noisy or overlapping data while maintaining the elegant dual structure. The hyperparameter $C$ controls the bias-variance trade-off: large $C$ penalizes violations heavily and approaches the hard-margin solution, while small $C$ tolerates more violations in exchange for a wider margin.
We've developed the machinery for deriving dual problems: the general six-step procedure, the LP dual, the QP dual, and the SVM dual, whose structure reveals the support vectors and enables the kernel trick.
What's Next:
We've derived the SVM dual and seen hints of its power. The final page brings everything together with Applications to SVMs, showing how the constrained optimization framework we've developed enables kernel methods, efficient algorithms (SMO), and deep insights into SVM behavior.
You can now derive dual problems for linear programs, quadratic programs, and SVMs. You understand how the SVM dual reveals the support vector structure and enables the kernel trick. Next, we apply this complete framework to understand SVMs in depth.