We've established the theoretical foundations of duality: weak duality, strong duality, and the profound insight that dual problems are always convex. Now comes the practical question: how do we actually derive dual problems for specific cases?
This page develops the systematic techniques for dual formulation, focusing on problem classes central to machine learning: linear programs, quadratic programs, and the support vector machine.
Mastering these derivations isn't just an academic exercise: it reveals why SVMs work with kernels, why certain algorithms are efficient, and how to exploit problem structure for computational advantage.
By the end of this page, you will be able to derive dual problems for linear and quadratic programs, understand the general procedure, and see how the SVM dual emerges naturally—setting the stage for understanding kernel methods and support vector structure.
Deriving a dual problem follows a consistent pattern. Let's formalize it:
Step 1: Write the Primal in Standard Form
$$\min_{\mathbf{x}} f(\mathbf{x}) \quad \text{s.t.} \quad g_i(\mathbf{x}) \leq 0, \quad h_j(\mathbf{x}) = 0$$
Step 2: Form the Lagrangian
$$\mathcal{L}(\mathbf{x}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = f(\mathbf{x}) + \sum_i \alpha_i g_i(\mathbf{x}) + \sum_j \beta_j h_j(\mathbf{x})$$
with $\alpha_i \geq 0$.
Step 3: Compute the Dual Function
$$d(\boldsymbol{\alpha}, \boldsymbol{\beta}) = \inf_{\mathbf{x}} \mathcal{L}(\mathbf{x}, \boldsymbol{\alpha}, \boldsymbol{\beta})$$
This requires minimizing over $\mathbf{x}$ for fixed multipliers. For many important cases, this has a closed-form solution.
Step 4: Identify Dual Constraints
The set of $(\boldsymbol{\alpha}, \boldsymbol{\beta})$ where $d$ is finite defines implicit dual constraints, in addition to the explicit requirement $\boldsymbol{\alpha} \geq 0$.
The hardest step is Step 3: computing $\inf_{\mathbf{x}} \mathcal{L}$. Since this inner minimization is unconstrained in $\mathbf{x}$, set $\nabla_{\mathbf{x}} \mathcal{L} = 0$ and solve. The result expresses $\mathbf{x}$ in terms of $(\boldsymbol{\alpha}, \boldsymbol{\beta})$; substitute back into $\mathcal{L}$ to get $d(\boldsymbol{\alpha}, \boldsymbol{\beta})$. Watch for: (1) cases where $\mathcal{L}$ is unbounded below, giving $d = -\infty$ for those $(\boldsymbol{\alpha}, \boldsymbol{\beta})$, and (2) implicit constraints that emerge from requiring the infimum to be finite. A toy worked example follows the six steps below.
Step 5: State the Dual Problem
$$\max_{\boldsymbol{\alpha}, \boldsymbol{\beta}} d(\boldsymbol{\alpha}, \boldsymbol{\beta}) \quad \text{s.t.} \quad \boldsymbol{\alpha} \geq 0, \quad \text{(+ implicit constraints)}$$
Step 6: Simplify and Interpret
Often the dual can be written more elegantly by eliminating auxiliary multipliers or by collecting terms into matrix form, as the examples below illustrate.
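To make Step 3 concrete, here is a minimal sketch of the full procedure on a toy problem, $\min_x x^2$ subject to $x \geq 1$; the symbolic approach via sympy is just one convenient choice for this illustration.

```python
import sympy as sp

# Toy primal: min x^2  s.t.  1 - x <= 0   (i.e. x >= 1)
x, alpha = sp.symbols('x alpha', real=True)

# Step 2: Lagrangian L(x, alpha) = x^2 + alpha*(1 - x), with alpha >= 0
L = x**2 + alpha * (1 - x)

# Step 3: minimize over x by setting dL/dx = 0 (L is convex in x)
x_star = sp.solve(sp.diff(L, x), x)[0]          # x*(alpha) = alpha/2
d = sp.simplify(L.subs(x, x_star))              # d(alpha) = alpha - alpha^2/4

# Step 5: maximize d(alpha) over alpha >= 0
alpha_star = sp.solve(sp.diff(d, alpha), alpha)[0]   # alpha* = 2
print(f"x*(alpha) = {x_star},  d(alpha) = {d}")
print(f"alpha* = {alpha_star},  d(alpha*) = {d.subs(alpha, alpha_star)}")
# Primal optimum: x = 1 with value 1, so the duality gap is zero here
```

The dual function $d(\alpha) = \alpha - \alpha^2/4$ is maximized at $\alpha^* = 2$ with value $1$, matching the primal optimum $x^* = 1$.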
Linear programming is the simplest case and illustrates the procedure beautifully.
Primal LP (Standard Form):
$$\min_{\mathbf{x}} \mathbf{c}^T \mathbf{x} \quad \text{s.t.} \quad A\mathbf{x} = \mathbf{b}, \quad \mathbf{x} \geq 0$$
Step 1: Rewrite constraints
Write the equality constraint as $\mathbf{b} - A\mathbf{x} = \mathbf{0}$ with multiplier $\boldsymbol{\nu}$ (unrestricted in sign), and $\mathbf{x} \geq 0$ as $-\mathbf{x} \leq 0$ with multiplier $\boldsymbol{\lambda} \geq 0$. With $A \in \mathbb{R}^{m \times n}$, we have $\boldsymbol{\nu} \in \mathbb{R}^m$ and $\boldsymbol{\lambda} \in \mathbb{R}^n$.
Step 2: Lagrangian
$$\mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\nu}) = \mathbf{c}^T\mathbf{x} + \boldsymbol{\nu}^T(\mathbf{b} - A\mathbf{x}) - \boldsymbol{\lambda}^T \mathbf{x}$$ $$= \mathbf{c}^T\mathbf{x} + \boldsymbol{\nu}^T\mathbf{b} - \boldsymbol{\nu}^T A\mathbf{x} - \boldsymbol{\lambda}^T \mathbf{x}$$ $$= \boldsymbol{\nu}^T\mathbf{b} + (\mathbf{c} - A^T\boldsymbol{\nu} - \boldsymbol{\lambda})^T \mathbf{x}$$
Step 3: Minimize over $\mathbf{x}$
The Lagrangian is linear in $\mathbf{x}$: $$\mathcal{L} = \boldsymbol{\nu}^T\mathbf{b} + \underbrace{(\mathbf{c} - A^T\boldsymbol{\nu} - \boldsymbol{\lambda})^T}_{\text{coefficient}} \mathbf{x}$$
If the coefficient of $\mathbf{x}$ is nonzero, we can make $\mathcal{L} \to -\infty$ by taking $\mathbf{x}$ large in the right direction.
So $d(\boldsymbol{\lambda}, \boldsymbol{\nu})$ is finite only when: $$\mathbf{c} - A^T\boldsymbol{\nu} - \boldsymbol{\lambda} = 0 \quad \Rightarrow \quad A^T\boldsymbol{\nu} + \boldsymbol{\lambda} = \mathbf{c}$$
When this holds: $d(\boldsymbol{\lambda}, \boldsymbol{\nu}) = \boldsymbol{\nu}^T\mathbf{b}$
Step 4 & 5: Dual Problem
Maximize $\boldsymbol{\nu}^T\mathbf{b}$ subject to $A^T\boldsymbol{\nu} + \boldsymbol{\lambda} = \mathbf{c}$ and $\boldsymbol{\lambda} \geq 0$.
Eliminating $\boldsymbol{\lambda}$: $\boldsymbol{\lambda} = \mathbf{c} - A^T\boldsymbol{\nu} \geq 0$ means $A^T\boldsymbol{\nu} \leq \mathbf{c}$.
Dual LP:
$$\max_{\boldsymbol{\nu}} \mathbf{b}^T \boldsymbol{\nu} \quad \text{s.t.} \quad A^T \boldsymbol{\nu} \leq \mathbf{c}$$
This is the classic LP duality result!
| Primal | Dual |
|---|---|
| min $\mathbf{c}^T\mathbf{x}$ | max $\mathbf{b}^T\boldsymbol{\nu}$ |
| $A\mathbf{x} = \mathbf{b}$ | $\boldsymbol{\nu}$ unrestricted |
| $\mathbf{x} \geq 0$ | $A^T\boldsymbol{\nu} \leq \mathbf{c}$ |
| $n$ variables | $m$ variables |
| $m$ equality constraints | $n$ inequality constraints |
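To sanity-check this result numerically, here is a small sketch, assuming scipy is available and using arbitrary made-up data for $A$, $\mathbf{b}$, $\mathbf{c}$.

```python
import numpy as np
from scipy.optimize import linprog

# Small primal LP:  min c^T x  s.t.  A x = b,  x >= 0
# (data chosen arbitrarily so that both problems are feasible and bounded)
A = np.array([[1.0, 1.0, 1.0],
              [1.0, 2.0, 0.0]])
b = np.array([4.0, 5.0])
c = np.array([2.0, 3.0, 1.0])

# Primal: linprog minimizes by default
primal = linprog(c, A_eq=A, b_eq=b, bounds=[(0, None)] * 3, method='highs')

# Dual:  max b^T nu  s.t.  A^T nu <= c,  nu free
# -> minimize -b^T nu with inequality constraints A^T nu <= c
dual = linprog(-b, A_ub=A.T, b_ub=c, bounds=[(None, None)] * 2, method='highs')

print(f"Primal optimal value c^T x*  = {primal.fun:.6f}")
print(f"Dual optimal value   b^T nu* = {-dual.fun:.6f}")  # negate back
# For feasible, bounded LPs strong duality guarantees the two values match
```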
Quadratic programs (QP) are central to SVMs, ridge regression, and many ML applications.
Primal QP (Standard Form):
$$\min_{\mathbf{x}} \frac{1}{2}\mathbf{x}^T Q \mathbf{x} + \mathbf{c}^T\mathbf{x}$$ $$\text{s.t.} \quad A\mathbf{x} \leq \mathbf{b}$$
where $Q$ is symmetric positive semi-definite (for convexity).
Step 1 & 2: Lagrangian
$$\mathcal{L}(\mathbf{x}, \boldsymbol{\alpha}) = \frac{1}{2}\mathbf{x}^T Q \mathbf{x} + \mathbf{c}^T\mathbf{x} + \boldsymbol{\alpha}^T(A\mathbf{x} - \mathbf{b})$$
with $\boldsymbol{\alpha} \geq 0$.
Rearranging: $$\mathcal{L} = \frac{1}{2}\mathbf{x}^T Q \mathbf{x} + (\mathbf{c} + A^T\boldsymbol{\alpha})^T\mathbf{x} - \boldsymbol{\alpha}^T\mathbf{b}$$
Step 3: Minimize over $\mathbf{x}$
For positive definite $Q$, the Lagrangian is strictly convex in $\mathbf{x}$. Set:
$$\nabla_{\mathbf{x}} \mathcal{L} = Q\mathbf{x} + \mathbf{c} + A^T\boldsymbol{\alpha} = 0$$
Solving: $$\mathbf{x}^* = -Q^{-1}(\mathbf{c} + A^T\boldsymbol{\alpha})$$
(assuming $Q$ is invertible; for semi-definite $Q$, more care is needed)
Substitute back:
$$d(\boldsymbol{\alpha}) = \frac{1}{2}(-Q^{-1}(\mathbf{c} + A^T\boldsymbol{\alpha}))^T Q (-Q^{-1}(\mathbf{c} + A^T\boldsymbol{\alpha})) + (\mathbf{c} + A^T\boldsymbol{\alpha})^T(-Q^{-1}(\mathbf{c} + A^T\boldsymbol{\alpha})) - \boldsymbol{\alpha}^T\mathbf{b}$$
After simplification (using $Q$'s symmetry):
$$d(\boldsymbol{\alpha}) = -\frac{1}{2}(\mathbf{c} + A^T\boldsymbol{\alpha})^T Q^{-1}(\mathbf{c} + A^T\boldsymbol{\alpha}) - \boldsymbol{\alpha}^T\mathbf{b}$$
The dual function is quadratic (concave) in α. Maximizing a concave quadratic subject to α ≥ 0 is another QP. This means QP duality leads to another QP—just in different variables and often with different structure that may be computationally advantageous.
Dual QP:
$$\max_{\boldsymbol{\alpha} \geq 0} -\frac{1}{2}(\mathbf{c} + A^T\boldsymbol{\alpha})^T Q^{-1}(\mathbf{c} + A^T\boldsymbol{\alpha}) - \boldsymbol{\alpha}^T\mathbf{b}$$
Or equivalently (switching to minimization and negating):
$$\min_{\boldsymbol{\alpha} \geq 0} \frac{1}{2}\boldsymbol{\alpha}^T (A Q^{-1} A^T) \boldsymbol{\alpha} + (\mathbf{b} + A Q^{-1}\mathbf{c})^T\boldsymbol{\alpha} + \frac{1}{2}\mathbf{c}^T Q^{-1}\mathbf{c}$$
The constant term doesn't affect the optimal $\boldsymbol{\alpha}$.
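A brief sketch, with an invented two-variable example and scipy assumed, solves the primal QP directly and maximizes the dual in the closed form above to confirm the optimal values coincide.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny convex QP:  min (1/2) x^T Q x + c^T x   s.t.  A x <= b
Q = np.array([[2.0, 0.0], [0.0, 2.0]])      # positive definite
c = np.array([-2.0, -5.0])
A = np.array([[1.0, 1.0]])                  # single constraint x1 + x2 <= 1
b = np.array([1.0])
Qinv = np.linalg.inv(Q)

# Primal solved numerically ('ineq' means the constraint function must be >= 0)
primal = minimize(lambda x: 0.5 * x @ Q @ x + c @ x, x0=np.zeros(2),
                  constraints=[{'type': 'ineq', 'fun': lambda x: b - A @ x}],
                  method='SLSQP')

# Dual:  max_{alpha >= 0}  -(1/2)(c + A^T a)^T Q^{-1} (c + A^T a) - a^T b
def neg_dual(alpha):
    v = c + A.T @ alpha
    return 0.5 * v @ Qinv @ v + alpha @ b   # negated for minimization

dual = minimize(neg_dual, x0=np.zeros(1), bounds=[(0, None)], method='SLSQP')

print(f"Primal optimal value: {primal.fun:.6f}")
print(f"Dual optimal value:   {-dual.fun:.6f}")  # equal by strong duality
```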
Let's derive the dual of the hard-margin SVM—the most celebrated application of Lagrangian duality in machine learning.
Hard-Margin SVM Primal:
Given training data $(\mathbf{x}_i, y_i)$ for $i = 1, \ldots, n$ with $y_i \in \{-1, +1\}$:
$$\min_{\mathbf{w}, b} \frac{1}{2}\|\mathbf{w}\|^2$$ $$\text{s.t.} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1, \quad i = 1, \ldots, n$$
Interpretation: minimizing $\frac{1}{2}\|\mathbf{w}\|^2$ maximizes the geometric margin $2/\|\mathbf{w}\|$, while the constraints require every training point to be classified correctly with functional margin at least 1.
Convert to Standard Form:
$$g_i(\mathbf{w}, b) = 1 - y_i(\mathbf{w}^T\mathbf{x}_i + b) \leq 0$$
Minimizing ||w||² with margin constraints is equivalent to maximizing the geometric margin between classes. The hard-margin SVM finds the maximum-margin separating hyperplane—the classifier that's most 'confident' on all training points.
The Lagrangian:
$$\mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 + \sum_{i=1}^{n} \alpha_i \left(1 - y_i(\mathbf{w}^T\mathbf{x}_i + b)\right)$$
$$= \frac{1}{2}\|\mathbf{w}\|^2 + \sum_i \alpha_i - \sum_i \alpha_i y_i \mathbf{w}^T\mathbf{x}_i - b\sum_i \alpha_i y_i$$
with $\alpha_i \geq 0$ for all $i$.
Now we minimize the Lagrangian over $(\mathbf{w}, b)$.
Step 1: Minimize over $b$
$$\frac{\partial \mathcal{L}}{\partial b} = -\sum_i \alpha_i y_i = 0$$
This gives the constraint: $\sum_i \alpha_i y_i = 0$
If this doesn't hold, the Lagrangian is linear in $b$ and can go to $-\infty$.
Step 2: Minimize over $\mathbf{w}$
$$\nabla_{\mathbf{w}} \mathcal{L} = \mathbf{w} - \sum_i \alpha_i y_i \mathbf{x}_i = 0$$
Solving: $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$
This is beautiful: the optimal $\mathbf{w}$ is a linear combination of training points!
The formula w = Σᵢ αᵢyᵢxᵢ shows that w is built from training points weighted by their multipliers. Points with αᵢ = 0 don't contribute at all. The points with αᵢ > 0 are the 'support vectors'—they're the only data that defines the decision boundary.
Step 3: Substitute Back
With $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$ and $\sum_i \alpha_i y_i = 0$:
$$\mathcal{L} = \frac{1}{2}\left\|\sum_j \alpha_j y_j \mathbf{x}_j\right\|^2 + \sum_i \alpha_i - \sum_i \alpha_i y_i \left(\sum_j \alpha_j y_j \mathbf{x}_j\right)^T\mathbf{x}_i$$
Expanding the squared norm: $$\left\|\sum_j \alpha_j y_j \mathbf{x}_j\right\|^2 = \sum_i \sum_j \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j$$
The third term: $$\sum_i \alpha_i y_i \mathbf{w}^T\mathbf{x}_i = \sum_i \sum_j \alpha_i \alpha_j y_i y_j \mathbf{x}_j^T \mathbf{x}_i = \sum_i \sum_j \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j$$
Combining: $$d(\boldsymbol{\alpha}) = \frac{1}{2}\sum_i \sum_j \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j + \sum_i \alpha_i - \sum_i \sum_j \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j$$
$$= \sum_i \alpha_i - \frac{1}{2}\sum_i \sum_j \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j$$
The Hard-Margin SVM Dual:
$$\max_{\boldsymbol{\alpha}} \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j$$
$$\text{s.t.} \quad \alpha_i \geq 0, \quad \sum_i \alpha_i y_i = 0$$
Matrix Form:
Define the Gram matrix $G$ with $G_{ij} = y_i y_j \mathbf{x}_i^T \mathbf{x}_j$:
$$\max_{\boldsymbol{\alpha}} \mathbf{1}^T\boldsymbol{\alpha} - \frac{1}{2}\boldsymbol{\alpha}^T G \boldsymbol{\alpha}$$ $$\text{s.t.} \quad \boldsymbol{\alpha} \geq 0, \quad \mathbf{y}^T\boldsymbol{\alpha} = 0$$
Notice that data only appears through inner products xᵢᵀxⱼ! This is the key insight enabling the kernel trick: replace inner products with kernel evaluations k(xᵢ, xⱼ) to implicitly map data to high-dimensional feature spaces without computing the mapping explicitly.
```python
import numpy as np
from scipy.optimize import minimize

def svm_dual_demo():
    """
    Demonstrate solving the hard-margin SVM dual

    Dual: max Σᵢαᵢ - (1/2)Σᵢⱼ αᵢαⱼyᵢyⱼxᵢᵀxⱼ
          s.t. αᵢ ≥ 0, Σᵢαᵢyᵢ = 0

    Equivalently (for minimization):
          min (1/2)αᵀGα - 1ᵀα   where G_ij = yᵢyⱼxᵢᵀxⱼ
    """
    # Simple linearly separable dataset
    np.random.seed(42)

    # Positive class (y = +1)
    X_pos = np.array([[2, 2], [2, 3], [3, 2]])
    # Negative class (y = -1)
    X_neg = np.array([[-1, -1], [-2, -1], [-1, -2]])

    X = np.vstack([X_pos, X_neg])
    y = np.array([1, 1, 1, -1, -1, -1])
    n = len(y)

    print("Hard-Margin SVM Dual Demo")
    print("=" * 60)
    print(f"\nDataset: {n} points in 2D")
    print(f"Positive class: {len(X_pos)} points")
    print(f"Negative class: {len(X_neg)} points")

    # Compute Gram matrix G_ij = y_i y_j x_i^T x_j
    G = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            G[i, j] = y[i] * y[j] * np.dot(X[i], X[j])

    print(f"\nGram matrix G (shape {G.shape}):")
    print(G)

    # Dual objective: min (1/2)α'Gα - 1'α
    def dual_objective(alpha):
        return 0.5 * alpha @ G @ alpha - np.sum(alpha)

    # Gradient of dual objective
    def dual_gradient(alpha):
        return G @ alpha - np.ones(n)

    # Constraints
    constraints = [
        {'type': 'eq', 'fun': lambda a: np.dot(y, a)},  # Σᵢαᵢyᵢ = 0
    ]
    bounds = [(0, None) for _ in range(n)]  # αᵢ ≥ 0

    # Solve
    alpha0 = np.ones(n) * 0.1
    result = minimize(
        dual_objective, alpha0,
        method='SLSQP',
        jac=dual_gradient,
        bounds=bounds,
        constraints=constraints
    )

    alpha_opt = result.x
    print(f"\nOptimal α: {alpha_opt}")

    # Identify support vectors (α > threshold)
    threshold = 1e-5
    sv_indices = np.where(alpha_opt > threshold)[0]
    print(f"\nSupport vector indices: {sv_indices}")
    print("Support vectors:")
    for i in sv_indices:
        print(f"  x_{i} = {X[i]}, y_{i} = {y[i]}, α_{i} = {alpha_opt[i]:.6f}")

    # Recover w from dual solution
    w = np.sum(alpha_opt[:, np.newaxis] * y[:, np.newaxis] * X, axis=0)
    print(f"\nRecovered weight vector: w = {w}")

    # Recover b from support vector (any will do)
    # From: y_i(w'x_i + b) = 1 for support vectors
    # So: b = y_i - w'x_i
    i_sv = sv_indices[0]
    b = y[i_sv] - np.dot(w, X[i_sv])
    print(f"Recovered bias: b = {b:.6f}")

    # Verify: compute margin for all points
    print("\nMargin verification (should be ≥ 1):")
    for i in range(n):
        margin = y[i] * (np.dot(w, X[i]) + b)
        status = "✓" if margin >= 1 - 1e-6 else "✗"
        sv_marker = " [SV]" if i in sv_indices else ""
        print(f"  Point {i}: margin = {margin:.4f} {status}{sv_marker}")

    # Primal objective value
    primal_obj = 0.5 * np.dot(w, w)
    # Dual objective value (negated because we minimized)
    dual_obj = -result.fun

    print(f"\nPrimal objective (1/2)||w||² = {primal_obj:.6f}")
    print(f"Dual objective = {dual_obj:.6f}")
    print(f"Duality gap = {abs(primal_obj - dual_obj):.6e}")

    return w, b, alpha_opt

svm_dual_demo()
```

The SVM dual reveals deep structure about the solution:
1. Sparsity and Support Vectors
By complementary slackness: $$\alpha_i(1 - y_i(\mathbf{w}^T\mathbf{x}_i + b)) = 0$$
Only the support vectors (points on the margin) have $\alpha_i > 0$.
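As a quick numerical check of this sparsity pattern, the snippet below (reusing svm_dual_demo from the listing above and re-declaring its small dataset) verifies that each product $\alpha_i \left(1 - y_i(\mathbf{w}^T\mathbf{x}_i + b)\right)$ is approximately zero.

```python
import numpy as np

# Same dataset as in svm_dual_demo above, re-declared here for the check
X = np.array([[2, 2], [2, 3], [3, 2], [-1, -1], [-2, -1], [-1, -2]], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1])

w, b, alpha = svm_dual_demo()   # function defined in the listing above

# Complementary slackness: alpha_i * (1 - y_i (w^T x_i + b)) = 0 for every i
for i in range(len(y)):
    slack = 1 - y[i] * (X[i] @ w + b)
    print(f"point {i}: alpha = {alpha[i]:.4f}, slack = {slack:+.4f}, "
          f"product = {alpha[i] * slack:+.2e}")
```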
2. Predictions via Inner Products
Since $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$, predictions become:
$$\text{sign}(\mathbf{w}^T\mathbf{x} + b) = \text{sign}\left(\sum_i \alpha_i y_i \mathbf{x}_i^T\mathbf{x} + b\right)$$
Again, only inner products matter!
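Here is a minimal sketch of dual-form prediction; the helper name dual_predict and the RBF kernel are hypothetical, and the arrays X, y, alpha, b are assumed to come from the demo above.

```python
import numpy as np

def dual_predict(x_new, X, y, alpha, b, kernel=None):
    """Predict sign(Σᵢ alpha_i y_i k(x_i, x_new) + b).

    With kernel=None this is exactly the linear decision function
    sign(w^T x + b), since w = Σᵢ alpha_i y_i x_i.
    """
    if kernel is None:
        kernel = lambda a, c: a @ c           # plain inner product
    score = sum(alpha[i] * y[i] * kernel(X[i], x_new) for i in range(len(y)))
    return np.sign(score + b)

# Hypothetical kernel swap (RBF): only k(x_i, x) changes, nothing else.
# Note: to actually use it, the dual must also be trained with
# G_ij = y_i y_j k(x_i, x_j) instead of the linear Gram matrix.
rbf = lambda a, c, gamma=0.5: np.exp(-gamma * np.sum((a - c) ** 2))
```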
| Aspect | Primal | Dual |
|---|---|---|
| Variables | $\mathbf{w} \in \mathbb{R}^d$, $b$ | $\boldsymbol{\alpha} \in \mathbb{R}^n$ |
| Scales with | Feature dimension $d$ | Number of samples $n$ |
| Objective | $(1/2)\|\mathbf{w}\|^2$ | $\sum_i \alpha_i - (1/2)\boldsymbol{\alpha}^T G \boldsymbol{\alpha}$ |
| Constraints | $n$ inequalities | $n$ sign constraints ($\alpha_i \geq 0$) + 1 equality |
| Kernel trick | Requires explicit features | Uses only $\mathbf{x}_i^T\mathbf{x}_j$ |
| Complexity insight | Depends on $d$ | Reveals support vectors |
For linear SVMs with d << n (few features, many samples), the primal may be more efficient. For d >> n (many features, few samples) or when using kernels, the dual is essential. The kernel trick only works in the dual because it requires inner products between data points.
The hard-margin SVM requires perfectly linearly separable data. The soft-margin SVM allows misclassifications with a penalty:
Soft-Margin Primal:
$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i$$ $$\text{s.t.} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0$$
Here $\xi_i$ are slack variables that allow margin violations, and $C > 0$ trades off margin width vs. violation penalty.
Soft-Margin Dual:
The derivation is similar. The result:
$$\max_{\boldsymbol{\alpha}} \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j$$ $$\text{s.t.} \quad 0 \leq \alpha_i \leq C, \quad \sum_i \alpha_i y_i = 0$$
The soft-margin dual differs from hard-margin only in the upper bound on α: instead of α ≥ 0, we have 0 ≤ α ≤ C. This 'box constraint' elegantly limits how much any single point can influence the solution. Points with α = C are 'saturated' support vectors—typically misclassified or close to the boundary.
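In code, the only change relative to the hard-margin demo above is the bounds passed to the solver; a sketch, with a hypothetical value of $C$ and the demo's variables assumed in scope:

```python
# Soft-margin dual: identical objective, but 0 <= alpha_i <= C (box constraint)
C = 1.0                                # hypothetical regularization value
bounds = [(0, C) for _ in range(n)]    # was [(0, None)] in the hard-margin demo

# n, dual_objective, dual_gradient, alpha0, constraints as in svm_dual_demo:
# result = minimize(dual_objective, alpha0, method='SLSQP',
#                   jac=dual_gradient, bounds=bounds, constraints=constraints)
```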
Classification of Points by α:
| Condition | Interpretation |
|---|---|
| $\alpha_i = 0$ | Correctly classified, outside margin |
| $0 < \alpha_i < C$ | Support vector, exactly on margin |
| $\alpha_i = C$ | Support vector, inside margin or misclassified |
The soft-margin formulation gracefully handles noisy or overlapping data while maintaining the elegant dual structure. The hyperparameter $C$ controls the bias-variance trade-off: large $C$ penalizes violations heavily and approaches the hard-margin solution, while small $C$ tolerates more violations in exchange for a wider margin.
We've developed the machinery for deriving dual problems: the general six-step procedure, the LP dual, the QP dual, and the SVM dual, whose structure reveals the support vectors and enables the kernel trick.
What's Next:
We've derived the SVM dual and seen hints of its power. The final page brings everything together with Applications to SVMs, showing how the constrained optimization framework we've developed enables kernel methods, efficient algorithms (SMO), and deep insights into SVM behavior.
You can now derive dual problems for linear programs, quadratic programs, and SVMs. You understand how the SVM dual reveals the support vector structure and enables the kernel trick. Next, we apply this complete framework to understand SVMs in depth.