We have formulated hard margin SVM as a convex quadratic program and understood how general QP solvers approach such problems. Now we take a different path—one that reveals deep structure and ultimately enables the famous kernel trick.
The Lagrangian method transforms constrained optimization into unconstrained optimization by incorporating constraints into the objective via Lagrange multipliers. For convex problems, this leads to duality theory: the primal and dual problems are equivalent, providing two perspectives on the same solution.
This page develops the complete Lagrangian derivation of hard margin SVM. We will construct the Lagrangian, derive the conditions for optimality, and see how minimizing over primal variables yields an elegant dual formulation.
By the end of this page, you will: (1) understand the Lagrangian function and its role in constrained optimization, (2) construct the Lagrangian for hard margin SVM, (3) derive the stationarity conditions, (4) perform the minimization to eliminate primal variables, and (5) obtain the dual objective function.
Before diving into SVM specifics, let's understand the general framework of Lagrangian duality.
The General Constrained Problem:
$$\begin{aligned} \min_{\mathbf{x}} \quad & f(\mathbf{x}) \\ \text{s.t.} \quad & g_i(\mathbf{x}) \leq 0, \quad i = 1,\ldots,m \\ & h_j(\mathbf{x}) = 0, \quad j = 1,\ldots,p \end{aligned}$$
This is the primal problem. Let $p^*$ denote its optimal value.
The Lagrangian Function:
The Lagrangian incorporates constraints into the objective:
$$\mathcal{L}(\mathbf{x}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = f(\mathbf{x}) + \sum_{i=1}^m \alpha_i g_i(\mathbf{x}) + \sum_{j=1}^p \beta_j h_j(\mathbf{x})$$
where $\alpha_i \geq 0$ are the Lagrange multipliers for the inequality constraints $g_i(\mathbf{x}) \leq 0$, and $\beta_j \in \mathbb{R}$ are the multipliers for the equality constraints $h_j(\mathbf{x}) = 0$.
Lagrange multipliers act as "prices" for constraint violations. When $\alpha_i > 0$, any violation of constraint $g_i \leq 0$ increases the Lagrangian, penalizing infeasibility. When $\alpha_i = 0$, the constraint doesn't affect the objective—it's slack. This mirrors the economics interpretation where multipliers represent shadow prices.
The Dual Function:
The Lagrangian dual function minimizes the Lagrangian over primal variables:
$$g(\boldsymbol{\alpha}, \boldsymbol{\beta}) = \inf_{\mathbf{x}} \mathcal{L}(\mathbf{x}, \boldsymbol{\alpha}, \boldsymbol{\beta})$$
This function is always concave (as a pointwise infimum of functions that are affine in $(\boldsymbol{\alpha}, \boldsymbol{\beta})$), regardless of whether the primal is convex.
The Dual Problem:
$$\begin{aligned} \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}} \quad & g(\boldsymbol{\alpha}, \boldsymbol{\beta}) \\ \text{s.t.} \quad & \alpha_i \geq 0, \quad i = 1,\ldots,m \end{aligned}$$
Let $d^*$ denote its optimal value.
Weak Duality:
$$d^* \leq p^*$$
The dual optimal value never exceeds the primal optimal value. This holds for any optimization problem.
Strong Duality:
For convex problems satisfying constraint qualifications (e.g., Slater's condition):
$$d^* = p^*$$
The primal and dual have the same optimal value. This is the remarkable result that makes duality powerful.
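A quick toy illustration (not part of the SVM derivation) makes the machinery concrete. Consider $\min_x x^2$ subject to $1 - x \leq 0$, whose optimum is $p^* = 1$ at $x = 1$. The Lagrangian is $\mathcal{L}(x, \alpha) = x^2 + \alpha(1 - x)$; minimizing over $x$ gives $x = \alpha/2$, so the dual function is $g(\alpha) = \alpha - \alpha^2/4$. Maximizing over $\alpha \geq 0$ yields $\alpha^* = 2$ and $d^* = 1 = p^*$: strong duality holds, as expected for a convex problem with a strictly feasible point (e.g., $x = 2$).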
Now we apply the Lagrangian framework to the hard margin SVM primal.
The Primal Problem (restated):
$$\begin{aligned} \min_{\mathbf{w}, b} \quad & \frac{1}{2}\|\mathbf{w}\|^2 \\ \text{s.t.} \quad & 1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b) \leq 0, \quad i = 1,\ldots,n \end{aligned}$$
Note: We've rewritten $y_i(\mathbf{w}^\top\mathbf{x}_i + b) \geq 1$ as $1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b) \leq 0$ to match the standard form $g_i(\mathbf{x}) \leq 0$.
Introduce Lagrange Multipliers:
For each constraint $i = 1,\ldots,n$, we introduce a multiplier $\alpha_i \geq 0$.
The SVM Lagrangian:
$$\mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^n \alpha_i \left[ y_i(\mathbf{w}^\top\mathbf{x}_i + b) - 1 \right]$$
where $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_n) \geq \mathbf{0}$ are the Lagrange multipliers.
Expanding the Lagrangian:
$$\mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\mathbf{w}^\top\mathbf{w} - \sum_{i=1}^n \alpha_i y_i \mathbf{w}^\top\mathbf{x}_i - b\sum_{i=1}^n \alpha_i y_i + \sum_{i=1}^n \alpha_i$$
Let's identify the components:
Why This Sign Convention?
We wrote the constraint as $g_i = 1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b) \leq 0$, so the Lagrangian is: $$f + \sum \alpha_i g_i = \frac{1}{2}\|\mathbf{w}\|^2 + \sum \alpha_i[1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b)]$$
Pulling a minus sign out of each bracket, $\alpha_i[1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b)] = -\alpha_i[y_i(\mathbf{w}^\top\mathbf{x}_i + b) - 1]$, gives the form above.
| Term | Expression | Type in $\mathbf{w}$ | Type in $b$ |
|---|---|---|---|
| Objective | $\frac{1}{2}\mathbf{w}^\top\mathbf{w}$ | Quadratic | Constant |
| Data coupling | $-\sum \alpha_i y_i \mathbf{w}^\top\mathbf{x}_i$ | Linear | Constant |
| Bias term | $-b\sum \alpha_i y_i$ | Constant | Linear |
| Constant | $\sum \alpha_i$ | Constant | Constant |
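As a concrete check of this expansion, here is a minimal NumPy sketch (the arrays `X`, `y`, `w`, `b`, `alpha` are illustrative placeholders, not anything defined in the text) that evaluates the Lagrangian in both its compact and expanded forms:

```python
import numpy as np

def svm_lagrangian(w, b, alpha, X, y):
    """Hard margin SVM Lagrangian L(w, b, alpha).

    X: (n, d) data, y: (n,) labels in {-1, +1},
    w: (d,) weights, b: scalar bias, alpha: (n,) multipliers >= 0.
    """
    margins = y * (X @ w + b)                      # y_i (w^T x_i + b)
    return 0.5 * (w @ w) - alpha @ (margins - 1)   # f + sum_i alpha_i g_i

# Check that the compact and expanded forms agree on random inputs
rng = np.random.default_rng(0)
X, y = rng.normal(size=(5, 2)), np.array([1.0, 1.0, -1.0, -1.0, 1.0])
w, b, alpha = rng.normal(size=2), 0.3, rng.uniform(size=5)

expanded = (0.5 * (w @ w) - (alpha * y) @ (X @ w)
            - b * (alpha * y).sum() + alpha.sum())
assert np.isclose(svm_lagrangian(w, b, alpha, X, y), expanded)
```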
The Lagrangian provides a minimax characterization of the primal problem.
Key Observation:
For any fixed $(\mathbf{w}, b)$:
$$\sup_{\boldsymbol{\alpha} \geq \mathbf{0}} \mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \begin{cases} \frac{1}{2}\|\mathbf{w}\|^2 & \text{if } y_i(\mathbf{w}^\top\mathbf{x}_i + b) \geq 1 \ \forall i \\ +\infty & \text{otherwise} \end{cases}$$
Proof:
If $(\mathbf{w}, b)$ is feasible: every bracket $y_i(\mathbf{w}^\top\mathbf{x}_i + b) - 1 \geq 0$, so each term $-\alpha_i[y_i(\mathbf{w}^\top\mathbf{x}_i + b) - 1]$ is nonpositive. The supremum is attained at $\boldsymbol{\alpha} = \mathbf{0}$, leaving exactly $\frac{1}{2}\|\mathbf{w}\|^2$.
If $(\mathbf{w}, b)$ violates constraint $j$: $y_j(\mathbf{w}^\top\mathbf{x}_j + b) - 1 < 0$, so the term $-\alpha_j[y_j(\mathbf{w}^\top\mathbf{x}_j + b) - 1]$ grows without bound as $\alpha_j \to \infty$, and the supremum is $+\infty$.
The primal problem is equivalent to: $$p^* = \min_{\mathbf{w}, b} \sup_{\boldsymbol{\alpha} \geq \mathbf{0}} \mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha})$$
The inner supremum enforces feasibility (infinite penalty for violations), while the outer minimization finds the optimal feasible point.
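The penalty mechanism is easy to see numerically; here is a small sketch with a hypothetical two-point dataset (all names illustrative):

```python
import numpy as np

X = np.array([[2.0], [-2.0]])        # one 1-D point per class
y = np.array([1.0, -1.0])

def L(w, b, alpha):
    return 0.5 * (w @ w) - alpha @ (y * (X @ w + b) - 1)

w_feasible = np.array([1.0])         # margins are 2 >= 1: constraints satisfied
w_violating = np.array([0.1])        # margins are 0.2 < 1: constraints violated
for a in [0.0, 1.0, 10.0, 100.0]:
    alpha = np.full(2, a)
    # Feasible point: L decreases in alpha, so the sup is at alpha = 0 (value 0.5).
    # Violating point: L grows without bound as alpha increases.
    print(a, L(w_feasible, 0.0, alpha), L(w_violating, 0.0, alpha))
```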
The Dual Problem:
Swapping the order of min and max:
$$d^* = \max_{\boldsymbol{\alpha} \geq \mathbf{0}} \min_{\mathbf{w}, b} \mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha})$$
This is the dual problem. By weak duality: $d^* \leq p^*$.
Strong Duality for SVM:
The SVM primal is convex and satisfies Slater's condition (there exists a strictly feasible point when data is linearly separable). Therefore:
$$d^* = p^*$$
We can solve either problem to find the optimal solution!
Why Solve the Dual?
The dual depends on the data only through inner products $\mathbf{x}_i^\top\mathbf{x}_j$ (which is what makes the kernel trick possible), its solution is typically sparse (most $\alpha_i$ end up zero), and its constraint set (nonnegativity plus a single equality) is simpler than the primal's $n$ margin inequalities. The next page examines these advantages in detail; here we carry out the inner minimization.
To find $\min_{\mathbf{w}, b} \mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha})$, we set the partial derivatives to zero.
Derivative with Respect to $\mathbf{w}$:
$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^n \alpha_i y_i \mathbf{x}_i = \mathbf{0}$$
Solving for $\mathbf{w}$:
$$\mathbf{w} = \sum_{i=1}^n \alpha_i y_i \mathbf{x}_i$$
The optimal weight vector is a linear combination of the training points, weighted by $\alpha_i y_i$.
Derivative with Respect to $b$:
$$\frac{\partial \mathcal{L}}{\partial b} = -\sum_{i=1}^n \alpha_i y_i = 0$$
$$\sum_{i=1}^n \alpha_i y_i = 0$$
The weighted sum of labels must be zero. This is an equality constraint on the dual variables.
Physical Interpretation:
The condition $\mathbf{w} = \sum_{i=1}^n \alpha_i y_i \mathbf{x}_i$ has profound implications:
Representer Theorem: The optimal $\mathbf{w}$ lies in the span of the training data. We don't need to search over all possible weight vectors—just linear combinations of data points.
Support Vector Sparsity: Since $\alpha_i = 0$ for non-support-vectors (by complementary slackness), $\mathbf{w}$ is a linear combination of only the support vectors: $$\mathbf{w} = \sum_{i \in SV} \alpha_i y_i \mathbf{x}_i$$
Class Balance: The condition $\sum \alpha_i y_i = 0$ means positive and negative class contributions balance. Since $y_i \in \{-1, +1\}$: $$\sum_{i: y_i = +1} \alpha_i = \sum_{i: y_i = -1} \alpha_i$$
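In code, the two stationarity conditions are one line each (a minimal sketch; the names are illustrative):

```python
import numpy as np

def stationarity(alpha, X, y):
    """Evaluate the two stationarity quantities for a candidate alpha."""
    w = (alpha * y) @ X         # w = sum_i alpha_i y_i x_i   (shape (d,))
    balance = alpha @ y         # must equal 0 at any dual-feasible alpha
    return w, balance

# The balance condition is exactly the class-balance statement above:
# alpha[y == 1].sum() == alpha[y == -1].sum() whenever alpha @ y == 0.
```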
With the stationarity conditions in hand, we substitute back into the Lagrangian to eliminate $\mathbf{w}$ and $b$, obtaining the dual objective.
Starting Point:
$$\mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\mathbf{w}^\top\mathbf{w} - \sum_{i=1}^n \alpha_i y_i \mathbf{w}^\top\mathbf{x}_i - b\sum_{i=1}^n \alpha_i y_i + \sum_{i=1}^n \alpha_i$$
Substituting $\mathbf{w} = \sum_j \alpha_j y_j \mathbf{x}_j$:
Term 1: $\frac{1}{2}\mathbf{w}^\top\mathbf{w}$
$$\frac{1}{2}\mathbf{w}^\top\mathbf{w} = \frac{1}{2}\left(\sum_i \alpha_i y_i \mathbf{x}_i\right)^\top \left(\sum_j \alpha_j y_j \mathbf{x}_j\right)$$
$$= \frac{1}{2}\sum_i \sum_j \alpha_i \alpha_j y_i y_j \mathbf{x}_i^\top \mathbf{x}_j$$
Term 2: $-\sum_i \alpha_i y_i \mathbf{w}^\top\mathbf{x}_i$
$$-\sum_i \alpha_i y_i \mathbf{w}^\top\mathbf{x}_i = -\sum_i \alpha_i y_i \left(\sum_j \alpha_j y_j \mathbf{x}_j\right)^\top \mathbf{x}_i$$
$$= -\sum_i \sum_j \alpha_i \alpha_j y_i y_j \mathbf{x}_j^\top \mathbf{x}_i = -\sum_i \sum_j \alpha_i \alpha_j y_i y_j \mathbf{x}_i^\top \mathbf{x}_j$$
Term 3: $-b\sum_i \alpha_i y_i$
By the second stationarity condition, $\sum_i \alpha_i y_i = 0$, so: $$-b\sum_i \alpha_i y_i = 0$$
Term 4: $\sum_i \alpha_i$
Remains as is.
Notice that Term 2 is exactly $-2 \times$ Term 1! When we combine: $$\frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^\top\mathbf{x}_j - \sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^\top\mathbf{x}_j = -\frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^\top\mathbf{x}_j$$
Combining All Terms:
$$g(\boldsymbol{\alpha}) = \min_{\mathbf{w}, b} \mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i \alpha_j y_i y_j \mathbf{x}_i^\top\mathbf{x}_j$$
This is the dual objective function $g(\boldsymbol{\alpha})$.
Matrix Notation:
Defining the Gram matrix $\mathbf{K}$ with entries $K_{ij} = \mathbf{x}_i^\top\mathbf{x}_j$, and the vector $\mathbf{y} = (y_1, \ldots, y_n)^\top$:
$$g(\boldsymbol{\alpha}) = \boldsymbol{\alpha}^\top \mathbf{1} - \frac{1}{2}\boldsymbol{\alpha}^\top (\mathbf{y}\mathbf{y}^\top \odot \mathbf{K}) \boldsymbol{\alpha}$$
where $\odot$ denotes element-wise (Hadamard) product, or equivalently:
$$g(\boldsymbol{\alpha}) = \boldsymbol{\alpha}^\top \mathbf{1} - \frac{1}{2}\boldsymbol{\alpha}^\top \mathbf{Q} \boldsymbol{\alpha}$$
where $Q_{ij} = y_i y_j \mathbf{x}_i^\top\mathbf{x}_j$.
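A minimal sketch of the dual objective in both the double-sum and matrix forms (toy data and names are illustrative):

```python
import numpy as np

def dual_objective(alpha, X, y):
    """g(alpha) = 1^T alpha - 1/2 alpha^T Q alpha with Q_ij = y_i y_j x_i^T x_j."""
    Q = (y[:, None] * y) * (X @ X.T)
    return alpha.sum() - 0.5 * alpha @ Q @ alpha

rng = np.random.default_rng(0)
X, y = rng.normal(size=(4, 2)), np.array([1.0, -1.0, 1.0, -1.0])
alpha = rng.uniform(size=4)

n = len(y)
double_sum = alpha.sum() - 0.5 * sum(
    alpha[i] * alpha[j] * y[i] * y[j] * (X[i] @ X[j])
    for i in range(n) for j in range(n))
assert np.isclose(dual_objective(alpha, X, y), double_sum)
```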
We can now state the complete dual optimization problem for hard margin SVM.
$$\begin{aligned} \max_{\boldsymbol{\alpha}} \quad & \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i \alpha_j y_i y_j \mathbf{x}_i^\top\mathbf{x}_j \\ \text{subject to} \quad & \alpha_i \geq 0, \quad i = 1,\ldots,n \\ & \sum_{i=1}^n \alpha_i y_i = 0 \end{aligned}$$
Structure of the Dual:
| Component | Expression | Nature |
|---|---|---|
| Objective (linear part) | $\sum_i \alpha_i$ | Rewards assigning weight to examples |
| Objective (quadratic part) | $-\frac{1}{2}\sum_{i,j} \alpha_i\alpha_j y_i y_j \mathbf{x}_i^\top\mathbf{x}_j$ | Penalizes similar examples having large weights |
| Nonnegativity constraints | $\alpha_i \geq 0$ | Dual feasibility |
| Equality constraint | $\sum_i \alpha_i y_i = 0$ | From stationarity wrt $b$ |
Converting to Minimization:
QP solvers typically minimize. Converting:
$$\begin{aligned} \min_{\boldsymbol{\alpha}} \quad & \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^\top\mathbf{x}_j - \sum_i \alpha_i \\ \text{s.t.} \quad & \alpha_i \geq 0, \quad \sum_i \alpha_i y_i = 0 \end{aligned}$$
Why This Is a QP:
The objective is quadratic in $\boldsymbol{\alpha}$ with positive semidefinite matrix $\mathbf{Q}$ (where $Q_{ij} = y_i y_j \mathbf{x}_i^\top\mathbf{x}_j$), so the minimization form is convex, and the constraints are $n$ linear inequalities plus one linear equality. This is exactly the standard QP form.
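For illustration, the dual can be handed to a general-purpose solver; the sketch below uses scipy's SLSQP on a tiny hypothetical dataset (dedicated QP solvers or SMO are what real SVM libraries use, and all names here are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def solve_hard_margin_dual(X, y):
    """min_alpha 1/2 alpha^T Q alpha - 1^T alpha  s.t.  alpha >= 0, y^T alpha = 0."""
    n = len(y)
    Q = (y[:, None] * y) * (X @ X.T)
    objective = lambda a: 0.5 * a @ Q @ a - a.sum()
    gradient = lambda a: Q @ a - np.ones(n)
    res = minimize(objective, x0=np.zeros(n), jac=gradient, method="SLSQP",
                   bounds=[(0.0, None)] * n,
                   constraints=[{"type": "eq", "fun": lambda a: a @ y}])
    return res.x

# Tiny linearly separable example
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha_star = solve_hard_margin_dual(X, y)
```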
The matrix $\mathbf{Q}$ with entries $Q_{ij} = y_i y_j \mathbf{x}_i^\top\mathbf{x}_j$ is central to the dual problem. Understanding its structure reveals important properties.
Decomposition of Q:
$$\mathbf{Q} = \mathbf{D}_y \mathbf{K} \mathbf{D}_y$$
where $\mathbf{D}_y = \operatorname{diag}(y_1, \ldots, y_n)$ is the diagonal matrix of labels and $\mathbf{K}$ is the Gram matrix with entries $K_{ij} = \mathbf{x}_i^\top\mathbf{x}_j$.
Since $\mathbf{D}_y^2 = \mathbf{I}$ (because $y_i^2 = 1$), we have $\mathbf{D}_y = \mathbf{D}_y^{-1}$.
Positive Semidefiniteness:
$\mathbf{K}$ is the Gram matrix of the data, hence positive semidefinite (eigenvalues $\geq 0$). Conjugation by $\mathbf{D}_y$ preserves eigenvalues since $\mathbf{D}_y$ is orthogonal (with $\pm 1$ entries):
$$\mathbf{Q} = \mathbf{D}_y \mathbf{K} \mathbf{D}_y \succeq 0$$
When Is Q Positive Definite?
$\mathbf{Q} \succ 0$ (strictly positive definite) if and only if $\mathbf{K} \succ 0$, which requires the data points $\mathbf{x}_1, \ldots, \mathbf{x}_n$ to be linearly independent; in particular, this is only possible when $n \leq d$.
When $n > d$, $\mathbf{K}$ is rank-deficient, so $\mathbf{Q}$ is only PSD.
Replace $\mathbf{x}_i^\top\mathbf{x}_j$ with any kernel $k(\mathbf{x}_i, \mathbf{x}_j)$, and the entire derivation still holds! The dual only needs the kernel matrix $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$, not the explicit features. This is the kernel trick.
Computing Q Efficiently:
$$\mathbf{Q} = (\mathbf{y} \odot \mathbf{X})(\mathbf{y} \odot \mathbf{X})^\top$$
where $\mathbf{X} \in \mathbb{R}^{n \times d}$ is the data matrix and $\mathbf{y} \odot \mathbf{X}$ broadcasts $y_i$ across row $i$.
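In NumPy this is a two-line construction (a sketch; names illustrative), and a quick eigenvalue check confirms positive semidefiniteness:

```python
import numpy as np

def build_Q(X, y):
    """Q = (y * X)(y * X)^T, i.e. Q_ij = y_i y_j x_i^T x_j, without forming K first."""
    Yx = y[:, None] * X                  # scale row i of X by y_i
    return Yx @ Yx.T

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
y = np.where(rng.normal(size=6) > 0, 1.0, -1.0)
Q = build_Q(X, y)

assert np.allclose(Q, (y[:, None] * y) * (X @ X.T))   # matches D_y K D_y
assert np.linalg.eigvalsh(Q).min() > -1e-10           # Q is positive semidefinite
```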
Practical Considerations:
| Aspect | Consideration |
|---|---|
| Storage | $\mathbf{Q}$ is $n \times n$, requires $O(n^2)$ memory |
| Computation | $O(n^2 d)$ to compute full $\mathbf{Q}$ |
| Large $n$ | May not fit in memory; use iterative methods |
| SMO | Computes $Q_{ij}$ on demand, avoiding full storage |
| Caching | Frequently accessed $Q_{ij}$ cached for efficiency |
Once we solve the dual and obtain $\boldsymbol{\alpha}^*$, we need to recover the primal variables $(\mathbf{w}^*, b^*)$ for prediction.
Recovering $\mathbf{w}^*$:
Direct from the stationarity condition:
$$\mathbf{w}^* = \sum_{i=1}^n \alpha_i^* y_i \mathbf{x}_i = \sum_{i: \alpha_i^* > 0} \alpha_i^* y_i \mathbf{x}_i$$
In practice, only support vectors (where $\alpha_i^* > 0$) contribute.
Recovering $b^*$:
Use complementary slackness. For any support vector $k$ (where $\alpha_k^* > 0$):
$$y_k(\mathbf{w}^{*\top}\mathbf{x}_k + b^*) = 1$$
Solving for $b^*$:
$$b^* = y_k - \mathbf{w}^{*\top}\mathbf{x}_k = y_k - \sum_{i=1}^n \alpha_i^* y_i \mathbf{x}_i^\top\mathbf{x}_k$$
Numerical Stability:
Any single support vector yields a valid $b^*$, but for numerical stability, average over all support vectors:
$$b^* = \frac{1}{|SV|} \sum_{k \in SV} \left( y_k - \sum_{i \in SV} \alpha_i^* y_i \mathbf{x}_i^\top\mathbf{x}_k \right)$$
For a new point $\mathbf{x}$: $$f(\mathbf{x}) = \mathbf{w}^{*\top}\mathbf{x} + b^* = \sum_{i \in SV} \alpha_i^* y_i (\mathbf{x}_i^\top \mathbf{x}) + b^*$$ $$\hat{y} = \text{sign}(f(\mathbf{x}))$$
Note: Only inner products $\mathbf{x}_i^\top\mathbf{x}$ are needed—no explicit $\mathbf{w}$ required!
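Putting recovery and prediction together (a minimal sketch; `alpha` is assumed to come from a dual solver such as the one sketched earlier, and the support-vector tolerance is an illustrative choice):

```python
import numpy as np

def recover_primal(alpha, X, y, tol=1e-8):
    """Recover (w*, b*) from the dual solution; support vectors have alpha_i > tol."""
    sv = alpha > tol
    w = (alpha[sv] * y[sv]) @ X[sv]            # w* = sum over support vectors
    b = float(np.mean(y[sv] - X[sv] @ w))      # average b* over support vectors
    return w, b, sv

def predict(x_new, alpha, X, y, b, tol=1e-8):
    """Decision function uses only inner products x_i^T x with support vectors."""
    sv = alpha > tol
    f = (alpha[sv] * y[sv]) @ (X[sv] @ x_new) + b
    return np.sign(f)
```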
Summary of Recovery:
| Quantity | Formula | Notes |
|---|---|---|
| $\mathbf{w}^*$ | $\sum_{i \in SV} \alpha_i^* y_i \mathbf{x}_i$ | Sum over support vectors |
| $b^*$ | $y_k - \mathbf{w}^{*\top}\mathbf{x}_k$ for any SV $k$ | Average over SVs for stability |
| $f(\mathbf{x})$ | $\sum_{i \in SV} \alpha_i^* y_i (\mathbf{x}_i^\top\mathbf{x}) + b^*$ | Only inner products needed |
Verification:
After recovery, verify: (1) every support vector satisfies $y_k(\mathbf{w}^{*\top}\mathbf{x}_k + b^*) = 1$ up to numerical tolerance, (2) every other training point satisfies $y_i(\mathbf{w}^{*\top}\mathbf{x}_i + b^*) \geq 1$, and (3) the dual constraints $\alpha_i^* \geq 0$ and $\sum_i \alpha_i^* y_i = 0$ hold.
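These checks translate into a few assertions (a sketch continuing the illustrative names above; tolerances depend on the solver's precision):

```python
import numpy as np

def verify_solution(alpha, X, y, w, b, tol=1e-6):
    margins = y * (X @ w + b)
    sv = alpha > tol
    assert np.all(margins >= 1 - tol)                  # primal feasibility
    assert np.allclose(margins[sv], 1.0, atol=1e-3)    # support vectors on the margin
    assert abs(alpha @ y) < tol                        # sum_i alpha_i y_i = 0
    assert np.all(alpha >= -tol)                       # alpha_i >= 0
```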
The dual problem has a beautiful geometric interpretation that illuminates why SVMs work.
The Dual Objective's Structure:
$$g(\boldsymbol{\alpha}) = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j (\mathbf{x}_i^\top\mathbf{x}_j)$$
Intuition:
The optimal solution balances two competing effects: the linear term $\sum_i \alpha_i$ rewards putting weight on examples, while the quadratic term penalizes placing large weight on pairs of examples that are similar (large $\mathbf{x}_i^\top\mathbf{x}_j$) and carry the same label. The balance concentrates weight on the few examples closest to the opposite class, the support vectors.
Points that lie on the margin boundary. These are the 'critical' examples that define the decision boundary. Interior points (well within their class region) have $\alpha_i = 0$ because they're not needed to define the boundary. Only the points right at the margin 'support' the hyperplane.
Margin as a Force Balance:
Recall $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$. Each support vector exerts a 'force' on the decision boundary: positive-class support vectors pull $\mathbf{w}$ toward $+\mathbf{x}_i$, negative-class ones push it toward $-\mathbf{x}_i$, and the constraint $\sum_i \alpha_i y_i = 0$ says these forces are in equilibrium.
The $\alpha_i$ values measure the 'strength' of each support vector's contribution.
Dual Feasible Region:
$$\{\boldsymbol{\alpha} : \alpha_i \geq 0, \ \textstyle\sum_i \alpha_i y_i = 0\}$$
This is a polyhedron in $\mathbb{R}^n$: the intersection of the nonnegative orthant $\{\boldsymbol{\alpha} : \alpha_i \geq 0\}$ with the hyperplane $\{\boldsymbol{\alpha} : \sum_i \alpha_i y_i = 0\}$ through the origin.
The dual optimum lies somewhere in this polyhedron; because the objective is quadratic rather than linear, it need not sit at a vertex, and its location determines which $\alpha_i$ are nonzero, i.e., which points become support vectors.
We have derived the complete dual formulation of hard margin SVM through the Lagrangian method. This derivation is the cornerstone of SVM theory.
The next page provides detailed examination of the dual problem itself—its structure, properties, and why it's often preferred over the primal. We'll see how the dual formulation enables kernel SVMs and connect to the complementary slackness conditions that define support vectors.
The Derivation Journey:
$$\text{Primal QP} \xrightarrow{\text{Lagrangian}} \text{Minimax} \xrightarrow{\text{Stationarity}} \text{Eliminate } \mathbf{w}, b \xrightarrow{\text{Substitute}} \text{Dual QP}$$
This Lagrangian derivation is a mathematical gem—transforming a constrained problem in $(\mathbf{w}, b)$ space into an equivalent problem in $\boldsymbol{\alpha}$ space. The dual reveals structure hidden in the primal: data appears only through inner products, sparsity emerges naturally, and the path to kernels becomes clear.