We have formulated hard margin SVM as a convex quadratic program and understood how general QP solvers approach such problems. Now we take a different path—one that reveals deep structure and ultimately enables the famous kernel trick.
The Lagrangian method transforms constrained optimization into unconstrained optimization by incorporating constraints into the objective via Lagrange multipliers. For convex problems, this leads to duality theory: the primal and dual problems are equivalent, providing two perspectives on the same solution.
This page develops the complete Lagrangian derivation of hard margin SVM. We will construct the Lagrangian, derive the conditions for optimality, and see how minimizing over primal variables yields an elegant dual formulation.
By the end of this page, you will: (1) understand the Lagrangian function and its role in constrained optimization, (2) construct the Lagrangian for hard margin SVM, (3) derive the stationarity conditions, (4) perform the minimization to eliminate primal variables, and (5) obtain the dual objective function.
Before diving into SVM specifics, let's understand the general framework of Lagrangian duality.
The General Constrained Problem:
$$\begin{aligned} \min_{\mathbf{x}} \quad & f(\mathbf{x}) \\ \text{s.t.} \quad & g_i(\mathbf{x}) \leq 0, \quad i = 1,\ldots,m \\ & h_j(\mathbf{x}) = 0, \quad j = 1,\ldots,p \end{aligned}$$
This is the primal problem. Let $p^*$ denote its optimal value.
The Lagrangian Function:
The Lagrangian incorporates constraints into the objective:
$$\mathcal{L}(\mathbf{x}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = f(\mathbf{x}) + \sum_{i=1}^m \alpha_i g_i(\mathbf{x}) + \sum_{j=1}^p \beta_j h_j(\mathbf{x})$$
where $\alpha_i \geq 0$ are the Lagrange multipliers for the inequality constraints $g_i(\mathbf{x}) \leq 0$, and $\beta_j \in \mathbb{R}$ are the multipliers for the equality constraints $h_j(\mathbf{x}) = 0$.
Lagrange multipliers act as "prices" for constraint violations. When $\alpha_i > 0$, any violation of constraint $g_i \leq 0$ increases the Lagrangian, penalizing infeasibility. When $\alpha_i = 0$, the constraint doesn't affect the objective—it's slack. This mirrors the economics interpretation where multipliers represent shadow prices.
The Dual Function:
The Lagrangian dual function minimizes the Lagrangian over primal variables:
$$g(\boldsymbol{\alpha}, \boldsymbol{\beta}) = \inf_{\mathbf{x}} \mathcal{L}(\mathbf{x}, \boldsymbol{\alpha}, \boldsymbol{\beta})$$
This function is always concave (as a pointwise infimum of functions that are affine in $(\boldsymbol{\alpha}, \boldsymbol{\beta})$), regardless of whether the primal is convex.
The Dual Problem:
$$\begin{aligned} \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}} \quad & g(\boldsymbol{\alpha}, \boldsymbol{\beta}) \\ \text{s.t.} \quad & \alpha_i \geq 0, \quad i = 1,\ldots,m \end{aligned}$$
Let $d^*$ denote its optimal value.
Weak Duality:
$$d^* \leq p^*$$
The dual optimal value never exceeds the primal optimal value. This holds for any optimization problem.
Strong Duality:
For convex problems satisfying constraint qualifications (e.g., Slater's condition):
$$d^* = p^*$$
The primal and dual have the same optimal value. This is the remarkable result that makes duality powerful.
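A quick toy illustration (not part of the SVM derivation) makes the machinery concrete. Consider $\min_x x^2$ subject to $1 - x \leq 0$, whose optimum is $p^* = 1$ at $x = 1$. The Lagrangian is $\mathcal{L}(x, \alpha) = x^2 + \alpha(1 - x)$; minimizing over $x$ gives $x = \alpha/2$, so the dual function is $g(\alpha) = \alpha - \alpha^2/4$. Maximizing over $\alpha \geq 0$ yields $\alpha^* = 2$ and $d^* = 1 = p^*$: strong duality holds, as expected for a convex problem with a strictly feasible point (e.g., $x = 2$).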
Now we apply the Lagrangian framework to the hard margin SVM primal.
The Primal Problem (restated):
$$\begin{aligned} \min_{\mathbf{w}, b} \quad & \frac{1}{2}\|\mathbf{w}\|^2 \\ \text{s.t.} \quad & 1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b) \leq 0, \quad i = 1,\ldots,n \end{aligned}$$
Note: We've rewritten $y_i(\mathbf{w}^\top\mathbf{x}_i + b) \geq 1$ as $1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b) \leq 0$ to match the standard form $g_i(\mathbf{x}) \leq 0$.
Introduce Lagrange Multipliers:
For each constraint $i = 1,\ldots,n$, we introduce a multiplier $\alpha_i \geq 0$.
The SVM Lagrangian:
$$\mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^n \alpha_i \left[ y_i(\mathbf{w}^\top\mathbf{x}_i + b) - 1 \right]$$
where $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_n) \geq \mathbf{0}$ are the Lagrange multipliers.
Expanding the Lagrangian:
$$\mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\mathbf{w}^\top\mathbf{w} - \sum_{i=1}^n \alpha_i y_i \mathbf{w}^\top\mathbf{x}_i - b\sum_{i=1}^n \alpha_i y_i + \sum_{i=1}^n \alpha_i$$
Let's identify the components:
Why This Sign Convention?
We wrote the constraint as $g_i = 1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b) \leq 0$, so the Lagrangian is: $$f + \sum \alpha_i g_i = \frac{1}{2}\|\mathbf{w}\|^2 + \sum \alpha_i[1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b)]$$
Pulling a minus sign out of each bracket, $\alpha_i[1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b)] = -\alpha_i[y_i(\mathbf{w}^\top\mathbf{x}_i + b) - 1]$, gives the form above.
| Term | Expression | Type in $\mathbf{w}$ | Type in $b$ |
|---|---|---|---|
| Objective | $\frac{1}{2}\mathbf{w}^\top\mathbf{w}$ | Quadratic | Constant |
| Data coupling | $-\sum \alpha_i y_i \mathbf{w}^\top\mathbf{x}_i$ | Linear | Constant |
| Bias term | $-b\sum \alpha_i y_i$ | Constant | Linear |
| Constant | $\sum \alpha_i$ | Constant | Constant |
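As a concrete check of this expansion, here is a minimal NumPy sketch (the arrays `X`, `y`, `w`, `b`, `alpha` are illustrative placeholders, not anything defined in the text) that evaluates the Lagrangian in both its compact and expanded forms:

```python
import numpy as np

def svm_lagrangian(w, b, alpha, X, y):
    """Hard margin SVM Lagrangian L(w, b, alpha).

    X: (n, d) data, y: (n,) labels in {-1, +1},
    w: (d,) weights, b: scalar bias, alpha: (n,) multipliers >= 0.
    """
    margins = y * (X @ w + b)                      # y_i (w^T x_i + b)
    return 0.5 * (w @ w) - alpha @ (margins - 1)   # f + sum_i alpha_i g_i

# Check that the compact and expanded forms agree on random inputs
rng = np.random.default_rng(0)
X, y = rng.normal(size=(5, 2)), np.array([1.0, 1.0, -1.0, -1.0, 1.0])
w, b, alpha = rng.normal(size=2), 0.3, rng.uniform(size=5)

expanded = (0.5 * (w @ w) - (alpha * y) @ (X @ w)
            - b * (alpha * y).sum() + alpha.sum())
assert np.isclose(svm_lagrangian(w, b, alpha, X, y), expanded)
```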
The Lagrangian provides a minimax characterization of the primal problem.
Key Observation:
For any fixed $(\mathbf{w}, b)$:
$$\sup_{\boldsymbol{\alpha} \geq \mathbf{0}} \mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \begin{cases} \frac{1}{2}\|\mathbf{w}\|^2 & \text{if } y_i(\mathbf{w}^\top\mathbf{x}_i + b) \geq 1 \ \forall i \\ +\infty & \text{otherwise} \end{cases}$$
Proof:
If $(\mathbf{w}, b)$ is feasible: every bracket $y_i(\mathbf{w}^\top\mathbf{x}_i + b) - 1 \geq 0$, so each term $-\alpha_i[y_i(\mathbf{w}^\top\mathbf{x}_i + b) - 1]$ is nonpositive. The supremum is attained at $\boldsymbol{\alpha} = \mathbf{0}$, leaving exactly $\frac{1}{2}\|\mathbf{w}\|^2$.
If $(\mathbf{w}, b)$ violates constraint $j$: $y_j(\mathbf{w}^\top\mathbf{x}_j + b) - 1 < 0$, so the term $-\alpha_j[y_j(\mathbf{w}^\top\mathbf{x}_j + b) - 1]$ grows without bound as $\alpha_j \to \infty$, and the supremum is $+\infty$.
The primal problem is equivalent to: $$p^* = \min_{\mathbf{w}, b} \sup_{\boldsymbol{\alpha} \geq \mathbf{0}} \mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha})$$
The inner supremum enforces feasibility (infinite penalty for violations), while the outer minimization finds the optimal feasible point.
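The penalty mechanism is easy to see numerically; here is a small sketch with a hypothetical two-point dataset (all names illustrative):

```python
import numpy as np

X = np.array([[2.0], [-2.0]])        # one 1-D point per class
y = np.array([1.0, -1.0])

def L(w, b, alpha):
    return 0.5 * (w @ w) - alpha @ (y * (X @ w + b) - 1)

w_feasible = np.array([1.0])         # margins are 2 >= 1: constraints satisfied
w_violating = np.array([0.1])        # margins are 0.2 < 1: constraints violated
for a in [0.0, 1.0, 10.0, 100.0]:
    alpha = np.full(2, a)
    # Feasible point: L decreases in alpha, so the sup is at alpha = 0 (value 0.5).
    # Violating point: L grows without bound as alpha increases.
    print(a, L(w_feasible, 0.0, alpha), L(w_violating, 0.0, alpha))
```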
The Dual Problem:
Swapping the order of min and max:
$$d^* = \max_{\boldsymbol{\alpha} \geq \mathbf{0}} \min_{\mathbf{w}, b} \mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha})$$
This is the dual problem. By weak duality: $d^* \leq p^*$.
Strong Duality for SVM:
The SVM primal is convex and satisfies Slater's condition (there exists a strictly feasible point when data is linearly separable). Therefore:
$$d^* = p^*$$
We can solve either problem to find the optimal solution!
Why Solve the Dual?
The dual depends on the data only through inner products $\mathbf{x}_i^\top\mathbf{x}_j$ (which is what makes the kernel trick possible), its solution is typically sparse (most $\alpha_i$ end up zero), and its constraint set (nonnegativity plus a single equality) is simpler than the primal's $n$ margin inequalities. The next page examines these advantages in detail; here we carry out the inner minimization.
To find $\min_{\mathbf{w}, b} \mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha})$, we set the partial derivatives to zero.
Derivative with Respect to $\mathbf{w}$:
$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^n \alpha_i y_i \mathbf{x}_i = \mathbf{0}$$
Solving for $\mathbf{w}$:
$$\mathbf{w} = \sum_{i=1}^n \alpha_i y_i \mathbf{x}_i$$
The optimal weight vector is a linear combination of the training points, weighted by $\alpha_i y_i$.
Derivative with Respect to $b$:
$$\frac{\partial \mathcal{L}}{\partial b} = -\sum_{i=1}^n \alpha_i y_i = 0$$
$$\sum_{i=1}^n \alpha_i y_i = 0$$
The weighted sum of labels must be zero. This is an equality constraint on the dual variables.
Physical Interpretation:
The condition $\mathbf{w} = \sum_{i=1}^n \alpha_i y_i \mathbf{x}_i$ has profound implications:
Representer Theorem: The optimal $\mathbf{w}$ lies in the span of the training data. We don't need to search over all possible weight vectors—just linear combinations of data points.
Support Vector Sparsity: Since $\alpha_i = 0$ for non-support-vectors (by complementary slackness), $\mathbf{w}$ is a linear combination of only the support vectors: $$\mathbf{w} = \sum_{i \in SV} \alpha_i y_i \mathbf{x}_i$$
Class Balance: The condition $\sum \alpha_i y_i = 0$ means positive and negative class contributions balance. Since $y_i \in \{-1, +1\}$: $$\sum_{i: y_i = +1} \alpha_i = \sum_{i: y_i = -1} \alpha_i$$
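In code, the two stationarity conditions are one line each (a minimal sketch; the names are illustrative):

```python
import numpy as np

def stationarity(alpha, X, y):
    """Evaluate the two stationarity quantities for a candidate alpha."""
    w = (alpha * y) @ X         # w = sum_i alpha_i y_i x_i   (shape (d,))
    balance = alpha @ y         # must equal 0 at any dual-feasible alpha
    return w, balance

# The balance condition is exactly the class-balance statement above:
# alpha[y == 1].sum() == alpha[y == -1].sum() whenever alpha @ y == 0.
```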
With the stationarity conditions in hand, we substitute back into the Lagrangian to eliminate $\mathbf{w}$ and $b$, obtaining the dual objective.
Starting Point:
$$\mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\mathbf{w}^\top\mathbf{w} - \sum_{i=1}^n \alpha_i y_i \mathbf{w}^\top\mathbf{x}_i - b\sum_{i=1}^n \alpha_i y_i + \sum_{i=1}^n \alpha_i$$
Substituting $\mathbf{w} = \sum_j \alpha_j y_j \mathbf{x}_j$:
Term 1: $\frac{1}{2}\mathbf{w}^\top\mathbf{w}$
$$\frac{1}{2}\mathbf{w}^\top\mathbf{w} = \frac{1}{2}\left(\sum_i \alpha_i y_i \mathbf{x}_i\right)^\top \left(\sum_j \alpha_j y_j \mathbf{x}_j\right)$$
$$= \frac{1}{2}\sum_i \sum_j \alpha_i \alpha_j y_i y_j \mathbf{x}_i^\top \mathbf{x}_j$$
Term 2: $-\sum_i \alpha_i y_i \mathbf{w}^\top\mathbf{x}_i$
$$-\sum_i \alpha_i y_i \mathbf{w}^\top\mathbf{x}_i = -\sum_i \alpha_i y_i \left(\sum_j \alpha_j y_j \mathbf{x}_j\right)^\top \mathbf{x}_i$$
$$= -\sum_i \sum_j \alpha_i \alpha_j y_i y_j \mathbf{x}_j^\top \mathbf{x}_i = -\sum_i \sum_j \alpha_i \alpha_j y_i y_j \mathbf{x}_i^\top \mathbf{x}_j$$
Term 3: $-b\sum_i \alpha_i y_i$
By the second stationarity condition, $\sum_i \alpha_i y_i = 0$, so: $$-b\sum_i \alpha_i y_i = 0$$
Term 4: $\sum_i \alpha_i$
Remains as is.
Notice that Term 2 is exactly $-2 \times$ Term 1! When we combine: $$\frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^\top\mathbf{x}_j - \sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^\top\mathbf{x}_j = -\frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^\top\mathbf{x}_j$$
Combining All Terms:
$$g(\boldsymbol{\alpha}) = \min_{\mathbf{w}, b} \mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i \alpha_j y_i y_j \mathbf{x}_i^\top\mathbf{x}_j$$
This is the dual objective function $g(\boldsymbol{\alpha})$.
Matrix Notation:
Defining the Gram matrix $\mathbf{K}$ with entries $K_{ij} = \mathbf{x}_i^\top\mathbf{x}_j$, and the vector $\mathbf{y} = (y_1, \ldots, y_n)^\top$:
$$g(\boldsymbol{\alpha}) = \boldsymbol{\alpha}^\top \mathbf{1} - \frac{1}{2}\boldsymbol{\alpha}^\top (\mathbf{y}\mathbf{y}^\top \odot \mathbf{K}) \boldsymbol{\alpha}$$
where $\odot$ denotes element-wise (Hadamard) product, or equivalently:
$$g(\boldsymbol{\alpha}) = \boldsymbol{\alpha}^\top \mathbf{1} - \frac{1}{2}\boldsymbol{\alpha}^\top \mathbf{Q} \boldsymbol{\alpha}$$
where $Q_{ij} = y_i y_j \mathbf{x}_i^\top\mathbf{x}_j$.
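A minimal sketch of the dual objective in both the double-sum and matrix forms (toy data and names are illustrative):

```python
import numpy as np

def dual_objective(alpha, X, y):
    """g(alpha) = 1^T alpha - 1/2 alpha^T Q alpha with Q_ij = y_i y_j x_i^T x_j."""
    Q = (y[:, None] * y) * (X @ X.T)
    return alpha.sum() - 0.5 * alpha @ Q @ alpha

rng = np.random.default_rng(0)
X, y = rng.normal(size=(4, 2)), np.array([1.0, -1.0, 1.0, -1.0])
alpha = rng.uniform(size=4)

n = len(y)
double_sum = alpha.sum() - 0.5 * sum(
    alpha[i] * alpha[j] * y[i] * y[j] * (X[i] @ X[j])
    for i in range(n) for j in range(n))
assert np.isclose(dual_objective(alpha, X, y), double_sum)
```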
We can now state the complete dual optimization problem for hard margin SVM.
$$\begin{aligned} \max_{\boldsymbol{\alpha}} \quad & \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i \alpha_j y_i y_j \mathbf{x}_i^\top\mathbf{x}_j \\ \text{subject to} \quad & \alpha_i \geq 0, \quad i = 1,\ldots,n \\ & \sum_{i=1}^n \alpha_i y_i = 0 \end{aligned}$$
Structure of the Dual:
| Component | Expression | Nature |
|---|---|---|
| Objective (linear part) | $\sum_i \alpha_i$ | Rewards assigning weight to examples |
| Objective (quadratic part) | $-\frac{1}{2}\sum_{i,j} \alpha_i\alpha_j y_i y_j \mathbf{x}_i^\top\mathbf{x}_j$ | Penalizes similar examples having large weights |
| Nonnegativity constraints | $\alpha_i \geq 0$ | Dual feasibility |
| Equality constraint | $\sum_i \alpha_i y_i = 0$ | From stationarity wrt $b$ |
Converting to Minimization:
QP solvers typically minimize. Converting:
$$\begin{aligned} \min_{\boldsymbol{\alpha}} \quad & \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^\top\mathbf{x}_j - \sum_i \alpha_i \\ \text{s.t.} \quad & \alpha_i \geq 0, \quad \sum_i \alpha_i y_i = 0 \end{aligned}$$
Why This Is a QP:
The objective is quadratic in $\boldsymbol{\alpha}$ with positive semidefinite matrix $\mathbf{Q}$ (where $Q_{ij} = y_i y_j \mathbf{x}_i^\top\mathbf{x}_j$), so the minimization form is convex, and the constraints are $n$ linear inequalities plus one linear equality. This is exactly the standard QP form.
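For illustration, the dual can be handed to a general-purpose solver; the sketch below uses scipy's SLSQP on a tiny hypothetical dataset (dedicated QP solvers or SMO are what real SVM libraries use, and all names here are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def solve_hard_margin_dual(X, y):
    """min_alpha 1/2 alpha^T Q alpha - 1^T alpha  s.t.  alpha >= 0, y^T alpha = 0."""
    n = len(y)
    Q = (y[:, None] * y) * (X @ X.T)
    objective = lambda a: 0.5 * a @ Q @ a - a.sum()
    gradient = lambda a: Q @ a - np.ones(n)
    res = minimize(objective, x0=np.zeros(n), jac=gradient, method="SLSQP",
                   bounds=[(0.0, None)] * n,
                   constraints=[{"type": "eq", "fun": lambda a: a @ y}])
    return res.x

# Tiny linearly separable example
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha_star = solve_hard_margin_dual(X, y)
```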
The matrix $\mathbf{Q}$ with entries $Q_{ij} = y_i y_j \mathbf{x}_i^\top\mathbf{x}_j$ is central to the dual problem. Understanding its structure reveals important properties.
Decomposition of Q:
$$\mathbf{Q} = \mathbf{D}_y \mathbf{K} \mathbf{D}_y$$
where $\mathbf{D}_y = \operatorname{diag}(y_1, \ldots, y_n)$ is the diagonal matrix of labels and $\mathbf{K}$ is the Gram matrix with entries $K_{ij} = \mathbf{x}_i^\top\mathbf{x}_j$.
Since $\mathbf{D}_y^2 = \mathbf{I}$ (because $y_i^2 = 1$), we have $\mathbf{D}_y = \mathbf{D}_y^{-1}$.
Positive Semidefiniteness:
$\mathbf{K}$ is the Gram matrix of the data, hence positive semidefinite (eigenvalues $\geq 0$). Conjugation by $\mathbf{D}_y$ preserves eigenvalues since $\mathbf{D}_y$ is orthogonal (with $\pm 1$ entries):
$$\mathbf{Q} = \mathbf{D}_y \mathbf{K} \mathbf{D}_y \succeq 0$$
When Is Q Positive Definite?
$\mathbf{Q} \succ 0$ (strictly positive definite) if and only if $\mathbf{K} \succ 0$, which requires the data points $\mathbf{x}_1, \ldots, \mathbf{x}_n$ to be linearly independent; in particular, this is only possible when $n \leq d$.
When $n > d$, $\mathbf{K}$ is rank-deficient, so $\mathbf{Q}$ is only PSD.
Replace $\mathbf{x}_i^\top\mathbf{x}_j$ with any kernel $k(\mathbf{x}_i, \mathbf{x}_j)$, and the entire derivation still holds! The dual only needs the kernel matrix $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$, not the explicit features. This is the kernel trick.
Computing Q Efficiently:
$$\mathbf{Q} = (\mathbf{y} \odot \mathbf{X})(\mathbf{y} \odot \mathbf{X})^\top$$
where $\mathbf{X} \in \mathbb{R}^{n \times d}$ is the data matrix and $\mathbf{y} \odot \mathbf{X}$ broadcasts $y_i$ across row $i$.
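In NumPy this is a two-line construction (a sketch; names illustrative), and a quick eigenvalue check confirms positive semidefiniteness:

```python
import numpy as np

def build_Q(X, y):
    """Q = (y * X)(y * X)^T, i.e. Q_ij = y_i y_j x_i^T x_j, without forming K first."""
    Yx = y[:, None] * X                  # scale row i of X by y_i
    return Yx @ Yx.T

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
y = np.where(rng.normal(size=6) > 0, 1.0, -1.0)
Q = build_Q(X, y)

assert np.allclose(Q, (y[:, None] * y) * (X @ X.T))   # matches D_y K D_y
assert np.linalg.eigvalsh(Q).min() > -1e-10           # Q is positive semidefinite
```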
Practical Considerations:
| Aspect | Consideration |
|---|---|
| Storage | $\mathbf{Q}$ is $n \times n$, requires $O(n^2)$ memory |
| Computation | $O(n^2 d)$ to compute full $\mathbf{Q}$ |
| Large $n$ | May not fit in memory; use iterative methods |
| SMO | Computes $Q_{ij}$ on demand, avoiding full storage |
| Caching | Frequently accessed $Q_{ij}$ cached for efficiency |
Once we solve the dual and obtain $\boldsymbol{\alpha}^*$, we need to recover the primal variables $(\mathbf{w}^*, b^*)$ for prediction.
Recovering $\mathbf{w}^*$:
Direct from the stationarity condition:
$$\mathbf{w}^* = \sum_{i=1}^n \alpha_i^* y_i \mathbf{x}_i = \sum_{i: \alpha_i^* > 0} \alpha_i^* y_i \mathbf{x}_i$$
In practice, only support vectors (where $\alpha_i^* > 0$) contribute.
Recovering $b^*$:
Use complementary slackness. For any support vector $k$ (where $\alpha_k^* > 0$):
$$y_k(\mathbf{w}^{*\top}\mathbf{x}_k + b^*) = 1$$
Solving for $b^*$:
$$b^* = y_k - \mathbf{w}^{*\top}\mathbf{x}_k = y_k - \sum_{i=1}^n \alpha_i^* y_i \mathbf{x}_i^\top\mathbf{x}_k$$
Numerical Stability:
Any single support vector yields a valid $b^*$, but for numerical stability, average over all support vectors:
$$b^* = \frac{1}{|SV|} \sum_{k \in SV} \left( y_k - \sum_{i \in SV} \alpha_i^* y_i \mathbf{x}_i^\top\mathbf{x}_k \right)$$
For a new point $\mathbf{x}$: $$f(\mathbf{x}) = \mathbf{w}^{*\top}\mathbf{x} + b^* = \sum_{i \in SV} \alpha_i^* y_i (\mathbf{x}_i^\top \mathbf{x}) + b^*$$ $$\hat{y} = \text{sign}(f(\mathbf{x}))$$
Note: Only inner products $\mathbf{x}_i^\top\mathbf{x}$ are needed—no explicit $\mathbf{w}$ required!
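Putting recovery and prediction together (a minimal sketch; `alpha` is assumed to come from a dual solver such as the one sketched earlier, and the support-vector tolerance is an illustrative choice):

```python
import numpy as np

def recover_primal(alpha, X, y, tol=1e-8):
    """Recover (w*, b*) from the dual solution; support vectors have alpha_i > tol."""
    sv = alpha > tol
    w = (alpha[sv] * y[sv]) @ X[sv]            # w* = sum over support vectors
    b = float(np.mean(y[sv] - X[sv] @ w))      # average b* over support vectors
    return w, b, sv

def predict(x_new, alpha, X, y, b, tol=1e-8):
    """Decision function uses only inner products x_i^T x with support vectors."""
    sv = alpha > tol
    f = (alpha[sv] * y[sv]) @ (X[sv] @ x_new) + b
    return np.sign(f)
```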
Summary of Recovery:
| Quantity | Formula | Notes |
|---|---|---|
| $\mathbf{w}^*$ | $\sum_{i \in SV} \alpha_i^* y_i \mathbf{x}_i$ | Sum over support vectors |
| $b^*$ | $y_k - \mathbf{w}^{*\top}\mathbf{x}_k$ for any SV $k$ | Average over SVs for stability |
| $f(\mathbf{x})$ | $\sum_{i \in SV} \alpha_i^* y_i (\mathbf{x}_i^\top\mathbf{x}) + b^*$ | Only inner products needed |
Verification:
After recovery, verify: (1) every support vector satisfies $y_k(\mathbf{w}^{*\top}\mathbf{x}_k + b^*) = 1$ up to numerical tolerance, (2) every other training point satisfies $y_i(\mathbf{w}^{*\top}\mathbf{x}_i + b^*) \geq 1$, and (3) the dual constraints $\alpha_i^* \geq 0$ and $\sum_i \alpha_i^* y_i = 0$ hold.
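These checks translate into a few assertions (a sketch continuing the illustrative names above; tolerances depend on the solver's precision):

```python
import numpy as np

def verify_solution(alpha, X, y, w, b, tol=1e-6):
    margins = y * (X @ w + b)
    sv = alpha > tol
    assert np.all(margins >= 1 - tol)                  # primal feasibility
    assert np.allclose(margins[sv], 1.0, atol=1e-3)    # support vectors on the margin
    assert abs(alpha @ y) < tol                        # sum_i alpha_i y_i = 0
    assert np.all(alpha >= -tol)                       # alpha_i >= 0
```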
The dual problem has a beautiful geometric interpretation that illuminates why SVMs work.
The Dual Objective's Structure:
$$g(\boldsymbol{\alpha}) = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j (\mathbf{x}_i^\top\mathbf{x}_j)$$
Intuition:
The optimal solution balances two competing effects: the linear term $\sum_i \alpha_i$ rewards putting weight on examples, while the quadratic term penalizes placing large weight on pairs of examples that are similar (large $\mathbf{x}_i^\top\mathbf{x}_j$) and carry the same label. The balance concentrates weight on the few examples closest to the opposite class, the support vectors.
Points that lie on the margin boundary. These are the 'critical' examples that define the decision boundary. Interior points (well within their class region) have $\alpha_i = 0$ because they're not needed to define the boundary. Only the points right at the margin 'support' the hyperplane.
Margin as a Force Balance:
Recall $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$. Each support vector exerts a 'force' on the decision boundary: positive-class support vectors pull $\mathbf{w}$ toward $+\mathbf{x}_i$, negative-class ones push it toward $-\mathbf{x}_i$, and the constraint $\sum_i \alpha_i y_i = 0$ says these forces are in equilibrium.
The $\alpha_i$ values measure the 'strength' of each support vector's contribution.
Dual Feasible Region:
$$\{\boldsymbol{\alpha} : \alpha_i \geq 0, \ \textstyle\sum_i \alpha_i y_i = 0\}$$
This is a polyhedron in $\mathbb{R}^n$: the intersection of the nonnegative orthant $\{\boldsymbol{\alpha} : \alpha_i \geq 0\}$ with the hyperplane $\{\boldsymbol{\alpha} : \sum_i \alpha_i y_i = 0\}$ through the origin.
The dual optimum lies somewhere in this polyhedron; because the objective is quadratic rather than linear, it need not sit at a vertex, and its location determines which $\alpha_i$ are nonzero, i.e., which points become support vectors.
We have derived the complete dual formulation of hard margin SVM through the Lagrangian method. This derivation is the cornerstone of SVM theory.
The next page provides detailed examination of the dual problem itself—its structure, properties, and why it's often preferred over the primal. We'll see how the dual formulation enables kernel SVMs and connect to the complementary slackness conditions that define support vectors.
The Derivation Journey:
$$\text{Primal QP} \xrightarrow{\text{Lagrangian}} \text{Minimax} \xrightarrow{\text{Stationarity}} \text{Eliminate } \mathbf{w}, b \xrightarrow{\text{Substitute}} \text{Dual QP}$$
This Lagrangian derivation is a mathematical gem—transforming a constrained problem in $(\mathbf{w}, b)$ space into an equivalent problem in $\boldsymbol{\alpha}$ space. The dual reveals structure hidden in the primal: data appears only through inner products, sparsity emerges naturally, and the path to kernels becomes clear.