In the previous module, we developed the geometric intuition behind maximum margin classifiers. We saw that among all hyperplanes that correctly separate two classes, the maximum margin hyperplane achieves the greatest distance to the nearest data points—the support vectors. This margin provides a buffer zone that enhances generalization to unseen data.
Now we face a fundamental challenge: how do we actually find this optimal hyperplane? The answer lies in transforming our geometric intuition into a precise mathematical optimization problem. This page lays the complete mathematical foundation, taking you from first principles to the canonical hard margin SVM optimization formulation.
What makes this formulation remarkable is not just its elegance, but its computational tractability. We will see how a seemingly complex geometric problem reduces to a well-studied class of optimization problems with guaranteed solutions.
By the end of this page, you will understand: (1) the precise mathematical formulation of the maximum margin objective, (2) how margin maximization translates to a minimization problem, (3) why the constraints enforce correct classification, (4) the significance of the canonical form, and (5) why this formulation is computationally tractable.
Before deriving the optimization problem, we must establish precise notation and clearly state our assumptions. Mathematical rigor demands unambiguous definitions.
Training Data:
We have a training set of $n$ labeled examples:
$$\mathcal{D} = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\}$$
where $\mathbf{x}_i \in \mathbb{R}^d$ is the feature vector of the $i$-th example and $y_i \in \{-1, +1\}$ is its class label.
The Separating Hyperplane:
A hyperplane in $\mathbb{R}^d$ is defined by:
$$\mathbf{w}^\top \mathbf{x} + b = 0$$
where $\mathbf{w} \in \mathbb{R}^d$ is the normal vector to the hyperplane and $b \in \mathbb{R}$ is the bias (offset) term.
The hyperplane divides the space into two half-spaces: the positive half-space where $\mathbf{w}^\top \mathbf{x} + b > 0$ and the negative half-space where $\mathbf{w}^\top \mathbf{x} + b < 0$.
The choice of $y_i \in \{-1, +1\}$ rather than $\{0, 1\}$ is deliberate. It creates mathematical symmetry that simplifies derivations significantly. With this encoding, we can express correct classification elegantly: a point is correctly classified if and only if $y_i(\mathbf{w}^\top \mathbf{x}_i + b) > 0$. This single unified condition handles both classes without case splitting.
The Linear Separability Assumption:
For hard margin SVM, we assume the training data is linearly separable: there exists at least one hyperplane that correctly classifies all training points. Formally:
$$\exists\, \mathbf{w}, b \text{ such that } y_i(\mathbf{w}^\top \mathbf{x}_i + b) > 0 \quad \forall\, i = 1, \ldots, n$$
This assumption is fundamental to the hard margin formulation. When it fails (which happens frequently with real-world data), we must turn to the soft margin SVM, which we'll study in a later module.
Classification Rule:
Given a hyperplane defined by $(\mathbf{w}, b)$, we classify a new point $\mathbf{x}$ as:
$$\hat{y} = \text{sign}(\mathbf{w}^\top \mathbf{x} + b) = \begin{cases} +1 & \text{if } \mathbf{w}^\top \mathbf{x} + b \geq 0 \\ -1 & \text{if } \mathbf{w}^\top \mathbf{x} + b < 0 \end{cases}$$
| Symbol | Meaning | Domain |
|---|---|---|
| $\mathbf{x}_i$ | Feature vector of $i$-th training example | $\mathbb{R}^d$ |
| $y_i$ | Class label of $i$-th training example | $\{-1, +1\}$ |
| $\mathbf{w}$ | Normal vector to the hyperplane (weights) | $\mathbb{R}^d$ |
| $b$ | Bias term (offset) | $\mathbb{R}$ |
| $n$ | Number of training examples | $\mathbb{Z}^+$ |
| $d$ | Dimensionality of feature space | $\mathbb{Z}^+$ |
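To make the notation concrete, here is a minimal sketch (the data, candidate hyperplane, and variable names are hypothetical) that stores $\mathcal{D}$ as an $n \times d$ matrix and a label vector and applies the classification rule:

```python
import numpy as np

# Hypothetical training set: n = 4 examples in d = 2 dimensions.
X = np.array([[2.0, 0.0],    # x_1
              [3.0, 1.0],    # x_2
              [-2.0, 0.0],   # x_3
              [-3.0, -1.0]]) # x_4
y = np.array([+1, +1, -1, -1])   # labels in {-1, +1}

# A candidate hyperplane (w, b); not necessarily optimal.
w, b = np.array([1.0, 1.0]), 0.0

# Classification rule: y_hat = sign(w^T x + b), applied to all points at once.
scores = X @ w + b
y_hat = np.where(scores >= 0, +1, -1)

# Unified correctness check: y_i * (w^T x_i + b) > 0 for every i.
print("predictions:", y_hat)
print("all correctly classified:", np.all(y * scores > 0))
```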
The margin concept is fundamentally geometric: it measures the distance from points to the separating hyperplane. To formulate margin maximization mathematically, we must first derive the formula for this distance.
Deriving the Distance Formula:
Consider a point $\mathbf{x}_0$ and a hyperplane $\mathbf{w}^\top \mathbf{x} + b = 0$. We want to find the perpendicular (shortest) distance from $\mathbf{x}_0$ to this hyperplane.
Step 1: Find a point on the hyperplane
Let $\mathbf{x}_p$ be the point on the hyperplane closest to $\mathbf{x}_0$. Since $\mathbf{x}_p$ lies on the hyperplane:
$$\mathbf{w}^\top \mathbf{x}_p + b = 0$$
Step 2: Express the relationship between $\mathbf{x}_0$ and $\mathbf{x}_p$
The vector from $\mathbf{x}_p$ to $\mathbf{x}_0$ must be perpendicular to the hyperplane, hence parallel to $\mathbf{w}$:
$$\mathbf{x}_0 - \mathbf{x}_p = r \cdot \frac{\mathbf{w}}{|\mathbf{w}|}$$
where $r$ is the signed distance (positive if $\mathbf{x}_0$ is on the positive side of the hyperplane).
Step 3: Solve for the distance
Rearranging: $\mathbf{x}_p = \mathbf{x}_0 - r \cdot \frac{\mathbf{w}}{|\mathbf{w}|}$
Substituting into the hyperplane equation:
$$\mathbf{w}^\top \left(\mathbf{x}_0 - r \cdot \frac{\mathbf{w}}{|\mathbf{w}|}\right) + b = 0$$
$$\mathbf{w}^\top \mathbf{x}_0 - r \cdot \frac{\mathbf{w}^\top \mathbf{w}}{|\mathbf{w}|} + b = 0$$
Since $\mathbf{w}^\top \mathbf{w} = |\mathbf{w}|^2$:
$$\mathbf{w}^\top \mathbf{x}_0 - r \cdot |\mathbf{w}| + b = 0$$
Solving for $r$:
$$r = \frac{\mathbf{w}^\top \mathbf{x}_0 + b}{|\mathbf{w}|}$$
The signed distance from a point $\mathbf{x}_0$ to the hyperplane $\mathbf{w}^\top \mathbf{x} + b = 0$ is:
$$\gamma = \frac{\mathbf{w}^\top \mathbf{x}_0 + b}{|\mathbf{w}|}$$
The absolute (unsigned) distance is $|\gamma| = \frac{|\mathbf{w}^\top \mathbf{x}_0 + b|}{|\mathbf{w}|}$.
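A quick numerical sanity check of the distance formula, using a hypothetical hyperplane and point:

```python
import numpy as np

# Hyperplane w^T x + b = 0 with w = (3, 4), b = -5, and a test point x0.
w = np.array([3.0, 4.0])
b = -5.0
x0 = np.array([3.0, 4.0])

# Signed distance gamma = (w^T x0 + b) / ||w||.
gamma = (w @ x0 + b) / np.linalg.norm(w)
print(gamma)  # (9 + 16 - 5) / 5 = 4.0: x0 sits 4 units on the positive side
```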
Geometric Margin of a Training Example:
For a correctly classified training point $(\mathbf{x}_i, y_i)$, the geometric margin is the perpendicular distance to the hyperplane with the correct sign:
$$\gamma_i = \frac{y_i(\mathbf{w}^\top \mathbf{x}_i + b)}{|\mathbf{w}|}$$
Notice that $y_i(\mathbf{w}^\top \mathbf{x}_i + b) > 0$ for correctly classified points, ensuring $\gamma_i > 0$.
Geometric Margin of the Classifier:
The overall geometric margin of the classifier is the minimum margin across all training points:
$$\gamma = \min_{i=1,\ldots,n} \gamma_i = \min_{i=1,\ldots,n} \frac{y_i(\mathbf{w}^\top \mathbf{x}_i + b)}{|\mathbf{w}|}$$
This is the distance from the hyperplane to the nearest training point—the width of the 'margin band' that we want to maximize.
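Continuing with the hypothetical toy data from earlier, the classifier's geometric margin is simply the minimum of the per-point margins; a short sketch:

```python
import numpy as np

X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
y = np.array([+1, +1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0

# Per-point geometric margins gamma_i = y_i (w^T x_i + b) / ||w||.
margins = y * (X @ w + b) / np.linalg.norm(w)

# Geometric margin of the classifier: the smallest per-point margin.
print("per-point margins:", np.round(margins, 3))
print("classifier margin:", margins.min())
```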
Our goal is clear: find the hyperplane $(\mathbf{w}, b)$ that maximizes the geometric margin $\gamma$ while correctly classifying all training points. This translates directly into an optimization problem.
Initial Formulation:
$$\begin{aligned} \max_{\mathbf{w}, b, \gamma} \quad & \gamma \\ \text{subject to} \quad & \frac{y_i(\mathbf{w}^\top \mathbf{x}_i + b)}{|\mathbf{w}|} \geq \gamma \quad \forall\, i = 1, \ldots, n \end{aligned}$$
This formulation captures our intent: maximize the minimum distance $\gamma$ such that every point is at least $\gamma$ away from the hyperplane on the correct side.
The Problem with This Formulation:
While conceptually correct, this formulation is computationally problematic:
Non-convex constraint: The constraint involves $|\mathbf{w}|$ in the denominator, making the feasible region non-convex.
Infinite solutions: For any valid $(\mathbf{w}, b, \gamma)$, scaling by a constant $c > 0$ gives another valid solution $(c\mathbf{w}, cb, \gamma)$ with the same margin.
Hard to solve: Non-convex optimization problems can have multiple local optima and are generally NP-hard.
The naive margin maximization formulation is mathematically correct but computationally intractable. The brilliance of SVM theory lies in transforming this into an equivalent convex problem with a unique, efficiently computable solution.
Introducing the Functional Margin:
The geometric margin involves division by $|\mathbf{w}|$. To eliminate this, we introduce the functional margin:
$$\hat{\gamma}_i = y_i(\mathbf{w}^\top \mathbf{x}_i + b)$$
The relationship between functional and geometric margins is:
$$\gamma_i = \frac{\hat{\gamma}_i}{|\mathbf{w}|}$$
Unlike the geometric margin, the functional margin does change when we scale $(\mathbf{w}, b)$. This gives us exactly the degree of freedom we need to simplify the problem.
The key insight that unlocks the SVM formulation is recognizing that the scale of $(\mathbf{w}, b)$ is a free parameter. Since the geometric margin is scale-invariant, we can choose any convenient scale without changing the decision boundary.
The Canonical Choice:
We choose to scale $(\mathbf{w}, b)$ such that the minimum functional margin equals exactly 1:
$$\min_{i=1,\ldots,n} y_i(\mathbf{w}^\top \mathbf{x}_i + b) = 1$$
This means the points closest to the hyperplane (the support vectors) have functional margin exactly 1:
$$y_i(\mathbf{w}^\top \mathbf{x}_i + b) = 1 \quad \text{for support vectors}$$
$$y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1 \quad \text{for all points}$$
Setting the minimum functional margin to 1 is a normalization convention, not a constraint that limits the solution. For any hyperplane with minimum functional margin $\hat{\gamma}$, we can rescale to $(\mathbf{w}/\hat{\gamma}, b/\hat{\gamma})$ to achieve minimum functional margin 1. This rescaling doesn't change the decision boundary or the geometric margin.
Consequences of the Canonical Form:
With the canonical scaling, the geometric margin becomes elegantly simple:
$$\gamma = \frac{1}{|\mathbf{w}|}$$
This is remarkable! The margin is now a simple function of $|\mathbf{w}|$ alone.
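A brief sketch verifying this on the hypothetical toy data: rescaling $(\mathbf{w}, b)$ by the minimum functional margin leaves the decision boundary and the geometric margin unchanged, and afterwards the geometric margin equals $1/|\mathbf{w}|$.

```python
import numpy as np

X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
y = np.array([+1, +1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0

def geometric_margin(w, b):
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)

# Minimum functional margin of the current (w, b).
gamma_hat = np.min(y * (X @ w + b))          # 2.0 for this toy set

# Rescale to canonical form: minimum functional margin becomes exactly 1.
w_c, b_c = w / gamma_hat, b / gamma_hat

print("min functional margin after rescaling:", np.min(y * (X @ w_c + b_c)))    # 1.0
print("geometric margin unchanged:", geometric_margin(w, b), geometric_margin(w_c, b_c))
print("1 / ||w_c||:", 1 / np.linalg.norm(w_c))  # equals the geometric margin
```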
The Simplified Maximization:
Maximizing the margin is equivalent to:
$$\max_{\mathbf{w}, b} \frac{1}{|\mathbf{w}|} \quad \Leftrightarrow \quad \min_{\mathbf{w}, b} |\mathbf{w}|$$
And since squaring is monotonically increasing on non-negative values, $|\mathbf{w}|$ and $\frac{1}{2}|\mathbf{w}|^2$ are minimized by the same $(\mathbf{w}, b)$:
$$\min_{\mathbf{w}, b} |\mathbf{w}| \quad \Leftrightarrow \quad \min_{\mathbf{w}, b} \frac{1}{2}|\mathbf{w}|^2$$
The factor of $\frac{1}{2}$ is conventionally included to simplify the gradient (derivative of $\frac{1}{2}|\mathbf{w}|^2$ is just $\mathbf{w}$).
| Formulation | Expression | Reason for Equivalence |
|---|---|---|
| Maximize margin | $\max \gamma$ | Original geometric objective |
| Maximize $1/|\mathbf{w}|$ | $\max \frac{1}{|\mathbf{w}|}$ | In canonical form, $\gamma = 1/|\mathbf{w}|$ |
| Minimize $|\mathbf{w}|$ | $\min |\mathbf{w}|$ | Reciprocal of maximization |
| Minimize $\frac{1}{2}|\mathbf{w}|^2$ | $\min \frac{1}{2}|\mathbf{w}|^2$ | Monotonic transformation, cleaner derivatives |
We are now ready to state the complete optimization problem for hard margin SVM. This is called the primal problem (we will later derive an equivalent dual problem).
The Hard Margin SVM Primal Problem:
$$\begin{aligned} \min_{\mathbf{w}, b} \quad & \frac{1}{2}|\mathbf{w}|^2 = \frac{1}{2}\mathbf{w}^\top \mathbf{w} \\ \text{subject to} \quad & y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1 \quad \forall\, i = 1, \ldots, n \end{aligned}$$
This is a convex quadratic program: quadratic objective with linear constraints.
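The primal can be handed directly to a general-purpose convex solver. Below is a minimal sketch using the cvxpy library (assumed installed); the toy data is hypothetical and linearly separable:

```python
import numpy as np
import cvxpy as cp

# Hypothetical linearly separable data.
X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
y = np.array([+1, +1, -1, -1])
n, d = X.shape

w = cp.Variable(d)
b = cp.Variable()

# Hard margin primal: minimize (1/2)||w||^2  s.t.  y_i (w^T x_i + b) >= 1.
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w* =", w.value, " b* =", b.value)   # for this toy set: w* ~ (0.5, 0), b* ~ 0
print("geometric margin =", 1 / np.linalg.norm(w.value))
print("margin width     =", 2 / np.linalg.norm(w.value))
```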
Dissecting the Formulation:
Objective Function: $\frac{1}{2}|\mathbf{w}|^2 = \frac{1}{2}\sum_{j=1}^d w_j^2$
Constraints: $y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1$ for all $i$
Decision Variables: $(\mathbf{w}, b) \in \mathbb{R}^d \times \mathbb{R}$
Why the Constraints Work:
Expanding $y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1$: for points with $y_i = +1$, the constraint reads $\mathbf{w}^\top \mathbf{x}_i + b \geq 1$; for points with $y_i = -1$, it reads $\mathbf{w}^\top \mathbf{x}_i + b \leq -1$.
The elegant $y_i(\cdot)$ formulation unifies both cases into a single constraint type.
The constraints $y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1$ define a 'margin band'—a region of width $2/|\mathbf{w}|$ where no training points are allowed.
The Three Hyperplanes:
The SVM solution defines three parallel hyperplanes:
Decision Boundary: $\mathbf{w}^\top \mathbf{x} + b = 0$
Positive Margin Hyperplane: $\mathbf{w}^\top \mathbf{x} + b = +1$
Negative Margin Hyperplane: $\mathbf{w}^\top \mathbf{x} + b = -1$
Total Margin Width:
The distance between the two margin hyperplanes is:
$$\text{Margin Width} = \frac{1 - (-1)}{|\mathbf{w}|} = \frac{2}{|\mathbf{w}|}$$
This is twice the geometric margin $\gamma = 1/|\mathbf{w}|$.
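As a quick numeric check with a hypothetical canonical-form weight vector:

$$\mathbf{w} = \begin{pmatrix} 3 \\ 4 \end{pmatrix} \quad\Rightarrow\quad |\mathbf{w}| = 5, \qquad \gamma = \frac{1}{|\mathbf{w}|} = \frac{1}{5}, \qquad \text{Margin Width} = \frac{2}{|\mathbf{w}|} = \frac{2}{5}$$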
Support vectors are the training points that lie exactly on the margin hyperplanes—those for which $y_i(\mathbf{w}^\top \mathbf{x}_i + b) = 1$ (constraint is tight/active). They 'support' the margin band; removing any non-support-vector point would not change the optimal hyperplane.
Why the Empty Margin Band Matters:
The hard margin SVM enforces that no training points fall inside the margin band. This is precisely what the constraint $y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1$ guarantees: every positive point satisfies $\mathbf{w}^\top \mathbf{x}_i + b \geq 1$, every negative point satisfies $\mathbf{w}^\top \mathbf{x}_i + b \leq -1$, and therefore no training point has $|\mathbf{w}^\top \mathbf{x}_i + b| < 1$.
This empty margin band is the geometrical embodiment of the maximum margin principle.
| Region | Condition | Predicted Class | Training Points |
|---|---|---|---|
| Positive region | $\mathbf{w}^\top \mathbf{x} + b \geq 1$ | $+1$ | All $y=+1$ points |
| Margin band (positive side) | $0 \leq \mathbf{w}^\top \mathbf{x} + b < 1$ | $+1$ | Empty (no training points) |
| Margin band (negative side) | $-1 < \mathbf{w}^\top \mathbf{x} + b < 0$ | $-1$ | Empty (no training points) |
| Negative region | $\mathbf{w}^\top \mathbf{x} + b \leq -1$ | $-1$ | All $y=-1$ points |
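Given a hyperplane in canonical form, support vectors and the empty margin band can be checked directly from the functional margins. A sketch with the hypothetical toy data, using $(\mathbf{w}^*, b^*) = ((0.5, 0)^\top, 0)$, which is the optimal hyperplane for this particular toy set:

```python
import numpy as np

X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
y = np.array([+1, +1, -1, -1])

# Optimal canonical-form hyperplane for this toy set.
w, b = np.array([0.5, 0.0]), 0.0

scores = X @ w + b                       # w^T x_i + b
func_margins = y * scores                # y_i (w^T x_i + b), >= 1 for all i

# Support vectors: constraints that hold with equality (up to tolerance).
support = np.isclose(func_margins, 1.0)
print("support vectors:", X[support])                      # [[2, 0], [-2, 0]]
print("inside margin band:", np.any(np.abs(scores) < 1))   # False: the band is empty
```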
Understanding the constraint structure provides insight into the geometry of the feasible region and the nature of the optimal solution.
Rewriting the Constraints:
Each constraint $y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1$ can be written as:
$$-y_i(\mathbf{w}^\top \mathbf{x}_i + b) + 1 \leq 0$$
Define the constraint functions:
$$g_i(\mathbf{w}, b) = 1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b)$$
The feasible region is:
$$\mathcal{F} = \{(\mathbf{w}, b) : g_i(\mathbf{w}, b) \leq 0 \text{ for all } i\}$$
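A small sketch (hypothetical toy data) that evaluates the constraint functions $g_i$ and tests membership in $\mathcal{F}$:

```python
import numpy as np

X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
y = np.array([+1, +1, -1, -1])

def g(w, b):
    """Constraint functions g_i(w, b) = 1 - y_i (w^T x_i + b)."""
    return 1 - y * (X @ w + b)

def feasible(w, b, tol=1e-9):
    """(w, b) lies in the feasible region F iff every g_i(w, b) <= 0."""
    return np.all(g(w, b) <= tol)

print(feasible(np.array([0.5, 0.0]), 0.0))  # True: the optimal hyperplane
print(feasible(np.array([0.1, 0.0]), 0.0))  # False: separates, but functional margins < 1
```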
Properties of the Feasible Region:
Each constraint $g_i(\mathbf{w}, b) \leq 0$ is linear in $(\mathbf{w}, b)$ and therefore defines a closed half-space. The feasible region $\mathcal{F}$ is the intersection of these half-spaces, so it is convex, and it is non-empty exactly when the data is linearly separable.
Active vs. Inactive Constraints:
At the optimal solution $(\mathbf{w}^*, b^*)$, a constraint is active if it holds with equality, $g_i(\mathbf{w}^*, b^*) = 0$, i.e., $y_i(\mathbf{w}^{*\top} \mathbf{x}_i + b^*) = 1$; these points are the support vectors. A constraint is inactive if $g_i(\mathbf{w}^*, b^*) < 0$, meaning the point lies strictly outside the margin band.
Why Support Vectors Determine the Solution:
The optimal hyperplane is determined entirely by the support vectors. Here's why: only the active constraints can influence the optimum. If a point's constraint is inactive, removing that point leaves the feasible region unchanged in a neighborhood of $(\mathbf{w}^*, b^*)$, so $(\mathbf{w}^*, b^*)$ remains a local, and by convexity a global, optimum of the reduced problem.
This is the key to SVM's elegance: the solution depends only on a small subset of training points, making it robust and interpretable.
Among potentially millions of training points, typically only a small fraction are support vectors. This sparsity makes SVMs memory-efficient at test time and provides interpretability: you can examine the specific points that define the decision boundary.
For any optimization problem, we must establish whether a solution exists and whether it is unique. The hard margin SVM has strong theoretical guarantees.
Existence:
A solution exists if and only if the training data is linearly separable. When linear separability holds, the feasible region is non-empty and closed, the objective is continuous, convex, and bounded below, and the minimum is attained by at least one $(\mathbf{w}, b)$.
Uniqueness of $\mathbf{w}^*$:
The optimal weight vector $\mathbf{w}^*$ is unique because the objective $\frac{1}{2}|\mathbf{w}|^2$ is strictly convex in $\mathbf{w}$ and is minimized over a convex feasible region: if two optimal solutions had distinct weight vectors, their midpoint would be feasible with a strictly smaller objective value, a contradiction.
Uniqueness of $b^*$:
The optimal bias $b^*$ is also unique: at the optimum there is at least one support vector in each class, and together with the unique $\mathbf{w}^*$ these active constraints pin down $b^*$.
$b^*$ can be computed from any support vector: $$b^* = y_k - \mathbf{w}^{*\top} \mathbf{x}_k$$
where $k$ is any support vector index. For numerical stability, average over all support vectors.
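A minimal sketch of this computation on the hypothetical toy data, where the support vector indices are assumed to be already known (in practice they come from the solver):

```python
import numpy as np

X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
y = np.array([+1, +1, -1, -1])
w_star = np.array([0.5, 0.0])   # optimal weights for the toy set
sv = [0, 2]                     # support vector indices (assumed known here)

# b* = y_k - w*^T x_k for any support vector k; averaging improves stability.
b_candidates = y[sv] - X[sv] @ w_star
print("per-SV estimates:", b_candidates)    # [0. 0.]
print("averaged b*:", b_candidates.mean())  # 0.0
```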
Theorem: For linearly separable data, the hard margin SVM optimization problem has a unique optimal solution $(\mathbf{w}^*, b^*)$. This solution defines the maximum margin hyperplane, and any algorithm that finds it will produce identical results.
When Existence Fails:
If the data is not linearly separable, the constraint set is empty: no $(\mathbf{w}, b)$ satisfies $y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1$ for all $i$, and the optimization problem has no feasible solution.
This is not a numerical failure but a fundamental limitation. It's why soft margin SVM was developed: to handle non-separable data by allowing some constraint violations.
Detecting Infeasibility:
Quadratic programming solvers will detect infeasibility and report it. In practice, an infeasibility report on a hard margin SVM is a strong signal that the data is not linearly separable and that a soft margin formulation is needed.
The hard margin SVM formulation is not just theoretically elegant—it's computationally efficient. Understanding why requires appreciating its structure.
Convex Quadratic Program (QP):
The SVM primal is a convex QP:
$$\begin{aligned} \min_{\mathbf{w}, b} \quad & \frac{1}{2}\mathbf{w}^\top \mathbf{w} \\ \text{s.t.} \quad & \mathbf{A}\begin{pmatrix} \mathbf{w} \\ b \end{pmatrix} \leq -\mathbf{1} \end{aligned}$$
where the $i$-th row of $\mathbf{A}$ is $-y_i(\mathbf{x}_i^\top, 1)$, so each row of the matrix inequality reproduces the constraint $y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1$.
Standard QP Form:
Convex QPs have the general form: $$\min_\mathbf{z} \frac{1}{2}\mathbf{z}^\top \mathbf{Q}\mathbf{z} + \mathbf{c}^\top\mathbf{z} \quad \text{s.t.} \quad \mathbf{A}\mathbf{z} \leq \mathbf{h}$$
For SVM: $\mathbf{z} = (\mathbf{w}, b)$, $\mathbf{Q}$ is the identity matrix padded with a zero row and column for $b$, $\mathbf{c} = \mathbf{0}$, $\mathbf{A}$ has rows $-y_i(\mathbf{x}_i^\top, 1)$, and $\mathbf{h} = -\mathbf{1}$.
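A sketch of this mapping (hypothetical toy data; cvxpy is again assumed installed as the QP solver), building $\mathbf{Q}$, $\mathbf{c}$, $\mathbf{A}$, and $\mathbf{h}$ explicitly:

```python
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
y = np.array([+1, +1, -1, -1])
n, d = X.shape

# Standard QP data for z = (w, b): min (1/2) z^T Q z + c^T z  s.t.  A z <= h.
Q = np.zeros((d + 1, d + 1))
Q[:d, :d] = np.eye(d)                               # quadratic term acts on w only
c = np.zeros(d + 1)
A = -y[:, None] * np.hstack([X, np.ones((n, 1))])   # row i: -y_i [x_i^T, 1]
h = -np.ones(n)                                     # A z <= -1 encodes y_i(w^T x_i + b) >= 1

z = cp.Variable(d + 1)
prob = cp.Problem(cp.Minimize(0.5 * cp.quad_form(z, Q) + c @ z), [A @ z <= h])
prob.solve()
print("w* =", z.value[:d], " b* =", z.value[d])     # ~ (0.5, 0.0) and ~ 0.0 for this toy set
```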
Complexity Analysis:
Using interior point methods, a convex QP of this form can be solved to high accuracy in time polynomial in the number of variables ($d + 1$ here) and the number of constraints ($n$).
For large $n$ (many examples), the dual formulation is often preferred, as we'll see. For large $d$ (high-dimensional features), the primal may be more efficient.
This complexity analysis motivates the development of specialized algorithms like SMO (Sequential Minimal Optimization), which exploit problem structure for efficiency.
The tractability of the SVM optimization problem is why SVMs became practical. Unlike competing approaches that might require exhaustive search or have no convergence guarantees, SVM provides a unique, globally optimal solution in polynomial time. This mathematical elegance translates directly into reliable, predictable performance.
We have established the complete mathematical foundation for hard margin SVM. Let's consolidate what we've learned: the geometric margin of a classifier is the minimum distance from the hyperplane to any training point; fixing the canonical scale (minimum functional margin equal to 1) turns margin maximization into minimizing $\frac{1}{2}|\mathbf{w}|^2$ subject to $y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1$; and the result is a convex quadratic program with a unique, efficiently computable solution determined by the support vectors.
The primal formulation, while elegant, reveals only part of the SVM story. In the next page, we explore Quadratic Programming in depth—understanding the structure of QP problems, solution methods, and why the SVM QP has special properties that enable efficient solving. This sets the stage for the Lagrangian dual, which unlocks kernel methods.
The Journey So Far:
$$\text{Margin intuition} \rightarrow \text{Geometric margin formula} \rightarrow \text{Canonical normalization} \rightarrow \text{Convex QP formulation}$$
With the primal formulation established, we now possess the mathematical machinery to solve hard margin SVM. But the true power of SVMs—particularly their ability to handle nonlinear boundaries through kernels—emerges from the dual formulation, which we'll develop through Lagrangian methods.