In the previous module, we developed the geometric intuition behind maximum margin classifiers. We saw that among all hyperplanes that correctly separate two classes, the maximum margin hyperplane achieves the greatest distance to the nearest data points—the support vectors. This margin provides a buffer zone that enhances generalization to unseen data.
Now we face a fundamental challenge: how do we actually find this optimal hyperplane? The answer lies in transforming our geometric intuition into a precise mathematical optimization problem. This page lays the complete mathematical foundation, taking you from first principles to the canonical hard margin SVM optimization formulation.
What makes this formulation remarkable is not just its elegance, but its computational tractability. We will see how a seemingly complex geometric problem reduces to a well-studied class of optimization problems with guaranteed solutions.
By the end of this page, you will understand: (1) the precise mathematical formulation of the maximum margin objective, (2) how margin maximization translates to a minimization problem, (3) why the constraints enforce correct classification, (4) the significance of the canonical form, and (5) why this formulation is computationally tractable.
Before deriving the optimization problem, we must establish precise notation and clearly state our assumptions. Mathematical rigor demands unambiguous definitions.
Training Data:
We have a training set of $n$ labeled examples:
$$\mathcal{D} = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\}$$
where $\mathbf{x}_i \in \mathbb{R}^d$ is the feature vector of the $i$-th example and $y_i \in \{-1, +1\}$ is its class label.
The Separating Hyperplane:
A hyperplane in $\mathbb{R}^d$ is defined by:
$$\mathbf{w}^\top \mathbf{x} + b = 0$$
where $\mathbf{w} \in \mathbb{R}^d$ is the normal vector to the hyperplane and $b \in \mathbb{R}$ is the bias (offset) term.
The hyperplane divides the space into two half-spaces: the positive half-space where $\mathbf{w}^\top \mathbf{x} + b > 0$ and the negative half-space where $\mathbf{w}^\top \mathbf{x} + b < 0$.
The choice of $y_i \in \{-1, +1\}$ rather than $\{0, 1\}$ is deliberate. It creates mathematical symmetry that simplifies derivations significantly. With this encoding, we can express correct classification elegantly: a point is correctly classified if and only if $y_i(\mathbf{w}^\top \mathbf{x}_i + b) > 0$. This single unified condition handles both classes without case splitting.
The Linear Separability Assumption:
For hard margin SVM, we assume the training data is linearly separable: there exists at least one hyperplane that correctly classifies all training points. Formally:
$$\exists\, \mathbf{w}, b \text{ such that } y_i(\mathbf{w}^\top \mathbf{x}_i + b) > 0 \quad \forall\, i = 1, \ldots, n$$
This assumption is fundamental to the hard margin formulation. When it fails (which happens frequently with real-world data), we must turn to the soft margin SVM, which we'll study in a later module.
Classification Rule:
Given a hyperplane defined by $(\mathbf{w}, b)$, we classify a new point $\mathbf{x}$ as:
$$\hat{y} = \text{sign}(\mathbf{w}^\top \mathbf{x} + b) = \begin{cases} +1 & \text{if } \mathbf{w}^\top \mathbf{x} + b \geq 0 \\ -1 & \text{if } \mathbf{w}^\top \mathbf{x} + b < 0 \end{cases}$$
| Symbol | Meaning | Domain |
|---|---|---|
| $\mathbf{x}_i$ | Feature vector of $i$-th training example | $\mathbb{R}^d$ |
| $y_i$ | Class label of $i$-th training example | $\{-1, +1\}$ |
| $\mathbf{w}$ | Normal vector to the hyperplane (weights) | $\mathbb{R}^d$ |
| $b$ | Bias term (offset) | $\mathbb{R}$ |
| $n$ | Number of training examples | $\mathbb{Z}^+$ |
| $d$ | Dimensionality of feature space | $\mathbb{Z}^+$ |
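To make the notation concrete, here is a minimal sketch (the data, candidate hyperplane, and variable names are hypothetical) that stores $\mathcal{D}$ as an $n \times d$ matrix and a label vector and applies the classification rule:

```python
import numpy as np

# Hypothetical training set: n = 4 examples in d = 2 dimensions.
X = np.array([[2.0, 0.0],    # x_1
              [3.0, 1.0],    # x_2
              [-2.0, 0.0],   # x_3
              [-3.0, -1.0]]) # x_4
y = np.array([+1, +1, -1, -1])   # labels in {-1, +1}

# A candidate hyperplane (w, b); not necessarily optimal.
w, b = np.array([1.0, 1.0]), 0.0

# Classification rule: y_hat = sign(w^T x + b), applied to all points at once.
scores = X @ w + b
y_hat = np.where(scores >= 0, +1, -1)

# Unified correctness check: y_i * (w^T x_i + b) > 0 for every i.
print("predictions:", y_hat)
print("all correctly classified:", np.all(y * scores > 0))
```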
The margin concept is fundamentally geometric: it measures the distance from points to the separating hyperplane. To formulate margin maximization mathematically, we must first derive the formula for this distance.
Deriving the Distance Formula:
Consider a point $\mathbf{x}_0$ and a hyperplane $\mathbf{w}^\top \mathbf{x} + b = 0$. We want to find the perpendicular (shortest) distance from $\mathbf{x}_0$ to this hyperplane.
Step 1: Find a point on the hyperplane
Let $\mathbf{x}_p$ be the point on the hyperplane closest to $\mathbf{x}_0$. Since $\mathbf{x}_p$ lies on the hyperplane:
$$\mathbf{w}^\top \mathbf{x}_p + b = 0$$
Step 2: Express the relationship between $\mathbf{x}_0$ and $\mathbf{x}_p$
The vector from $\mathbf{x}_p$ to $\mathbf{x}_0$ must be perpendicular to the hyperplane, hence parallel to $\mathbf{w}$:
$$\mathbf{x}_0 - \mathbf{x}_p = r \cdot \frac{\mathbf{w}}{|\mathbf{w}|}$$
where $r$ is the signed distance (positive if $\mathbf{x}_0$ is on the positive side of the hyperplane).
Step 3: Solve for the distance
Rearranging: $\mathbf{x}_p = \mathbf{x}_0 - r \cdot \frac{\mathbf{w}}{|\mathbf{w}|}$
Substituting into the hyperplane equation:
$$\mathbf{w}^\top \left(\mathbf{x}_0 - r \cdot \frac{\mathbf{w}}{|\mathbf{w}|}\right) + b = 0$$
$$\mathbf{w}^\top \mathbf{x}_0 - r \cdot \frac{\mathbf{w}^\top \mathbf{w}}{|\mathbf{w}|} + b = 0$$
Since $\mathbf{w}^\top \mathbf{w} = |\mathbf{w}|^2$:
$$\mathbf{w}^\top \mathbf{x}_0 - r \cdot |\mathbf{w}| + b = 0$$
Solving for $r$:
$$r = \frac{\mathbf{w}^\top \mathbf{x}_0 + b}{|\mathbf{w}|}$$
The signed distance from a point $\mathbf{x}_0$ to the hyperplane $\mathbf{w}^\top \mathbf{x} + b = 0$ is:
$$\gamma = \frac{\mathbf{w}^\top \mathbf{x}_0 + b}{|\mathbf{w}|}$$
The absolute (unsigned) distance is $|\gamma| = \frac{|\mathbf{w}^\top \mathbf{x}_0 + b|}{|\mathbf{w}|}$.
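A quick numerical sanity check of the distance formula, using a hypothetical hyperplane and point:

```python
import numpy as np

# Hyperplane w^T x + b = 0 with w = (3, 4), b = -5, and a test point x0.
w = np.array([3.0, 4.0])
b = -5.0
x0 = np.array([3.0, 4.0])

# Signed distance gamma = (w^T x0 + b) / ||w||.
gamma = (w @ x0 + b) / np.linalg.norm(w)
print(gamma)  # (9 + 16 - 5) / 5 = 4.0: x0 sits 4 units on the positive side
```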
Geometric Margin of a Training Example:
For a correctly classified training point $(\mathbf{x}_i, y_i)$, the geometric margin is the perpendicular distance to the hyperplane with the correct sign:
$$\gamma_i = \frac{y_i(\mathbf{w}^\top \mathbf{x}_i + b)}{|\mathbf{w}|}$$
Notice that $y_i(\mathbf{w}^\top \mathbf{x}_i + b) > 0$ for correctly classified points, ensuring $\gamma_i > 0$.
Geometric Margin of the Classifier:
The overall geometric margin of the classifier is the minimum margin across all training points:
$$\gamma = \min_{i=1,\ldots,n} \gamma_i = \min_{i=1,\ldots,n} \frac{y_i(\mathbf{w}^\top \mathbf{x}_i + b)}{|\mathbf{w}|}$$
This is the distance from the hyperplane to the nearest training point—the width of the 'margin band' that we want to maximize.
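Continuing with the hypothetical toy data from earlier, the classifier's geometric margin is simply the minimum of the per-point margins; a short sketch:

```python
import numpy as np

X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
y = np.array([+1, +1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0

# Per-point geometric margins gamma_i = y_i (w^T x_i + b) / ||w||.
margins = y * (X @ w + b) / np.linalg.norm(w)

# Geometric margin of the classifier: the smallest per-point margin.
print("per-point margins:", np.round(margins, 3))
print("classifier margin:", margins.min())
```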
Our goal is clear: find the hyperplane $(\mathbf{w}, b)$ that maximizes the geometric margin $\gamma$ while correctly classifying all training points. This translates directly into an optimization problem.
Initial Formulation:
$$\begin{aligned} \max_{\mathbf{w}, b, \gamma} \quad & \gamma \\ \text{subject to} \quad & \frac{y_i(\mathbf{w}^\top \mathbf{x}_i + b)}{|\mathbf{w}|} \geq \gamma \quad \forall\, i = 1, \ldots, n \end{aligned}$$
This formulation captures our intent: maximize the minimum distance $\gamma$ such that every point is at least $\gamma$ away from the hyperplane on the correct side.
The Problem with This Formulation:
While conceptually correct, this formulation is computationally problematic:
Non-convex constraint: The constraint involves $|\mathbf{w}|$ in the denominator, making the feasible region non-convex.
Infinite solutions: For any valid $(\mathbf{w}, b, \gamma)$, scaling by a constant $c > 0$ gives another valid solution $(c\mathbf{w}, cb, \gamma)$ with the same margin.
Hard to solve: Non-convex optimization problems can have multiple local optima and are generally NP-hard.
The naive margin maximization formulation is mathematically correct but computationally intractable. The brilliance of SVM theory lies in transforming this into an equivalent convex problem with a unique, efficiently computable solution.
Introducing the Functional Margin:
The geometric margin involves division by $|\mathbf{w}|$. To eliminate this, we introduce the functional margin:
$$\hat{\gamma}_i = y_i(\mathbf{w}^\top \mathbf{x}_i + b)$$
The relationship between functional and geometric margins is:
$$\gamma_i = \frac{\hat{\gamma}_i}{|\mathbf{w}|}$$
Unlike the geometric margin, the functional margin does change when we scale $(\mathbf{w}, b)$. This gives us exactly the degree of freedom we need to simplify the problem.
The key insight that unlocks the SVM formulation is recognizing that the scale of $(\mathbf{w}, b)$ is a free parameter. Since the geometric margin is scale-invariant, we can choose any convenient scale without changing the decision boundary.
The Canonical Choice:
We choose to scale $(\mathbf{w}, b)$ such that the minimum functional margin equals exactly 1:
$$\min_{i=1,\ldots,n} y_i(\mathbf{w}^\top \mathbf{x}_i + b) = 1$$
This means the points closest to the hyperplane (the support vectors) have functional margin exactly 1:
$$y_i(\mathbf{w}^\top \mathbf{x}_i + b) = 1 \quad \text{for support vectors}$$
$$y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1 \quad \text{for all points}$$
Setting the minimum functional margin to 1 is a normalization convention, not a constraint that limits the solution. For any hyperplane with minimum functional margin $\hat{\gamma}$, we can rescale to $(\mathbf{w}/\hat{\gamma}, b/\hat{\gamma})$ to achieve minimum functional margin 1. This rescaling doesn't change the decision boundary or the geometric margin.
Consequences of the Canonical Form:
With the canonical scaling, the geometric margin becomes elegantly simple:
$$\gamma = \frac{1}{|\mathbf{w}|}$$
This is remarkable! The margin is now a simple function of $|\mathbf{w}|$ alone.
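A brief sketch verifying this on the hypothetical toy data: rescaling $(\mathbf{w}, b)$ by the minimum functional margin leaves the decision boundary and the geometric margin unchanged, and afterwards the geometric margin equals $1/|\mathbf{w}|$.

```python
import numpy as np

X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
y = np.array([+1, +1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0

def geometric_margin(w, b):
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)

# Minimum functional margin of the current (w, b).
gamma_hat = np.min(y * (X @ w + b))          # 2.0 for this toy set

# Rescale to canonical form: minimum functional margin becomes exactly 1.
w_c, b_c = w / gamma_hat, b / gamma_hat

print("min functional margin after rescaling:", np.min(y * (X @ w_c + b_c)))    # 1.0
print("geometric margin unchanged:", geometric_margin(w, b), geometric_margin(w_c, b_c))
print("1 / ||w_c||:", 1 / np.linalg.norm(w_c))  # equals the geometric margin
```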
The Simplified Maximization:
Maximizing the margin is equivalent to:
$$\max_{\mathbf{w}, b} \frac{1}{|\mathbf{w}|} \quad \Leftrightarrow \quad \min_{\mathbf{w}, b} |\mathbf{w}|$$
And since squaring is monotonically increasing on non-negative values, $|\mathbf{w}|$ and $\frac{1}{2}|\mathbf{w}|^2$ are minimized by the same $(\mathbf{w}, b)$:
$$\min_{\mathbf{w}, b} |\mathbf{w}| \quad \Leftrightarrow \quad \min_{\mathbf{w}, b} \frac{1}{2}|\mathbf{w}|^2$$
The factor of $\frac{1}{2}$ is conventionally included to simplify the gradient (derivative of $\frac{1}{2}|\mathbf{w}|^2$ is just $\mathbf{w}$).
| Formulation | Expression | Reason for Equivalence |
|---|---|---|
| Maximize margin | $\max \gamma$ | Original geometric objective |
| Maximize $1/|\mathbf{w}|$ | $\max \frac{1}{|\mathbf{w}|}$ | In canonical form, $\gamma = 1/|\mathbf{w}|$ |
| Minimize $|\mathbf{w}|$ | $\min |\mathbf{w}|$ | Reciprocal of maximization |
| Minimize $\frac{1}{2}|\mathbf{w}|^2$ | $\min \frac{1}{2}|\mathbf{w}|^2$ | Monotonic transformation, cleaner derivatives |
We are now ready to state the complete optimization problem for hard margin SVM. This is called the primal problem (we will later derive an equivalent dual problem).
The Hard Margin SVM Primal Problem:
$$\begin{aligned} \min_{\mathbf{w}, b} \quad & \frac{1}{2}|\mathbf{w}|^2 = \frac{1}{2}\mathbf{w}^\top \mathbf{w} \\ \text{subject to} \quad & y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1 \quad \forall\, i = 1, \ldots, n \end{aligned}$$
This is a convex quadratic program: quadratic objective with linear constraints.
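The primal can be handed directly to a general-purpose convex solver. Below is a minimal sketch using the cvxpy library (assumed installed); the toy data is hypothetical and linearly separable:

```python
import numpy as np
import cvxpy as cp

# Hypothetical linearly separable data.
X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
y = np.array([+1, +1, -1, -1])
n, d = X.shape

w = cp.Variable(d)
b = cp.Variable()

# Hard margin primal: minimize (1/2)||w||^2  s.t.  y_i (w^T x_i + b) >= 1.
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w* =", w.value, " b* =", b.value)   # for this toy set: w* ~ (0.5, 0), b* ~ 0
print("geometric margin =", 1 / np.linalg.norm(w.value))
print("margin width     =", 2 / np.linalg.norm(w.value))
```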
Dissecting the Formulation:
Objective Function: $\frac{1}{2}|\mathbf{w}|^2 = \frac{1}{2}\sum_{j=1}^d w_j^2$
Constraints: $y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1$ for all $i$
Decision Variables: $(\mathbf{w}, b) \in \mathbb{R}^d \times \mathbb{R}$
Why the Constraints Work:
Expanding $y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1$: for points with $y_i = +1$, the constraint reads $\mathbf{w}^\top \mathbf{x}_i + b \geq 1$; for points with $y_i = -1$, it reads $\mathbf{w}^\top \mathbf{x}_i + b \leq -1$.
The elegant $y_i(\cdot)$ formulation unifies both cases into a single constraint type.
The constraints $y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1$ define a 'margin band'—a region of width $2/|\mathbf{w}|$ where no training points are allowed.
The Three Hyperplanes:
The SVM solution defines three parallel hyperplanes:
Decision Boundary: $\mathbf{w}^\top \mathbf{x} + b = 0$
Positive Margin Hyperplane: $\mathbf{w}^\top \mathbf{x} + b = +1$
Negative Margin Hyperplane: $\mathbf{w}^\top \mathbf{x} + b = -1$
Total Margin Width:
The distance between the two margin hyperplanes is:
$$\text{Margin Width} = \frac{1 - (-1)}{|\mathbf{w}|} = \frac{2}{|\mathbf{w}|}$$
This is twice the geometric margin $\gamma = 1/|\mathbf{w}|$.
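As a quick numeric check with a hypothetical canonical-form weight vector:

$$\mathbf{w} = \begin{pmatrix} 3 \\ 4 \end{pmatrix} \quad\Rightarrow\quad |\mathbf{w}| = 5, \qquad \gamma = \frac{1}{|\mathbf{w}|} = \frac{1}{5}, \qquad \text{Margin Width} = \frac{2}{|\mathbf{w}|} = \frac{2}{5}$$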
Support vectors are the training points that lie exactly on the margin hyperplanes—those for which $y_i(\mathbf{w}^\top \mathbf{x}_i + b) = 1$ (constraint is tight/active). They 'support' the margin band; removing any non-support-vector point would not change the optimal hyperplane.
Why the Empty Margin Band Matters:
The hard margin SVM enforces that no training points fall inside the margin band. This is precisely what the constraint $y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1$ guarantees: every positive point satisfies $\mathbf{w}^\top \mathbf{x}_i + b \geq 1$, every negative point satisfies $\mathbf{w}^\top \mathbf{x}_i + b \leq -1$, and therefore no training point has $|\mathbf{w}^\top \mathbf{x}_i + b| < 1$.
This empty margin band is the geometrical embodiment of the maximum margin principle.
| Region | Condition | Predicted Class | Training Points |
|---|---|---|---|
| Positive region | $\mathbf{w}^\top \mathbf{x} + b \geq 1$ | $+1$ | All $y=+1$ points |
| Margin band (positive side) | $0 \leq \mathbf{w}^\top \mathbf{x} + b < 1$ | $+1$ | Empty (no training points) |
| Margin band (negative side) | $-1 < \mathbf{w}^\top \mathbf{x} + b < 0$ | $-1$ | Empty (no training points) |
| Negative region | $\mathbf{w}^\top \mathbf{x} + b \leq -1$ | $-1$ | All $y=-1$ points |
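Given a hyperplane in canonical form, support vectors and the empty margin band can be checked directly from the functional margins. A sketch with the hypothetical toy data, using $(\mathbf{w}^*, b^*) = ((0.5, 0)^\top, 0)$, which is the optimal hyperplane for this particular toy set:

```python
import numpy as np

X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
y = np.array([+1, +1, -1, -1])

# Optimal canonical-form hyperplane for this toy set.
w, b = np.array([0.5, 0.0]), 0.0

scores = X @ w + b                       # w^T x_i + b
func_margins = y * scores                # y_i (w^T x_i + b), >= 1 for all i

# Support vectors: constraints that hold with equality (up to tolerance).
support = np.isclose(func_margins, 1.0)
print("support vectors:", X[support])                      # [[2, 0], [-2, 0]]
print("inside margin band:", np.any(np.abs(scores) < 1))   # False: the band is empty
```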
Understanding the constraint structure provides insight into the geometry of the feasible region and the nature of the optimal solution.
Rewriting the Constraints:
Each constraint $y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1$ can be written as:
$$-y_i(\mathbf{w}^\top \mathbf{x}_i + b) + 1 \leq 0$$
Define the constraint functions:
$$g_i(\mathbf{w}, b) = 1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b)$$
The feasible region is:
$$\mathcal{F} = \{(\mathbf{w}, b) : g_i(\mathbf{w}, b) \leq 0 \text{ for all } i\}$$
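A small sketch (hypothetical toy data) that evaluates the constraint functions $g_i$ and tests membership in $\mathcal{F}$:

```python
import numpy as np

X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
y = np.array([+1, +1, -1, -1])

def g(w, b):
    """Constraint functions g_i(w, b) = 1 - y_i (w^T x_i + b)."""
    return 1 - y * (X @ w + b)

def feasible(w, b, tol=1e-9):
    """(w, b) lies in the feasible region F iff every g_i(w, b) <= 0."""
    return np.all(g(w, b) <= tol)

print(feasible(np.array([0.5, 0.0]), 0.0))  # True: the optimal hyperplane
print(feasible(np.array([0.1, 0.0]), 0.0))  # False: separates, but functional margins < 1
```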
Properties of the Feasible Region:
Each constraint $g_i(\mathbf{w}, b) \leq 0$ is linear in $(\mathbf{w}, b)$ and therefore defines a closed half-space. The feasible region $\mathcal{F}$ is the intersection of these half-spaces, so it is convex, and it is non-empty exactly when the data is linearly separable.
Active vs. Inactive Constraints:
At the optimal solution $(\mathbf{w}^*, b^*)$, a constraint is active if it holds with equality, $g_i(\mathbf{w}^*, b^*) = 0$, i.e., $y_i(\mathbf{w}^{*\top} \mathbf{x}_i + b^*) = 1$; these points are the support vectors. A constraint is inactive if $g_i(\mathbf{w}^*, b^*) < 0$, meaning the point lies strictly outside the margin band.
Why Support Vectors Determine the Solution:
The optimal hyperplane is determined entirely by the support vectors. Here's why: only the active constraints can influence the optimum. If a point's constraint is inactive, removing that point leaves the feasible region unchanged in a neighborhood of $(\mathbf{w}^*, b^*)$, so $(\mathbf{w}^*, b^*)$ remains a local, and by convexity a global, optimum of the reduced problem.
This is the key to SVM's elegance: the solution depends only on a small subset of training points, making it robust and interpretable.
Among potentially millions of training points, typically only a small fraction are support vectors. This sparsity makes SVMs memory-efficient at test time and provides interpretability: you can examine the specific points that define the decision boundary.
For any optimization problem, we must establish whether a solution exists and whether it is unique. The hard margin SVM has strong theoretical guarantees.
Existence:
A solution exists if and only if the training data is linearly separable. When linear separability holds, the feasible region is non-empty and closed, the objective is continuous, convex, and bounded below, and the minimum is attained by at least one $(\mathbf{w}, b)$.
Uniqueness of $\mathbf{w}^*$:
The optimal weight vector $\mathbf{w}^*$ is unique because the objective $\frac{1}{2}|\mathbf{w}|^2$ is strictly convex in $\mathbf{w}$ and is minimized over a convex feasible region: if two optimal solutions had distinct weight vectors, their midpoint would be feasible with a strictly smaller objective value, a contradiction.
Uniqueness of $b^*$:
The optimal bias $b^*$ is also unique: at the optimum there is at least one support vector in each class, and together with the unique $\mathbf{w}^*$ these active constraints pin down $b^*$.
$b^*$ can be computed from any support vector: $$b^* = y_k - \mathbf{w}^{*\top} \mathbf{x}_k$$
where $k$ is any support vector index. For numerical stability, average over all support vectors.
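A minimal sketch of this computation on the hypothetical toy data, where the support vector indices are assumed to be already known (in practice they come from the solver):

```python
import numpy as np

X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
y = np.array([+1, +1, -1, -1])
w_star = np.array([0.5, 0.0])   # optimal weights for the toy set
sv = [0, 2]                     # support vector indices (assumed known here)

# b* = y_k - w*^T x_k for any support vector k; averaging improves stability.
b_candidates = y[sv] - X[sv] @ w_star
print("per-SV estimates:", b_candidates)    # [0. 0.]
print("averaged b*:", b_candidates.mean())  # 0.0
```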
Theorem: For linearly separable data, the hard margin SVM optimization problem has a unique optimal solution $(\mathbf{w}^*, b^*)$. This solution defines the maximum margin hyperplane, and any algorithm that finds it will produce identical results.
When Existence Fails:
If the data is not linearly separable, the constraint set is empty: no $(\mathbf{w}, b)$ satisfies $y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1$ for all $i$, and the optimization problem has no feasible solution.
This is not a numerical failure but a fundamental limitation. It's why soft margin SVM was developed: to handle non-separable data by allowing some constraint violations.
Detecting Infeasibility:
Quadratic programming solvers will detect infeasibility and report it. In practice, an infeasibility report on a hard margin SVM is a strong signal that the data is not linearly separable and that a soft margin formulation is needed.
The hard margin SVM formulation is not just theoretically elegant—it's computationally efficient. Understanding why requires appreciating its structure.
Convex Quadratic Program (QP):
The SVM primal is a convex QP:
$$\begin{aligned} \min_{\mathbf{w}, b} \quad & \frac{1}{2}\mathbf{w}^\top \mathbf{w} \\ \text{s.t.} \quad & \mathbf{A}\begin{pmatrix} \mathbf{w} \\ b \end{pmatrix} \leq -\mathbf{1} \end{aligned}$$
where the $i$-th row of $\mathbf{A}$ is $-y_i(\mathbf{x}_i^\top, 1)$, so each row of the matrix inequality reproduces the constraint $y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1$.
Standard QP Form:
Convex QPs have the general form: $$\min_\mathbf{z} \frac{1}{2}\mathbf{z}^\top \mathbf{Q}\mathbf{z} + \mathbf{c}^\top\mathbf{z} \quad \text{s.t.} \quad \mathbf{A}\mathbf{z} \leq \mathbf{h}$$
For SVM: $\mathbf{z} = (\mathbf{w}, b)$, $\mathbf{Q}$ is the identity matrix padded with a zero row and column for $b$, $\mathbf{c} = \mathbf{0}$, $\mathbf{A}$ has rows $-y_i(\mathbf{x}_i^\top, 1)$, and $\mathbf{h} = -\mathbf{1}$.
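A sketch of this mapping (hypothetical toy data; cvxpy is again assumed installed as the QP solver), building $\mathbf{Q}$, $\mathbf{c}$, $\mathbf{A}$, and $\mathbf{h}$ explicitly:

```python
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
y = np.array([+1, +1, -1, -1])
n, d = X.shape

# Standard QP data for z = (w, b): min (1/2) z^T Q z + c^T z  s.t.  A z <= h.
Q = np.zeros((d + 1, d + 1))
Q[:d, :d] = np.eye(d)                               # quadratic term acts on w only
c = np.zeros(d + 1)
A = -y[:, None] * np.hstack([X, np.ones((n, 1))])   # row i: -y_i [x_i^T, 1]
h = -np.ones(n)                                     # A z <= -1 encodes y_i(w^T x_i + b) >= 1

z = cp.Variable(d + 1)
prob = cp.Problem(cp.Minimize(0.5 * cp.quad_form(z, Q) + c @ z), [A @ z <= h])
prob.solve()
print("w* =", z.value[:d], " b* =", z.value[d])     # ~ (0.5, 0.0) and ~ 0.0 for this toy set
```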
Complexity Analysis:
Using interior point methods, a convex QP of this form can be solved to high accuracy in time polynomial in the number of variables ($d + 1$ here) and the number of constraints ($n$).
For large $n$ (many examples), the dual formulation is often preferred, as we'll see. For large $d$ (high-dimensional features), the primal may be more efficient.
This complexity analysis motivates the development of specialized algorithms like SMO (Sequential Minimal Optimization), which exploit problem structure for efficiency.
The tractability of the SVM optimization problem is why SVMs became practical. Unlike competing approaches that might require exhaustive search or have no convergence guarantees, SVM provides a unique, globally optimal solution in polynomial time. This mathematical elegance translates directly into reliable, predictable performance.
We have established the complete mathematical foundation for hard margin SVM. Let's consolidate what we've learned: the geometric margin of a classifier is the minimum distance from the hyperplane to any training point; fixing the canonical scale (minimum functional margin equal to 1) turns margin maximization into minimizing $\frac{1}{2}|\mathbf{w}|^2$ subject to $y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1$; and the result is a convex quadratic program with a unique, efficiently computable solution determined by the support vectors.
The primal formulation, while elegant, reveals only part of the SVM story. In the next page, we explore Quadratic Programming in depth—understanding the structure of QP problems, solution methods, and why the SVM QP has special properties that enable efficient solving. This sets the stage for the Lagrangian dual, which unlocks kernel methods.
The Journey So Far:
$$\text{Margin intuition} \rightarrow \text{Geometric margin formula} \rightarrow \text{Canonical normalization} \rightarrow \text{Convex QP formulation}$$
With the primal formulation established, we now possess the mathematical machinery to solve hard margin SVM. But the true power of SVMs—particularly their ability to handle nonlinear boundaries through kernels—emerges from the dual formulation, which we'll develop through Lagrangian methods.