In the previous page, we derived the hard margin SVM as a convex quadratic program (QP). But what exactly is quadratic programming, and why does it matter that SVM falls into this class?
Quadratic programming represents one of the most important and well-studied optimization problem classes. It sits at the intersection of linear and nonlinear optimization, inheriting the tractability of linear programs while providing the expressiveness needed for problems like SVM.
This page provides a rigorous treatment of QP theory and methods, with particular focus on how these apply to SVM. Understanding QP is essential for choosing and using solvers effectively, diagnosing numerical issues, and following the Lagrangian dual derivation on the next page.
By the end of this page, you will: (1) understand the general QP problem structure, (2) know the conditions that make QPs convex and tractable, (3) understand major solution approaches including interior point and active set methods, (4) recognize the special structure of the SVM QP, and (5) appreciate computational complexity considerations.
A quadratic program (QP) is an optimization problem with a quadratic objective function and linear constraints. The general form is:
$$\begin{aligned} \min_{\mathbf{z} \in \mathbb{R}^p} \quad & \frac{1}{2}\mathbf{z}^\top \mathbf{Q}\mathbf{z} + \mathbf{c}^\top\mathbf{z} \\ \text{subject to} \quad & \mathbf{A}_{eq}\mathbf{z} = \mathbf{b}_{eq} \\ & \mathbf{A}_{ineq}\mathbf{z} \leq \mathbf{b}_{ineq} \end{aligned}$$
where:
- $\mathbf{Q} \in \mathbb{R}^{p \times p}$ is a symmetric matrix defining the quadratic part of the objective,
- $\mathbf{c} \in \mathbb{R}^p$ is the linear coefficient vector,
- $\mathbf{A}_{eq}$, $\mathbf{b}_{eq}$ specify the equality constraints,
- $\mathbf{A}_{ineq}$, $\mathbf{b}_{ineq}$ specify the inequality constraints.
The factor of $\frac{1}{2}$ in the objective is a convention that simplifies notation. With this convention, the gradient of the objective is simply $\nabla f(\mathbf{z}) = \mathbf{Q}\mathbf{z} + \mathbf{c}$. Without it, we'd have an extra factor of 2 in all derivative calculations.
Components of the Objective:
The objective function $f(\mathbf{z}) = \frac{1}{2}\mathbf{z}^\top \mathbf{Q}\mathbf{z} + \mathbf{c}^\top\mathbf{z}$ has two parts: the quadratic term $\frac{1}{2}\mathbf{z}^\top \mathbf{Q}\mathbf{z}$, which couples pairs of variables through $\mathbf{Q}$, and the linear term $\mathbf{c}^\top\mathbf{z}$.
Expanding the quadratic term: $$\frac{1}{2}\mathbf{z}^\top \mathbf{Q}\mathbf{z} = \frac{1}{2}\sum_{i=1}^p\sum_{j=1}^p Q_{ij}z_i z_j$$
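As a quick numerical sanity check (a minimal NumPy sketch; the matrix $\mathbf{Q}$ and point $\mathbf{z}$ below are arbitrary illustrations), the matrix form and the double-sum form of the quadratic term agree:

```python
import numpy as np

# Arbitrary illustrative data: a small symmetric Q and a point z.
rng = np.random.default_rng(0)
p = 4
A = rng.standard_normal((p, p))
Q = (A + A.T) / 2          # symmetrize so Q plays the role of the QP Hessian
z = rng.standard_normal(p)

quad_matrix = 0.5 * z @ Q @ z                                 # (1/2) z^T Q z
quad_sum = 0.5 * sum(Q[i, j] * z[i] * z[j]
                     for i in range(p) for j in range(p))     # double-sum form
print(np.isclose(quad_matrix, quad_sum))                      # True: identical values
```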
Relationship to Other Problem Classes:
| Problem Class | Objective | Constraints | Complexity |
|---|---|---|---|
| LP | Linear | Linear | Polynomial (Simplex/IP) |
| Convex QP | Convex Quadratic | Linear | Polynomial (IP) |
| Non-convex QP | General Quadratic | Linear | NP-hard |
| Convex NLP | Convex | Convex | Polynomial (IP) |
| General NLP | Smooth | Smooth | NP-hard in general |
The matrix $\mathbf{Q}$ determines whether the QP is convex—and convexity is everything for tractability.
Convexity of the Objective:
A function $f$ is convex if its Hessian (matrix of second derivatives) is positive semidefinite everywhere. For the QP objective:
$$f(\mathbf{z}) = \frac{1}{2}\mathbf{z}^\top \mathbf{Q}\mathbf{z} + \mathbf{c}^\top\mathbf{z}$$
The Hessian is: $$\nabla^2 f(\mathbf{z}) = \mathbf{Q}$$
Since $\mathbf{Q}$ is constant (doesn't depend on $\mathbf{z}$), the function is convex if and only if $\mathbf{Q}$ is positive semidefinite (PSD), denoted $\mathbf{Q} \succeq \mathbf{0}$.
Definition of Positive Semidefinite:
A symmetric matrix $\mathbf{Q}$ is positive semidefinite if: $$\mathbf{z}^\top \mathbf{Q}\mathbf{z} \geq 0 \quad \forall \mathbf{z} \in \mathbb{R}^p$$
Equivalently, all eigenvalues of $\mathbf{Q}$ are non-negative.
A QP is convex if and only if $\mathbf{Q} \succeq \mathbf{0}$ (positive semidefinite). When $\mathbf{Q} \succ \mathbf{0}$ (positive definite, all eigenvalues strictly positive), the QP is strictly convex, guaranteeing a unique global optimum.
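A simple way to test this condition in practice (an illustrative helper, not part of any solver API) is to inspect the eigenvalues of the symmetric part of $\mathbf{Q}$:

```python
import numpy as np

def is_convex_qp(Q, tol=1e-10):
    """True if the objective (1/2) z^T Q z + c^T z is convex, i.e. Q is PSD."""
    Q_sym = (Q + Q.T) / 2                     # only the symmetric part matters
    return bool(np.linalg.eigvalsh(Q_sym).min() >= -tol)

print(is_convex_qp(np.array([[2.0, 0.0], [0.0, 1.0]])))   # True: positive definite
print(is_convex_qp(np.array([[1.0, 0.0], [0.0, 0.0]])))   # True: PSD with a zero eigenvalue
print(is_convex_qp(np.array([[1.0, 0.0], [0.0, -1.0]])))  # False: indefinite, non-convex QP
```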
Why Convexity Matters:
| Property | Convex QP | Non-convex QP |
|---|---|---|
| Local = Global | Yes | No |
| Unique solution | Yes (if strictly convex) | Possibly multiple |
| Complexity | Polynomial | NP-hard |
| Solver behavior | Reliable | May find local optima |
For SVM, What Is Q?
Recall the SVM primal: $$\min_{\mathbf{w}, b} \frac{1}{2}\|\mathbf{w}\|^2 = \frac{1}{2}\mathbf{w}^\top\mathbf{w}$$
With decision variable $\mathbf{z} = (\mathbf{w}, b)^\top \in \mathbb{R}^{d+1}$:
$$\mathbf{Q} = \begin{pmatrix} \mathbf{I}_d & \mathbf{0} \\ \mathbf{0}^\top & 0 \end{pmatrix}$$
This matrix has $d$ eigenvalues equal to 1 and one eigenvalue equal to 0 (corresponding to $b$). Since all eigenvalues are $\geq 0$, $\mathbf{Q}$ is positive semidefinite, confirming the SVM primal is a convex QP.
Note: The zero eigenvalue means $\mathbf{Q}$ is not strictly positive definite, so the objective is convex but not strictly convex in $(\mathbf{w}, b)$. However, it is strictly convex in $\mathbf{w}$, which is sufficient for uniqueness of $\mathbf{w}^*$.
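To make this concrete, here is a small NumPy sketch (the feature dimension $d$ below is arbitrary) that builds this $\mathbf{Q}$ and confirms its eigenvalue structure:

```python
import numpy as np

d = 3                                    # illustrative feature dimension
Q = np.zeros((d + 1, d + 1))
Q[:d, :d] = np.eye(d)                    # identity block for w; bias row/column stays zero

print(np.linalg.eigvalsh(Q))             # [0. 1. 1. 1.]: d ones and a single zero eigenvalue
print(np.linalg.eigvalsh(Q).min() >= 0)  # True: Q is PSD, so the SVM primal is a convex QP
```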
The Karush-Kuhn-Tucker (KKT) conditions provide necessary and sufficient conditions for optimality in convex QPs. Understanding KKT is crucial because it forms the theoretical basis for most QP solution methods.
The KKT System for QP:
For a QP with only inequality constraints (like SVM primal): $$\begin{aligned} \min_\mathbf{z} \quad & \frac{1}{2}\mathbf{z}^\top\mathbf{Q}\mathbf{z} + \mathbf{c}^\top\mathbf{z} \\ \text{s.t.} \quad & \mathbf{a}_i^\top\mathbf{z} \leq b_i, \quad i = 1,\ldots,m \end{aligned}$$
The KKT conditions are:
1. Stationarity: $$\mathbf{Q}\mathbf{z} + \mathbf{c} + \sum_{i=1}^m \lambda_i \mathbf{a}_i = \mathbf{0}$$
2. Primal Feasibility: $$\mathbf{a}_i^\top\mathbf{z} \leq b_i \quad \forall i$$
3. Dual Feasibility: $$\lambda_i \geq 0 \quad \forall i$$
4. Complementary Slackness: $$\lambda_i(\mathbf{a}_i^\top\mathbf{z} - b_i) = 0 \quad \forall i$$
Complementary slackness says: for each constraint, either $\lambda_i = 0$ (multiplier is zero) or $\mathbf{a}_i^\top\mathbf{z} = b_i$ (constraint is active/tight). This means multipliers can only be positive for binding constraints. In SVM terms: only support vectors have non-zero multipliers.
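The sketch below illustrates this on a small synthetic, separable dataset (the generated data and the large value of C approximating the hard margin are illustrative choices): scikit-learn's solver reports non-zero multipliers only for the support vectors.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Small, clearly separable synthetic dataset (illustrative only).
X, y = make_blobs(n_samples=40, centers=2, cluster_std=0.5, random_state=0)

# A very large C approximates the hard margin SVM.
clf = SVC(kernel='linear', C=1e6).fit(X, y)

print(clf.support_)     # indices of the support vectors (the active constraints)
print(clf.dual_coef_)   # their (signed) multipliers; all other examples have lambda_i = 0
print(f"{len(clf.support_)} of {len(X)} constraints are active at the optimum")
```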
Why KKT Gives Necessary and Sufficient Conditions:
For convex optimization problems satisfying a constraint qualification (e.g., Slater's condition: a strictly feasible point exists), the KKT conditions are necessary for optimality; because the problem is convex, any point satisfying them is also globally optimal.
This bidirectional relationship means solving the KKT system is equivalent to solving the optimization problem.
Matrix Form of KKT:
For the QP with equality constraints $\mathbf{A}\mathbf{z} = \mathbf{b}$, KKT becomes a linear system:
$$\begin{pmatrix} \mathbf{Q} & \mathbf{A}^\top \\ \mathbf{A} & \mathbf{0} \end{pmatrix} \begin{pmatrix} \mathbf{z} \\ \boldsymbol{\lambda} \end{pmatrix} = \begin{pmatrix} -\mathbf{c} \\ \mathbf{b} \end{pmatrix}$$
This is a $(p + m) \times (p + m)$ linear system called the KKT matrix system. When $\mathbf{Q} \succ \mathbf{0}$ and $\mathbf{A}$ has full row rank, this system has a unique solution.
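Because the equality-constrained case reduces to one linear solve, a minimal NumPy sketch suffices (the specific $\mathbf{Q}$, $\mathbf{c}$, $\mathbf{A}$, $\mathbf{b}$ values are arbitrary illustrations):

```python
import numpy as np

# Equality-constrained QP: min (1/2) z^T Q z + c^T z  subject to  A z = b
Q = np.array([[2.0, 0.0], [0.0, 2.0]])        # positive definite
c = np.array([-2.0, -5.0])
A = np.array([[1.0, 1.0]])                    # one constraint: z1 + z2 = 1
b = np.array([1.0])

p, m = Q.shape[0], A.shape[0]
KKT = np.block([[Q, A.T],
                [A, np.zeros((m, m))]])       # the (p + m) x (p + m) KKT matrix
rhs = np.concatenate([-c, b])

solution = np.linalg.solve(KKT, rhs)
z_star, lam = solution[:p], solution[p:]
print("z* =", z_star, " lambda =", lam)       # z* = [-0.25, 1.25], lambda = [2.5]
```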
For Inequality Constraints:
With inequalities, KKT is not a simple linear system due to complementary slackness. Solution methods must handle the combinatorial aspect of determining which constraints are active.
Interior point methods (IPMs) are the most theoretically elegant and practically efficient algorithms for convex QPs. They approach KKT conditions by following a path through the interior of the feasible region.
The Central Path Idea:
Instead of handling complementary slackness exactly, IPMs relax it with a barrier parameter $\mu > 0$:
$$\lambda_i s_i = \mu \quad \forall i$$
where $s_i = b_i - \mathbf{a}_i^\top\mathbf{z} \geq 0$ is the slack variable.
As $\mu \to 0$, the relaxed condition approaches exact complementarity. IPMs follow this central path from a large $\mu$ (easy problem) to $\mu \to 0$ (original problem).
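A one-dimensional toy problem makes the central path tangible. For $\min_z z^2$ subject to $z \geq 1$, stationarity gives $\lambda = 2z$, and the relaxed condition $\lambda(z-1) = \mu$ yields the closed-form path $z(\mu) = \tfrac{1}{2}\left(1 + \sqrt{1 + 2\mu}\right)$. The sketch below (with illustrative values of $\mu$) shows it converging to the constrained optimum $z^* = 1$:

```python
import numpy as np

# Toy problem: min_z z^2  subject to  z >= 1  (constrained optimum: z* = 1).
# Stationarity: 2z - lambda = 0  =>  lambda = 2z.
# Relaxed complementarity: lambda * (z - 1) = mu  =>  z(mu) = (1 + sqrt(1 + 2*mu)) / 2.
for mu in [10.0, 1.0, 0.1, 0.01, 0.001]:
    z_mu = (1 + np.sqrt(1 + 2 * mu)) / 2
    print(f"mu = {mu:7.3f}   z(mu) = {z_mu:.6f}")
# As mu -> 0, z(mu) -> 1: the central path approaches the true optimum.
```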
Interior point methods achieve $\epsilon$-optimality in $O(\sqrt{m}\log(1/\epsilon))$ iterations, where $m$ is the number of constraints. Each iteration requires solving a linear system of size $(p + m)$, typically $O((p+m)^3)$ operations. This gives polynomial-time complexity for convex QPs.
Primal-Dual Interior Point Method:
The most common variant simultaneously updates the primal variables $\mathbf{z}$, the dual variables $\boldsymbol{\lambda}$, and the slacks $\mathbf{s}$.
Algorithm Outline:
1. Start from a strictly interior point with $\mathbf{s} > 0$ and $\boldsymbol{\lambda} > 0$.
2. Form the relaxed KKT system for the current $\mu$ and compute a Newton step in $(\mathbf{z}, \boldsymbol{\lambda}, \mathbf{s})$.
3. Take a damped step that keeps $\mathbf{s}$ and $\boldsymbol{\lambda}$ strictly positive.
4. Decrease $\mu$ and repeat until the primal/dual residuals and the duality gap fall below tolerance.
Properties:
- Iteration count grows slowly with problem size ($O(\sqrt{m}\log(1/\epsilon))$, as noted above).
- Each iteration is dominated by a single linear solve of the relaxed KKT system.
- Solutions are approximate: complementarity holds only in the limit $\mu \to 0$.
- Warm starting provides limited benefit, since iterates must stay in the interior.
Active set methods represent a different philosophy: they explicitly track which inequality constraints are active (tight) and solve a sequence of equality-constrained subproblems.
Core Idea:
At the optimal solution, some constraints are active ($\mathbf{a}_i^\top\mathbf{z}^* = b_i$) and others are inactive ($\mathbf{a}_i^\top\mathbf{z}^* < b_i$). If we knew which constraints were active, we could treat them as equalities, discard the inactive ones, and solve the resulting equality-constrained QP directly via a single linear KKT system.
The challenge is we don't know the active set in advance. Active set methods iteratively discover it.
Algorithm Outline:
1. Choose an initial working set of constraints to treat as equalities.
2. Solve the equality-constrained QP defined by the working set.
3. If the step would violate an inactive constraint, add that constraint to the working set.
4. If all multipliers in the working set are non-negative, stop; otherwise drop a constraint with a negative multiplier and repeat.
In the worst case, active set methods may require examining all $2^m$ possible active sets (exponential). In practice, they typically perform well, especially when started near the solution (warm starting). However, they lack the polynomial-time guarantee of interior point methods.
Advantages of Active Set Methods:
- Exact solutions: constraints in the final working set are satisfied exactly.
- Excellent warm starting: a good initial working set can cut iterations dramatically.
- Interpretable iterates: the working set directly identifies which constraints matter.
When Active Set Excels:
- Small to medium problems where only a few constraints are active at the optimum.
- Sequences of related problems (e.g., cross-validation or parameter sweeps) where the previous active set is a good starting guess.
Comparison:
| Aspect | Interior Point | Active Set |
|---|---|---|
| Worst-case complexity | Polynomial | Exponential |
| Typical behavior | Predictable | Problem-dependent |
| Warm starting | Limited benefit | Excellent |
| Solution exactness | Approximate | Exact |
| Very large problems | Preferred | Impractical |
The SVM optimization problem has special structure that both enables and motivates specialized solution methods.
The SVM Primal as Standard QP:
Recall the hard margin SVM primal: $$\begin{aligned} \min_{\mathbf{w}, b} \quad & \frac{1}{2}\|\mathbf{w}\|^2 \\ \text{s.t.} \quad & y_i(\mathbf{w}^\top\mathbf{x}_i + b) \geq 1, \quad i = 1,\ldots,n \end{aligned}$$
Casting this as standard QP with $\mathbf{z} = (\mathbf{w}^\top, b)^\top \in \mathbb{R}^{d+1}$:
$$\mathbf{Q} = \begin{pmatrix} \mathbf{I}_d & \mathbf{0}_d \\ \mathbf{0}_d^\top & 0 \end{pmatrix}, \quad \mathbf{c} = \mathbf{0}$$
Constraints: $-y_i(\mathbf{w}^\top\mathbf{x}_i + b) \leq -1$, giving: $$\mathbf{A}_{ineq} = \begin{pmatrix} -y_1\mathbf{x}_1^\top & -y_1 \\ \vdots & \vdots \\ -y_n\mathbf{x}_n^\top & -y_n \end{pmatrix}, \quad \mathbf{b}_{ineq} = -\mathbf{1}$$
The SVM primal QP has: (1) Diagonal $\mathbf{Q}$ (identity block plus zero), making the objective separable in components of $\mathbf{w}$; (2) Dense constraint matrix when features are dense; (3) $n$ constraints and $d+1$ variables; (4) Positive semidefinite $\mathbf{Q}$ with one zero eigenvalue.
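For concreteness, here is a small sketch (with randomly generated illustrative data) that assembles $\mathbf{A}_{ineq}$ and $\mathbf{b}_{ineq}$ exactly as written above and checks their dimensions:

```python
import numpy as np

# Illustrative data: n examples in d dimensions with labels in {-1, +1}.
rng = np.random.default_rng(0)
n, d = 6, 3
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)

# Row i of A_ineq is (-y_i * x_i^T, -y_i); b_ineq is a vector of -1's.
A_ineq = -y[:, None] * np.c_[X, np.ones(n)]
b_ineq = -np.ones(n)

print(A_ineq.shape)   # (6, 4) = (n, d + 1): one row per training example
print(b_ineq.shape)   # (6,)   = (n,)
```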
Problem Dimensions:
| Quantity | Symbol | Meaning |
|---|---|---|
| Variables | $d + 1$ | Feature dimension plus bias |
| Constraints | $n$ | One per training example |
| Active constraints at optimum | $n_{SV}$ | Number of support vectors |
When Is Primal vs Dual Preferred? The primal QP has $d+1$ variables and $n$ constraints, so it is attractive when $d \ll n$. The dual QP (derived on the next page) has $n$ variables and depends on the data only through inner products, making it preferable when $n \ll d$ or when kernels are used.
SVM-Specific Sparsity:
A remarkable property of SVMs: at the optimum, most constraints are inactive. Only support vectors have active constraints. This means the solution is determined entirely by the support vectors, the remaining constraints could be removed without changing the optimum, and algorithms that focus computation on the (small) active set can be far faster than generic solvers.
Understanding the computational complexity of solving the SVM QP is crucial for practical deployment.
General QP Complexity:
For a QP with $p$ variables and $m$ inequality constraints, interior point methods take $O(\sqrt{m}\log(1/\epsilon))$ iterations, each dominated by a linear solve costing roughly $O((p+m)^3)$ operations, while active set methods are exponential in the worst case but often fast in practice.
For SVM Primal ($p = d+1$, $m = n$):
Interior Point complexity: $O(\sqrt{n}(d^3 + d n^2))$
The constraint matrix $\mathbf{A}_{ineq} \in \mathbb{R}^{n \times (d+1)}$ requires $O(nd)$ storage. For large-scale SVM with millions of examples and thousands of features, this matrix alone requires gigabytes of memory. This motivates specialized algorithms that avoid forming the full matrix.
Practical Complexity:
Real-world performance often differs from worst-case bounds:
| Problem Size | Interior Point | Specialized SVM |
|---|---|---|
| Small ($n < 1000$) | Fast, reliable | Overkill |
| Medium ($n \sim 10^4$) | Feasible | Preferred |
| Large ($n \sim 10^5$) | Memory issues | Required |
| Very Large ($n > 10^6$) | Impractical | Approximate methods |
Why Specialized Algorithms?
General QP solvers don't exploit SVM structure: they form and store the full $n \times (d+1)$ constraint matrix, they gain nothing from the fact that only support vectors matter at the optimum, and they cannot work directly with kernel (inner-product) representations of the data.
This motivates specialized algorithms such as SMO and stochastic methods, compared with general interior point approaches below:
| Method | Time Complexity | Space Complexity | Best For |
|---|---|---|---|
| Interior Point (Primal) | $O(n^{2.5}d)$ | $O(nd)$ | Small to medium $n$ |
| Interior Point (Dual) | $O(n^{3.5})$ | $O(n^2)$ | Small $n$, kernels |
| SMO | $O(n^2 d)$ typical | $O(n)$ | Large $n$, sparse SVs |
| Stochastic/Online | $O(nd)$ per pass | $O(d)$ | Very large $n$, approximate |
Understanding available QP solvers helps in selecting the right tool for your SVM implementation.
Commercial Solvers:
| Solver | Method | Strengths |
|---|---|---|
| CPLEX | IPM, Simplex | Industrial strength, excellent support |
| Gurobi | IPM, Simplex | Very fast, good APIs |
| MOSEK | IPM | Specializes in conic optimization |
These solvers are highly optimized but require licenses for commercial use.
Open Source Solvers:
| Solver | Method | Strengths |
|---|---|---|
| OSQP | IPM (first-order) | Fast for medium problems |
| CVXOPT | IPM | Python-friendly |
| qpOASES | Active Set | Good for warm-starting |
SVM-Specific Libraries:
| Library | Algorithm | Notes |
|---|---|---|
| LIBSVM | SMO variant | Gold standard for SVM |
| LIBLINEAR | Dual Coordinate Descent | For linear SVM, very fast |
| scikit-learn | Uses LIBSVM/LIBLINEAR | Easy-to-use Python interface |
For production SVM: use LIBSVM or scikit-learn. For custom QP formulations: use OSQP or CVXOPT. For research/understanding: implement basic algorithms yourself. For enterprise: consider commercial solvers if performance is critical.
Solver Selection Criteria: problem size ($n$ and $d$), whether a kernel is needed, required solution accuracy, opportunities for warm starting, memory budget, and licensing.
Example: Solving SVM with Python
```python
# Using scikit-learn (wraps LIBSVM)
from sklearn.svm import SVC

clf = SVC(kernel='linear', C=1.0)
clf.fit(X_train, y_train)
```

```python
# Using CVXOPT for a custom QP formulation of the hard margin primal
# (X: feature matrix, y: labels in {-1, +1}; data must be linearly separable)
import numpy as np
from cvxopt import matrix, solvers

n, d = X.shape

# Q = diag(I_d, 0): the bias b is not penalized (add a tiny value to Q_np[d, d]
# if the solver reports a singular KKT system)
Q_np = np.zeros((d + 1, d + 1))
Q_np[:d, :d] = np.eye(d)

Q = matrix(Q_np, tc='d')
c = matrix(np.zeros(d + 1), tc='d')
G = matrix(-y[:, None] * np.c_[X, np.ones(n)], tc='d')  # rows: (-y_i x_i^T, -y_i)
h = matrix(-np.ones(n), tc='d')
sol = solvers.qp(Q, c, G, h)
```
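A short follow-up sketch (continuing from the CVXOPT call above, which assumes separable data) for recovering the primal solution:

```python
# sol['x'] stacks the optimal (w, b); convert the CVXOPT matrix to a NumPy array.
z_opt = np.array(sol['x']).ravel()
w_opt, b_opt = z_opt[:d], z_opt[d]

# For separable data, every point should satisfy y_i (w^T x_i + b) >= 1
# up to solver tolerance.
print((y * (X @ w_opt + b_opt)).min())
```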
For practical SVM, prefer specialized libraries. Only use general QP solvers when you need custom formulations or educational understanding.
Even convex QPs can present numerical challenges. Understanding these helps diagnose and fix solver issues.
Conditioning of the Q Matrix:
The condition number $\kappa(\mathbf{Q}) = \|\mathbf{Q}\| \cdot \|\mathbf{Q}^{-1}\|$ affects solver stability: well-conditioned problems (small $\kappa$) converge quickly and accurately, while ill-conditioned ones (large $\kappa$) converge slowly and lose numerical precision.
For the SVM primal, $\mathbf{Q}$ has one zero eigenvalue (from the bias $b$), making it singular. Solvers handle this, but it can cause issues with naive implementations.
Feature Scaling:
When features have vastly different scales, the optimization landscape becomes ill-conditioned:
| Issue | Symptom | Solution |
|---|---|---|
| Features on different scales | Slow convergence | Standardize features |
| Very large feature values | Numerical overflow | Normalize |
| Very small feature values | Underflow, precision loss | Scale up |
Standard preprocessing: Standardize each feature to zero mean and unit variance.
SVM performance is highly sensitive to feature scaling because the margin is measured in the feature space. Features with larger magnitude will dominate the margin calculation. Always standardize: $x'_j = (x_j - \mu_j) / \sigma_j$ for each feature $j$.
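A minimal preprocessing sketch (assuming X_train and X_test hold the raw feature matrices) using scikit-learn's StandardScaler:

```python
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only, then reuse it for the test data,
# so test-set statistics never leak into preprocessing.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # each feature: zero mean, unit variance
X_test_scaled = scaler.transform(X_test)
```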
Infeasibility Detection:
When data is not linearly separable, the hard margin SVM QP is infeasible: no $(\mathbf{w}, b)$ can satisfy all margin constraints simultaneously.
Diagnosis: the solver reports primal infeasibility or fails to converge.
Solution: Use soft margin SVM with slack variables.
Numerical Tolerance:
Solvers use tolerances for primal and dual feasibility, the duality gap, and complementarity residuals, stopping once all fall below threshold.
Default tolerances (typically $10^{-6}$ to $10^{-8}$) are usually appropriate. Tighter tolerances increase computation time; looser tolerances may yield suboptimal solutions.
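As an illustration of where such tolerances are exposed (a sketch; consult your solver's documentation for the authoritative parameter names and defaults):

```python
from cvxopt import solvers
from sklearn.svm import SVC

# CVXOPT: absolute/relative duality-gap tolerances and a feasibility tolerance.
solvers.options['abstol'] = 1e-8
solvers.options['reltol'] = 1e-8
solvers.options['feastol'] = 1e-8

# scikit-learn's SVC: a single stopping tolerance for its SMO-style solver.
clf = SVC(kernel='linear', C=1.0, tol=1e-4)
```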
We have now thoroughly examined quadratic programming as the computational foundation for hard margin SVM. The key insights:
- A QP minimizes a quadratic objective subject to linear constraints, and it is convex exactly when $\mathbf{Q} \succeq \mathbf{0}$.
- The SVM primal is a convex QP with $\mathbf{Q} = \mathrm{diag}(\mathbf{I}_d, 0)$, $d+1$ variables, and $n$ constraints.
- KKT conditions are necessary and sufficient at the optimum; complementary slackness singles out the support vectors.
- Interior point methods offer polynomial-time guarantees; active set methods offer exactness and warm starting.
- Large-scale SVM motivates specialized solvers (SMO, coordinate descent) that exploit support vector sparsity.
With QP understood, we're ready to derive the SVM solution using Lagrangian methods. The next page introduces Lagrangian duality—a powerful framework that transforms the primal QP into an equivalent dual problem. This dual perspective reveals why only support vectors matter and enables the famous kernel trick.
The Story So Far:
$$\text{Margin concept} \xrightarrow{\text{Page 0}} \text{QP formulation} \xrightarrow{\text{This page}} \text{QP solver theory} \xrightarrow{\text{Next}} \text{Lagrangian duality}$$
We've established that SVM is a well-posed optimization problem with guaranteed unique solution, solvable in polynomial time. The Lagrangian approach will reveal the underlying structure more deeply, showing why SVM is not just solvable, but elegantly so.