In the previous page, we derived the hard margin SVM as a convex quadratic program (QP). But what exactly is quadratic programming, and why does it matter that SVM falls into this class?
Quadratic programming represents one of the most important and well-studied optimization problem classes. It sits at the intersection of linear and nonlinear optimization, inheriting the tractability of linear programs while providing the expressiveness needed for problems like SVM.
This page provides a rigorous treatment of QP theory and methods, with particular focus on how these apply to SVM. Understanding QP is essential for choosing and using solvers effectively, diagnosing numerical issues, and following the Lagrangian dual derivation on the next page.
By the end of this page, you will: (1) understand the general QP problem structure, (2) know the conditions that make QPs convex and tractable, (3) understand major solution approaches including interior point and active set methods, (4) recognize the special structure of the SVM QP, and (5) appreciate computational complexity considerations.
A quadratic program (QP) is an optimization problem with a quadratic objective function and linear constraints. The general form is:
$$\begin{aligned} \min_{\mathbf{z} \in \mathbb{R}^p} \quad & \frac{1}{2}\mathbf{z}^\top \mathbf{Q}\mathbf{z} + \mathbf{c}^\top\mathbf{z} \\ \text{subject to} \quad & \mathbf{A}_{eq}\mathbf{z} = \mathbf{b}_{eq} \\ & \mathbf{A}_{ineq}\mathbf{z} \leq \mathbf{b}_{ineq} \end{aligned}$$
where:
- $\mathbf{Q} \in \mathbb{R}^{p \times p}$ is a symmetric matrix defining the quadratic part of the objective,
- $\mathbf{c} \in \mathbb{R}^p$ is the linear coefficient vector,
- $\mathbf{A}_{eq}$, $\mathbf{b}_{eq}$ specify the equality constraints,
- $\mathbf{A}_{ineq}$, $\mathbf{b}_{ineq}$ specify the inequality constraints.
The factor of $\frac{1}{2}$ in the objective is a convention that simplifies notation. With this convention, the gradient of the objective is simply $\nabla f(\mathbf{z}) = \mathbf{Q}\mathbf{z} + \mathbf{c}$. Without it, we'd have an extra factor of 2 in all derivative calculations.
Components of the Objective:
The objective function $f(\mathbf{z}) = \frac{1}{2}\mathbf{z}^\top \mathbf{Q}\mathbf{z} + \mathbf{c}^\top\mathbf{z}$ has two parts: the quadratic term $\frac{1}{2}\mathbf{z}^\top \mathbf{Q}\mathbf{z}$, which couples pairs of variables through $\mathbf{Q}$, and the linear term $\mathbf{c}^\top\mathbf{z}$.
Expanding the quadratic term: $$\frac{1}{2}\mathbf{z}^\top \mathbf{Q}\mathbf{z} = \frac{1}{2}\sum_{i=1}^p\sum_{j=1}^p Q_{ij}z_i z_j$$
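As a quick numerical sanity check (a minimal NumPy sketch; the matrix $\mathbf{Q}$ and point $\mathbf{z}$ below are arbitrary illustrations), the matrix form and the double-sum form of the quadratic term agree:

```python
import numpy as np

# Arbitrary illustrative data: a small symmetric Q and a point z.
rng = np.random.default_rng(0)
p = 4
A = rng.standard_normal((p, p))
Q = (A + A.T) / 2          # symmetrize so Q plays the role of the QP Hessian
z = rng.standard_normal(p)

quad_matrix = 0.5 * z @ Q @ z                                 # (1/2) z^T Q z
quad_sum = 0.5 * sum(Q[i, j] * z[i] * z[j]
                     for i in range(p) for j in range(p))     # double-sum form
print(np.isclose(quad_matrix, quad_sum))                      # True: identical values
```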
Relationship to Other Problem Classes:
| Problem Class | Objective | Constraints | Complexity |
|---|---|---|---|
| LP | Linear | Linear | Polynomial (Simplex/IP) |
| Convex QP | Convex Quadratic | Linear | Polynomial (IP) |
| Non-convex QP | General Quadratic | Linear | NP-hard |
| Convex NLP | Convex | Convex | Polynomial (IP) |
| General NLP | Smooth | Smooth | NP-hard in general |
The matrix $\mathbf{Q}$ determines whether the QP is convex—and convexity is everything for tractability.
Convexity of the Objective:
A function $f$ is convex if its Hessian (matrix of second derivatives) is positive semidefinite everywhere. For the QP objective:
$$f(\mathbf{z}) = \frac{1}{2}\mathbf{z}^\top \mathbf{Q}\mathbf{z} + \mathbf{c}^\top\mathbf{z}$$
The Hessian is: $$\nabla^2 f(\mathbf{z}) = \mathbf{Q}$$
Since $\mathbf{Q}$ is constant (doesn't depend on $\mathbf{z}$), the function is convex if and only if $\mathbf{Q}$ is positive semidefinite (PSD), denoted $\mathbf{Q} \succeq \mathbf{0}$.
Definition of Positive Semidefinite:
A symmetric matrix $\mathbf{Q}$ is positive semidefinite if: $$\mathbf{z}^\top \mathbf{Q}\mathbf{z} \geq 0 \quad \forall \mathbf{z} \in \mathbb{R}^p$$
Equivalently, all eigenvalues of $\mathbf{Q}$ are non-negative.
A QP is convex if and only if $\mathbf{Q} \succeq \mathbf{0}$ (positive semidefinite). When $\mathbf{Q} \succ \mathbf{0}$ (positive definite, all eigenvalues strictly positive), the QP is strictly convex, guaranteeing a unique global optimum.
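A simple way to test this condition in practice (an illustrative helper, not part of any solver API) is to inspect the eigenvalues of the symmetric part of $\mathbf{Q}$:

```python
import numpy as np

def is_convex_qp(Q, tol=1e-10):
    """True if the objective (1/2) z^T Q z + c^T z is convex, i.e. Q is PSD."""
    Q_sym = (Q + Q.T) / 2                     # only the symmetric part matters
    return bool(np.linalg.eigvalsh(Q_sym).min() >= -tol)

print(is_convex_qp(np.array([[2.0, 0.0], [0.0, 1.0]])))   # True: positive definite
print(is_convex_qp(np.array([[1.0, 0.0], [0.0, 0.0]])))   # True: PSD with a zero eigenvalue
print(is_convex_qp(np.array([[1.0, 0.0], [0.0, -1.0]])))  # False: indefinite, non-convex QP
```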
Why Convexity Matters:
| Property | Convex QP | Non-convex QP |
|---|---|---|
| Local = Global | Yes | No |
| Unique solution | Yes (if strictly convex) | Possibly multiple |
| Complexity | Polynomial | NP-hard |
| Solver behavior | Reliable | May find local optima |
For SVM, What Is Q?
Recall the SVM primal: $$\min_{\mathbf{w}, b} \frac{1}{2}\|\mathbf{w}\|^2 = \frac{1}{2}\mathbf{w}^\top\mathbf{w}$$
With decision variable $\mathbf{z} = (\mathbf{w}, b)^\top \in \mathbb{R}^{d+1}$:
$$\mathbf{Q} = \begin{pmatrix} \mathbf{I}_d & \mathbf{0} \\ \mathbf{0}^\top & 0 \end{pmatrix}$$
This matrix has $d$ eigenvalues equal to 1 and one eigenvalue equal to 0 (corresponding to $b$). Since all eigenvalues are $\geq 0$, $\mathbf{Q}$ is positive semidefinite, confirming the SVM primal is a convex QP.
Note: The zero eigenvalue means $\mathbf{Q}$ is not strictly positive definite, so the objective is convex but not strictly convex in $(\mathbf{w}, b)$. However, it is strictly convex in $\mathbf{w}$, which is sufficient for uniqueness of $\mathbf{w}^*$.
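To make this concrete, here is a small NumPy sketch (the feature dimension $d$ below is arbitrary) that builds this $\mathbf{Q}$ and confirms its eigenvalue structure:

```python
import numpy as np

d = 3                                    # illustrative feature dimension
Q = np.zeros((d + 1, d + 1))
Q[:d, :d] = np.eye(d)                    # identity block for w; bias row/column stays zero

print(np.linalg.eigvalsh(Q))             # [0. 1. 1. 1.]: d ones and a single zero eigenvalue
print(np.linalg.eigvalsh(Q).min() >= 0)  # True: Q is PSD, so the SVM primal is a convex QP
```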
The Karush-Kuhn-Tucker (KKT) conditions provide necessary and sufficient conditions for optimality in convex QPs. Understanding KKT is crucial because it forms the theoretical basis for most QP solution methods.
The KKT System for QP:
For a QP with only inequality constraints (like SVM primal): $$\begin{aligned} \min_\mathbf{z} \quad & \frac{1}{2}\mathbf{z}^\top\mathbf{Q}\mathbf{z} + \mathbf{c}^\top\mathbf{z} \\ \text{s.t.} \quad & \mathbf{a}_i^\top\mathbf{z} \leq b_i, \quad i = 1,\ldots,m \end{aligned}$$
The KKT conditions are:
1. Stationarity: $$\mathbf{Q}\mathbf{z} + \mathbf{c} + \sum_{i=1}^m \lambda_i \mathbf{a}_i = \mathbf{0}$$
2. Primal Feasibility: $$\mathbf{a}_i^\top\mathbf{z} \leq b_i \quad \forall i$$
3. Dual Feasibility: $$\lambda_i \geq 0 \quad \forall i$$
4. Complementary Slackness: $$\lambda_i(\mathbf{a}_i^\top\mathbf{z} - b_i) = 0 \quad \forall i$$
Complementary slackness says: for each constraint, either $\lambda_i = 0$ (multiplier is zero) or $\mathbf{a}_i^\top\mathbf{z} = b_i$ (constraint is active/tight). This means multipliers can only be positive for binding constraints. In SVM terms: only support vectors have non-zero multipliers.
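The sketch below illustrates this on a small synthetic, separable dataset (the generated data and the large value of C approximating the hard margin are illustrative choices): scikit-learn's solver reports non-zero multipliers only for the support vectors.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Small, clearly separable synthetic dataset (illustrative only).
X, y = make_blobs(n_samples=40, centers=2, cluster_std=0.5, random_state=0)

# A very large C approximates the hard margin SVM.
clf = SVC(kernel='linear', C=1e6).fit(X, y)

print(clf.support_)     # indices of the support vectors (the active constraints)
print(clf.dual_coef_)   # their (signed) multipliers; all other examples have lambda_i = 0
print(f"{len(clf.support_)} of {len(X)} constraints are active at the optimum")
```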
Why KKT Gives Necessary and Sufficient Conditions:
For convex optimization problems satisfying a constraint qualification (e.g., Slater's condition: a strictly feasible point exists), the KKT conditions are necessary for optimality; because the problem is convex, any point satisfying them is also globally optimal.
This bidirectional relationship means solving the KKT system is equivalent to solving the optimization problem.
Matrix Form of KKT:
For the QP with equality constraints $\mathbf{A}\mathbf{z} = \mathbf{b}$, KKT becomes a linear system:
$$\begin{pmatrix} \mathbf{Q} & \mathbf{A}^\top \\ \mathbf{A} & \mathbf{0} \end{pmatrix} \begin{pmatrix} \mathbf{z} \\ \boldsymbol{\lambda} \end{pmatrix} = \begin{pmatrix} -\mathbf{c} \\ \mathbf{b} \end{pmatrix}$$
This is a $(p + m) \times (p + m)$ linear system called the KKT matrix system. When $\mathbf{Q} \succ \mathbf{0}$ and $\mathbf{A}$ has full row rank, this system has a unique solution.
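Because the equality-constrained case reduces to one linear solve, a minimal NumPy sketch suffices (the specific $\mathbf{Q}$, $\mathbf{c}$, $\mathbf{A}$, $\mathbf{b}$ values are arbitrary illustrations):

```python
import numpy as np

# Equality-constrained QP: min (1/2) z^T Q z + c^T z  subject to  A z = b
Q = np.array([[2.0, 0.0], [0.0, 2.0]])        # positive definite
c = np.array([-2.0, -5.0])
A = np.array([[1.0, 1.0]])                    # one constraint: z1 + z2 = 1
b = np.array([1.0])

p, m = Q.shape[0], A.shape[0]
KKT = np.block([[Q, A.T],
                [A, np.zeros((m, m))]])       # the (p + m) x (p + m) KKT matrix
rhs = np.concatenate([-c, b])

solution = np.linalg.solve(KKT, rhs)
z_star, lam = solution[:p], solution[p:]
print("z* =", z_star, " lambda =", lam)       # z* = [-0.25, 1.25], lambda = [2.5]
```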
For Inequality Constraints:
With inequalities, KKT is not a simple linear system due to complementary slackness. Solution methods must handle the combinatorial aspect of determining which constraints are active.
Interior point methods (IPMs) are the most theoretically elegant and practically efficient algorithms for convex QPs. They approach KKT conditions by following a path through the interior of the feasible region.
The Central Path Idea:
Instead of handling complementary slackness exactly, IPMs relax it with a barrier parameter $\mu > 0$:
$$\lambda_i s_i = \mu \quad \forall i$$
where $s_i = b_i - \mathbf{a}_i^\top\mathbf{z} \geq 0$ is the slack variable.
As $\mu \to 0$, the relaxed condition approaches exact complementarity. IPMs follow this central path from a large $\mu$ (easy problem) to $\mu \to 0$ (original problem).
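A one-dimensional toy problem makes the central path tangible. For $\min_z z^2$ subject to $z \geq 1$, stationarity gives $\lambda = 2z$, and the relaxed condition $\lambda(z-1) = \mu$ yields the closed-form path $z(\mu) = \tfrac{1}{2}\left(1 + \sqrt{1 + 2\mu}\right)$. The sketch below (with illustrative values of $\mu$) shows it converging to the constrained optimum $z^* = 1$:

```python
import numpy as np

# Toy problem: min_z z^2  subject to  z >= 1  (constrained optimum: z* = 1).
# Stationarity: 2z - lambda = 0  =>  lambda = 2z.
# Relaxed complementarity: lambda * (z - 1) = mu  =>  z(mu) = (1 + sqrt(1 + 2*mu)) / 2.
for mu in [10.0, 1.0, 0.1, 0.01, 0.001]:
    z_mu = (1 + np.sqrt(1 + 2 * mu)) / 2
    print(f"mu = {mu:7.3f}   z(mu) = {z_mu:.6f}")
# As mu -> 0, z(mu) -> 1: the central path approaches the true optimum.
```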
Interior point methods achieve $\epsilon$-optimality in $O(\sqrt{m}\log(1/\epsilon))$ iterations, where $m$ is the number of constraints. Each iteration requires solving a linear system of size $(p + m)$, typically $O((p+m)^3)$ operations. This gives polynomial-time complexity for convex QPs.
Primal-Dual Interior Point Method:
The most common variant simultaneously updates the primal variables $\mathbf{z}$, the dual variables $\boldsymbol{\lambda}$, and the slacks $\mathbf{s}$.
Algorithm Outline:
1. Start from a strictly interior point with $\mathbf{s} > 0$ and $\boldsymbol{\lambda} > 0$.
2. Form the relaxed KKT system for the current $\mu$ and compute a Newton step in $(\mathbf{z}, \boldsymbol{\lambda}, \mathbf{s})$.
3. Take a damped step that keeps $\mathbf{s}$ and $\boldsymbol{\lambda}$ strictly positive.
4. Decrease $\mu$ and repeat until the primal/dual residuals and the duality gap fall below tolerance.
Properties:
- Iteration count grows slowly with problem size ($O(\sqrt{m}\log(1/\epsilon))$, as noted above).
- Each iteration is dominated by a single linear solve of the relaxed KKT system.
- Solutions are approximate: complementarity holds only in the limit $\mu \to 0$.
- Warm starting provides limited benefit, since iterates must stay in the interior.
Active set methods represent a different philosophy: they explicitly track which inequality constraints are active (tight) and solve a sequence of equality-constrained subproblems.
Core Idea:
At the optimal solution, some constraints are active ($\mathbf{a}_i^\top\mathbf{z}^* = b_i$) and others are inactive ($\mathbf{a}_i^\top\mathbf{z}^* < b_i$). If we knew which constraints were active, we could treat them as equalities, discard the inactive ones, and solve the resulting equality-constrained QP directly via a single linear KKT system.
The challenge is we don't know the active set in advance. Active set methods iteratively discover it.
Algorithm Outline:
1. Choose an initial working set of constraints to treat as equalities.
2. Solve the equality-constrained QP defined by the working set.
3. If the step would violate an inactive constraint, add that constraint to the working set.
4. If all multipliers in the working set are non-negative, stop; otherwise drop a constraint with a negative multiplier and repeat.
In the worst case, active set methods may require examining all $2^m$ possible active sets (exponential). In practice, they typically perform well, especially when started near the solution (warm starting). However, they lack the polynomial-time guarantee of interior point methods.
Advantages of Active Set Methods:
- Exact solutions: constraints in the final working set are satisfied exactly.
- Excellent warm starting: a good initial working set can cut iterations dramatically.
- Interpretable iterates: the working set directly identifies which constraints matter.
When Active Set Excels:
- Small to medium problems where only a few constraints are active at the optimum.
- Sequences of related problems (e.g., cross-validation or parameter sweeps) where the previous active set is a good starting guess.
Comparison:
| Aspect | Interior Point | Active Set |
|---|---|---|
| Worst-case complexity | Polynomial | Exponential |
| Typical behavior | Predictable | Problem-dependent |
| Warm starting | Limited benefit | Excellent |
| Solution exactness | Approximate | Exact |
| Very large problems | Preferred | Impractical |
The SVM optimization problem has special structure that both enables and motivates specialized solution methods.
The SVM Primal as Standard QP:
Recall the hard margin SVM primal: $$\begin{aligned} \min_{\mathbf{w}, b} \quad & \frac{1}{2}\|\mathbf{w}\|^2 \\ \text{s.t.} \quad & y_i(\mathbf{w}^\top\mathbf{x}_i + b) \geq 1, \quad i = 1,\ldots,n \end{aligned}$$
Casting this as standard QP with $\mathbf{z} = (\mathbf{w}^\top, b)^\top \in \mathbb{R}^{d+1}$:
$$\mathbf{Q} = \begin{pmatrix} \mathbf{I}_d & \mathbf{0}_d \\ \mathbf{0}_d^\top & 0 \end{pmatrix}, \quad \mathbf{c} = \mathbf{0}$$
Constraints: $-y_i(\mathbf{w}^\top\mathbf{x}_i + b) \leq -1$, giving: $$\mathbf{A}_{ineq} = \begin{pmatrix} -y_1\mathbf{x}_1^\top & -y_1 \\ \vdots & \vdots \\ -y_n\mathbf{x}_n^\top & -y_n \end{pmatrix}, \quad \mathbf{b}_{ineq} = -\mathbf{1}$$
The SVM primal QP has: (1) Diagonal $\mathbf{Q}$ (identity block plus zero), making the objective separable in components of $\mathbf{w}$; (2) Dense constraint matrix when features are dense; (3) $n$ constraints and $d+1$ variables; (4) Positive semidefinite $\mathbf{Q}$ with one zero eigenvalue.
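For concreteness, here is a small sketch (with randomly generated illustrative data) that assembles $\mathbf{A}_{ineq}$ and $\mathbf{b}_{ineq}$ exactly as written above and checks their dimensions:

```python
import numpy as np

# Illustrative data: n examples in d dimensions with labels in {-1, +1}.
rng = np.random.default_rng(0)
n, d = 6, 3
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)

# Row i of A_ineq is (-y_i * x_i^T, -y_i); b_ineq is a vector of -1's.
A_ineq = -y[:, None] * np.c_[X, np.ones(n)]
b_ineq = -np.ones(n)

print(A_ineq.shape)   # (6, 4) = (n, d + 1): one row per training example
print(b_ineq.shape)   # (6,)   = (n,)
```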
Problem Dimensions:
| Quantity | Symbol | Meaning |
|---|---|---|
| Variables | $d + 1$ | Feature dimension plus bias |
| Constraints | $n$ | One per training example |
| Active constraints at optimum | $n_{SV}$ | Number of support vectors |
When Is Primal vs Dual Preferred? The primal QP has $d+1$ variables and $n$ constraints, so it is attractive when $d \ll n$. The dual QP (derived on the next page) has $n$ variables and depends on the data only through inner products, making it preferable when $n \ll d$ or when kernels are used.
SVM-Specific Sparsity:
A remarkable property of SVMs: at the optimum, most constraints are inactive. Only support vectors have active constraints. This means the solution is determined entirely by the support vectors, the remaining constraints could be removed without changing the optimum, and algorithms that focus computation on the (small) active set can be far faster than generic solvers.
Understanding the computational complexity of solving the SVM QP is crucial for practical deployment.
General QP Complexity:
For a QP with $p$ variables and $m$ inequality constraints, interior point methods take $O(\sqrt{m}\log(1/\epsilon))$ iterations, each dominated by a linear solve costing roughly $O((p+m)^3)$ operations, while active set methods are exponential in the worst case but often fast in practice.
For SVM Primal ($p = d+1$, $m = n$):
Interior Point complexity: $O(\sqrt{n}(d^3 + d n^2))$
The constraint matrix $\mathbf{A}_{ineq} \in \mathbb{R}^{n \times (d+1)}$ requires $O(nd)$ storage. For large-scale SVM with millions of examples and thousands of features, this matrix alone requires gigabytes of memory. This motivates specialized algorithms that avoid forming the full matrix.
Practical Complexity:
Real-world performance often differs from worst-case bounds:
| Problem Size | Interior Point | Specialized SVM |
|---|---|---|
| Small ($n < 1000$) | Fast, reliable | Overkill |
| Medium ($n \sim 10^4$) | Feasible | Preferred |
| Large ($n \sim 10^5$) | Memory issues | Required |
| Very Large ($n > 10^6$) | Impractical | Approximate methods |
Why Specialized Algorithms?
General QP solvers don't exploit SVM structure: they form and store the full $n \times (d+1)$ constraint matrix, they gain nothing from the fact that only support vectors matter at the optimum, and they cannot work directly with kernel (inner-product) representations of the data.
This motivates specialized algorithms such as SMO and stochastic methods, compared with general interior point approaches below:
| Method | Time Complexity | Space Complexity | Best For |
|---|---|---|---|
| Interior Point (Primal) | $O(n^{2.5}d)$ | $O(nd)$ | Small to medium $n$ |
| Interior Point (Dual) | $O(n^{3.5})$ | $O(n^2)$ | Small $n$, kernels |
| SMO | $O(n^2 d)$ typical | $O(n)$ | Large $n$, sparse SVs |
| Stochastic/Online | $O(nd)$ per pass | $O(d)$ | Very large $n$, approximate |
Understanding available QP solvers helps in selecting the right tool for your SVM implementation.
Commercial Solvers:
| Solver | Method | Strengths |
|---|---|---|
| CPLEX | IPM, Simplex | Industrial strength, excellent support |
| Gurobi | IPM, Simplex | Very fast, good APIs |
| MOSEK | IPM | Specializes in conic optimization |
These solvers are highly optimized but require licenses for commercial use.
Open Source Solvers:
| Solver | Method | Strengths |
|---|---|---|
| OSQP | IPM (first-order) | Fast for medium problems |
| CVXOPT | IPM | Python-friendly |
| qpOASES | Active Set | Good for warm-starting |
SVM-Specific Libraries:
| Library | Algorithm | Notes |
|---|---|---|
| LIBSVM | SMO variant | Gold standard for SVM |
| LIBLINEAR | Dual Coordinate Descent | For linear SVM, very fast |
| scikit-learn | Uses LIBSVM/LIBLINEAR | Easy-to-use Python interface |
For production SVM: use LIBSVM or scikit-learn. For custom QP formulations: use OSQP or CVXOPT. For research/understanding: implement basic algorithms yourself. For enterprise: consider commercial solvers if performance is critical.
Solver Selection Criteria: problem size ($n$ and $d$), whether a kernel is needed, required solution accuracy, opportunities for warm starting, memory budget, and licensing.
Example: Solving SVM with Python
```python
# Using scikit-learn (wraps LIBSVM)
from sklearn.svm import SVC

clf = SVC(kernel='linear', C=1.0)
clf.fit(X_train, y_train)
```

```python
# Using CVXOPT for a custom QP formulation of the hard margin primal
# (X: feature matrix, y: labels in {-1, +1}; data must be linearly separable)
import numpy as np
from cvxopt import matrix, solvers

n, d = X.shape

# Q = diag(I_d, 0): the bias b is not penalized (add a tiny value to Q_np[d, d]
# if the solver reports a singular KKT system)
Q_np = np.zeros((d + 1, d + 1))
Q_np[:d, :d] = np.eye(d)

Q = matrix(Q_np, tc='d')
c = matrix(np.zeros(d + 1), tc='d')
G = matrix(-y[:, None] * np.c_[X, np.ones(n)], tc='d')  # rows: (-y_i x_i^T, -y_i)
h = matrix(-np.ones(n), tc='d')
sol = solvers.qp(Q, c, G, h)
```
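A short follow-up sketch (continuing from the CVXOPT call above, which assumes separable data) for recovering the primal solution:

```python
# sol['x'] stacks the optimal (w, b); convert the CVXOPT matrix to a NumPy array.
z_opt = np.array(sol['x']).ravel()
w_opt, b_opt = z_opt[:d], z_opt[d]

# For separable data, every point should satisfy y_i (w^T x_i + b) >= 1
# up to solver tolerance.
print((y * (X @ w_opt + b_opt)).min())
```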
For practical SVM, prefer specialized libraries. Only use general QP solvers when you need custom formulations or educational understanding.
Even convex QPs can present numerical challenges. Understanding these helps diagnose and fix solver issues.
Conditioning of the Q Matrix:
The condition number $\kappa(\mathbf{Q}) = \|\mathbf{Q}\| \cdot \|\mathbf{Q}^{-1}\|$ affects solver stability: well-conditioned problems (small $\kappa$) converge quickly and accurately, while ill-conditioned ones (large $\kappa$) converge slowly and lose numerical precision.
For the SVM primal, $\mathbf{Q}$ has one zero eigenvalue (from the bias $b$), making it singular. Solvers handle this, but it can cause issues with naive implementations.
Feature Scaling:
When features have vastly different scales, the optimization landscape becomes ill-conditioned:
| Issue | Symptom | Solution |
|---|---|---|
| Features on different scales | Slow convergence | Standardize features |
| Very large feature values | Numerical overflow | Normalize |
| Very small feature values | Underflow, precision loss | Scale up |
Standard preprocessing: Standardize each feature to zero mean and unit variance.
SVM performance is highly sensitive to feature scaling because the margin is measured in the feature space. Features with larger magnitude will dominate the margin calculation. Always standardize: $x'_j = (x_j - \mu_j) / \sigma_j$ for each feature $j$.
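A minimal preprocessing sketch (assuming X_train and X_test hold the raw feature matrices) using scikit-learn's StandardScaler:

```python
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only, then reuse it for the test data,
# so test-set statistics never leak into preprocessing.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # each feature: zero mean, unit variance
X_test_scaled = scaler.transform(X_test)
```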
Infeasibility Detection:
When data is not linearly separable, the hard margin SVM QP is infeasible: no $(\mathbf{w}, b)$ can satisfy all margin constraints simultaneously.
Diagnosis: the solver reports primal infeasibility or fails to converge.
Solution: Use soft margin SVM with slack variables.
Numerical Tolerance:
Solvers use tolerances for primal and dual feasibility, the duality gap, and complementarity residuals, stopping once all fall below threshold.
Default tolerances (typically $10^{-6}$ to $10^{-8}$) are usually appropriate. Tighter tolerances increase computation time; looser tolerances may yield suboptimal solutions.
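As an illustration of where such tolerances are exposed (a sketch; consult your solver's documentation for the authoritative parameter names and defaults):

```python
from cvxopt import solvers
from sklearn.svm import SVC

# CVXOPT: absolute/relative duality-gap tolerances and a feasibility tolerance.
solvers.options['abstol'] = 1e-8
solvers.options['reltol'] = 1e-8
solvers.options['feastol'] = 1e-8

# scikit-learn's SVC: a single stopping tolerance for its SMO-style solver.
clf = SVC(kernel='linear', C=1.0, tol=1e-4)
```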
We have now thoroughly examined quadratic programming as the computational foundation for hard margin SVM. The key insights:
- A QP minimizes a quadratic objective subject to linear constraints, and it is convex exactly when $\mathbf{Q} \succeq \mathbf{0}$.
- The SVM primal is a convex QP with $\mathbf{Q} = \mathrm{diag}(\mathbf{I}_d, 0)$, $d+1$ variables, and $n$ constraints.
- KKT conditions are necessary and sufficient at the optimum; complementary slackness singles out the support vectors.
- Interior point methods offer polynomial-time guarantees; active set methods offer exactness and warm starting.
- Large-scale SVM motivates specialized solvers (SMO, coordinate descent) that exploit support vector sparsity.
With QP understood, we're ready to derive the SVM solution using Lagrangian methods. The next page introduces Lagrangian duality—a powerful framework that transforms the primal QP into an equivalent dual problem. This dual perspective reveals why only support vectors matter and enables the famous kernel trick.
The Story So Far:
$$\text{Margin concept} \xrightarrow{\text{Page 0}} \text{QP formulation} \xrightarrow{\text{This page}} \text{QP solver theory} \xrightarrow{\text{Next}} \text{Lagrangian duality}$$
We've established that SVM is a well-posed optimization problem with guaranteed unique solution, solvable in polynomial time. The Lagrangian approach will reveal the underlying structure more deeply, showing why SVM is not just solvable, but elegantly so.