We've developed a powerful framework: Lagrange multipliers for equality constraints, KKT conditions for inequalities, duality theory for alternative formulations. Now we see this machinery in its most celebrated application: Support Vector Machines.
SVMs are not just a classification algorithm—they're a masterclass in how constrained optimization illuminates machine learning. Every concept we've learned manifests in SVMs, from Lagrange multipliers and KKT conditions to duality and the kernel trick.
By the end of this page, you will understand how support vector machines work from the optimization perspective, why the kernel trick is possible, how the SMO algorithm solves the dual, and the deeper insights that constrained optimization provides about SVM behavior and generalization.
Let's consolidate the complete SVM formulation with our optimization lens.
The Maximum Margin Principle:
Given linearly separable training data $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$ with $y_i \in \{-1, +1\}$, we seek the hyperplane that separates the two classes with the largest possible margin.
Why Maximize Margin?
Margin maximization isn't arbitrary; it has deep theoretical justification, which we make precise in the generalization bounds later on this page.
The margin for correctly classified points is $\frac{y_i(\mathbf{w}^T\mathbf{x}_i + b)}{\|\mathbf{w}\|}$.
The minimum margin over all points is maximized when we solve:
$$\max_{\mathbf{w}, b} \frac{1}{\|\mathbf{w}\|} \quad \text{s.t.} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1$$
The constraint y_i(w'x_i + b) ≥ 1 (rather than ≥ some ε) is a convenient normalization. Since w and b can be scaled arbitrarily, we fix the scale by requiring the margin to be exactly 1/||w||. This transforms the max-margin problem into the canonical form with ||w||² minimization.
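To see the normalization concretely, here is a small sketch with made-up numbers (the data, the starting hyperplane, and all values are hypothetical, chosen only for illustration): take any separating hyperplane, rescale $(\mathbf{w}, b)$ so the closest point has functional margin exactly 1, and the geometric margin becomes $1/\|\mathbf{w}\|$.

```python
import numpy as np

# Hypothetical toy data: two separable points per class (illustration only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1, 1, -1, -1])

# Any separating hyperplane, e.g. w = (1, 1), b = -0.5 (an arbitrary choice).
w = np.array([1.0, 1.0])
b = -0.5

margins = y * (X @ w + b)   # functional margins y_i (w^T x_i + b)
scale = margins.min()       # smallest functional margin

# Rescale so the closest point satisfies y_i (w^T x_i + b) = 1 exactly.
w_canon, b_canon = w / scale, b / scale

print("canonical min margin:", (y * (X @ w_canon + b_canon)).min())  # -> 1.0
print("geometric margin 1/||w||:", 1.0 / np.linalg.norm(w_canon))
```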
Standard Form:
$$\min_{\mathbf{w}, b} \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{s.t.} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1, \; i = 1, \ldots, n$$
The Complete Framework:
| Component | Role |
|---|---|
| Objective $\frac{1}{2}\|\mathbf{w}\|^2$ | Maximize margin (equivalent to minimizing the norm) |
| Constraints $y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1$ | Ensure correct classification with margin |
| Lagrangian | Encode constraints via multipliers $\alpha_i$ |
| Dual | Reveal inner product structure for kernels |
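To make the standard form concrete, here is a minimal sketch that hands the primal directly to a generic convex solver (assuming the `cvxpy` package is available; the toy data is made up for illustration). The dual variables attached to the margin constraints correspond to the $\alpha_i$ that appear throughout this page.

```python
import numpy as np
import cvxpy as cp

# Hypothetical linearly separable toy data (illustration only).
X = np.array([[2, 2], [2, 3], [3, 2], [-1, -1], [-2, -1], [-1, -2]], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)
n, d = X.shape

w = cp.Variable(d)
b = cp.Variable()

# min (1/2)||w||^2  s.t.  y_i (w^T x_i + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w* =", w.value)
print("b* =", b.value)
print("margin width 2/||w|| =", 2 / np.linalg.norm(w.value))
# Dual variables of the margin constraints play the role of the alpha_i's:
print("alpha =", constraints[0].dual_value)
```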
The KKT conditions completely characterize which training points influence the SVM solution.
KKT Conditions for SVM:
Stationarity: $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$
Primal Feasibility: $y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1$
Dual Feasibility: $\alpha_i \geq 0$
Complementary Slackness: $\alpha_i(y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1) = 0$
The Support Vector Dichotomy:
From complementary slackness, for each point $i$, either $\alpha_i = 0$ (the point has no influence on the solution) or $y_i(\mathbf{w}^T\mathbf{x}_i + b) = 1$ (the point lies exactly on the margin and is a support vector).
This sparsity is profound: the entire decision boundary is determined by a small subset of training points. Remove any non-support vector, retrain, and you get the exact same classifier. This makes SVMs robust, interpretable, and memory-efficient for prediction.
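As a quick numerical check of this sparsity (a sketch assuming scikit-learn; the toy data is hypothetical), fit a linear SVM with a very large $C$ to approximate the hard margin and inspect which points receive nonzero multipliers:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data (illustration only).
X = np.array([[2, 2], [2, 3], [3, 2], [-1, -1], [-2, -1], [-1, -2]], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)   # large C approximates the hard margin
clf.fit(X, y)

print("support vector indices:", clf.support_)            # only a few points
print("alpha_i * y_i for those points:", clf.dual_coef_)   # nonzero multipliers
# Complementary slackness: support vectors sit on the margin, i.e. y_i f(x_i) ≈ 1
margins = y[clf.support_] * clf.decision_function(X[clf.support_])
print("y_i f(x_i) on support vectors:", margins)
# All other points have alpha_i = 0 and y_i f(x_i) > 1.
```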
The dual formulation reveals that data enters only through inner products. This observation unlocks nonlinear classification through kernels.
In the SVM Dual:
$$\max_{\boldsymbol{\alpha}} \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \underbrace{\mathbf{x}_i^T \mathbf{x}_j}_{\text{inner product}}$$
In Prediction:
$$f(\mathbf{x}) = \sum_i \alpha_i y_i \underbrace{\mathbf{x}_i^T \mathbf{x}}_{\text{inner product}} + b$$
The Kernel Substitution:
Replace every inner product $\mathbf{x}_i^T \mathbf{x}_j$ with a kernel function $k(\mathbf{x}_i, \mathbf{x}_j)$:
$$\max_{\boldsymbol{\alpha}} \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j k(\mathbf{x}_i, \mathbf{x}_j)$$
$$f(\mathbf{x}) = \sum_i \alpha_i y_i k(\mathbf{x}_i, \mathbf{x}) + b$$
A function k(x, x') is a valid kernel if it can be written as k(x, x') = φ(x)ᵀφ(x') for some feature mapping φ. Equivalently, k is valid if the kernel matrix K with K_ij = k(x_i, x_j) is positive semi-definite for all data. This is Mercer's theorem.
| Kernel | Definition | Feature Space |
|---|---|---|
| Linear | $k(\mathbf{x}, \mathbf{x}') = \mathbf{x}^T\mathbf{x}'$ | Original space |
| Polynomial | $k(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^T\mathbf{x}' + c)^d$ | Polynomial features up to degree $d$ |
| RBF (Gaussian) | $k(\mathbf{x}, \mathbf{x}') = \exp(-\gamma\|\mathbf{x} - \mathbf{x}'\|^2)$ | Infinite-dimensional |
| Sigmoid | $k(\mathbf{x}, \mathbf{x}') = \tanh(\alpha\mathbf{x}^T\mathbf{x}' + c)$ | Neural network-like |
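To make Mercer's condition tangible, the following sketch (the vectors are chosen arbitrarily for illustration) checks that the degree-2 polynomial kernel really is an inner product $\phi(\mathbf{x})^T\phi(\mathbf{x}')$ under an explicit feature map $\phi$:

```python
import numpy as np

def poly2_kernel(x, z, c=1.0):
    """Degree-2 polynomial kernel k(x, z) = (x^T z + c)^2."""
    return (x @ z + c) ** 2

def phi(x, c=1.0):
    """Explicit feature map for the degree-2 polynomial kernel on 2D inputs."""
    x1, x2 = x
    return np.array([c,
                     np.sqrt(2 * c) * x1, np.sqrt(2 * c) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])    # illustrative vectors
z = np.array([3.0, -1.0])

print("kernel value    :", poly2_kernel(x, z))   # (1*3 + 2*(-1) + 1)^2 = 4
print("phi(x)^T phi(z) :", phi(x) @ phi(z))      # same value
```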
Why This Is Magical:
The RBF kernel corresponds to mapping data into an infinite-dimensional feature space. Computing $\phi(\mathbf{x})$ explicitly is impossible, but computing $k(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^T\phi(\mathbf{x}')$ is cheap (just the kernel formula).
We're effectively running a linear classifier in infinite dimensions, but with computational cost proportional to the number of training samples—not the feature dimension.
This is only possible because duality expresses everything through inner products.
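Before turning to SMO, here is a minimal sketch of the brute-force alternative (the toy data, kernel parameter, and choice of solver are assumptions for illustration): hand the kernelized dual to a general-purpose constrained optimizer. This works for tiny problems, but it is exactly the approach that becomes too slow at scale.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical toy data (illustration only).
X = np.array([[2, 2], [2, 3], [3, 2], [-1, -1], [-2, -1], [-1, -2]], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)
n = len(y)

# RBF kernel matrix
gamma = 0.5
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * sq_dists)

Q = (y[:, None] * y[None, :]) * K    # Q_ij = y_i y_j k(x_i, x_j)

# Dual: maximize sum(alpha) - 0.5 alpha^T Q alpha  <=>  minimize the negative
def neg_dual(alpha):
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

C = 10.0
res = minimize(neg_dual, np.zeros(n), method="SLSQP",
               bounds=[(0, C)] * n,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x

print("alpha:", np.round(alpha, 4))
print("sum alpha_i y_i:", alpha @ y)              # ≈ 0
print("support vectors:", np.where(alpha > 1e-6)[0])
```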
The SVM dual is a quadratic program (QP). For large datasets, standard QP solvers are too slow. Sequential Minimal Optimization (SMO) is a breakthrough algorithm designed specifically for SVMs.
The Key Insight:
SMO decomposes the large QP into the smallest possible subproblems. Due to the constraint $\sum_i \alpha_i y_i = 0$, we can't update a single $\alpha_i$ while maintaining feasibility. The minimum useful update involves two variables at a time.
SMO Procedure:
1. Pick a pair of multipliers $(\alpha_i, \alpha_j)$ that violates the KKT conditions (selection heuristics guide this choice).
2. Solve the resulting two-variable subproblem analytically, holding all other multipliers fixed.
3. Repeat until every KKT condition holds within a tolerance.
The constraint Σᵢ αᵢyᵢ = 0 creates a coupling between all αᵢ. If you change one α, you must compensate elsewhere. Two is the minimum: change αᵢ and αⱼ such that αᵢyᵢ + αⱼyⱼ stays constant. This 2-variable subproblem has an elegant closed-form solution.
```python
import numpy as np

def smo_update_pair(alpha, y, K, i, j, C):
    """
    SMO update for two variables alpha[i] and alpha[j].

    The 2-variable subproblem has a beautiful closed-form solution.

    Parameters:
    - alpha: current alpha values
    - y: labels (+1 or -1)
    - K: kernel matrix K[i,j] = k(x_i, x_j)
    - i, j: indices of alphas to update
    - C: regularization bound (for soft-margin; use inf for hard-margin)

    Returns: updated alpha[i], alpha[j]
    """
    # Current decision function values (simplified; b handled separately)
    def f(idx):
        return np.sum(alpha * y * K[idx, :])

    E_i = f(i) - y[i]   # Error on point i
    E_j = f(j) - y[j]   # Error on point j

    # Compute bounds for alpha[j]
    # From constraint: alpha[i]*y[i] + alpha[j]*y[j] = constant
    if y[i] != y[j]:
        L = max(0, alpha[j] - alpha[i])
        H = min(C, C + alpha[j] - alpha[i])
    else:
        L = max(0, alpha[i] + alpha[j] - C)
        H = min(C, alpha[i] + alpha[j])

    if L >= H:
        return alpha[i], alpha[j]   # No update possible

    # Second derivative of the objective w.r.t. alpha[j]
    eta = 2 * K[i, j] - K[i, i] - K[j, j]
    if eta >= 0:
        return alpha[i], alpha[j]   # Unusual case; skip

    # Unconstrained update for alpha[j]
    alpha_j_new = alpha[j] - y[j] * (E_i - E_j) / eta

    # Clip to bounds
    alpha_j_new = np.clip(alpha_j_new, L, H)

    # Update alpha[i] to maintain the constraint
    alpha_i_new = alpha[i] + y[i] * y[j] * (alpha[j] - alpha_j_new)

    return alpha_i_new, alpha_j_new


def simple_smo_demo():
    """
    Demonstrate the SMO update on a toy example.
    """
    print("SMO Update Demo")
    print("=" * 60)

    # Simple dataset
    X = np.array([
        [2, 2], [2, 3], [3, 2],        # Positive class
        [-1, -1], [-2, -1], [-1, -2]   # Negative class
    ])
    y = np.array([1, 1, 1, -1, -1, -1])
    n = len(y)

    # Linear kernel
    K = X @ X.T

    # Initialize alphas
    alpha = np.zeros(n)
    C = float('inf')   # Hard margin

    print("Initial state:")
    print(f"  alpha = {alpha}")
    print(f"  Σαᵢyᵢ = {np.sum(alpha * y)}")

    # Perform a few SMO updates
    pairs_to_update = [(0, 3), (1, 4), (2, 5)]

    for iteration, (i, j) in enumerate(pairs_to_update):
        print(f"\nIteration {iteration + 1}: updating α[{i}] and α[{j}]")
        alpha[i], alpha[j] = smo_update_pair(alpha, y, K, i, j, C)
        print(f"  α[{i}] = {alpha[i]:.4f}, α[{j}] = {alpha[j]:.4f}")
        print(f"  Σαᵢyᵢ = {np.sum(alpha * y):.4f} (should be 0)")

    print(f"\nFinal alpha: {alpha}")
    print(f"Support vectors (α > 0): {np.where(alpha > 1e-6)[0]}")

    return alpha


simple_smo_demo()
```

SMO Efficiency:
SMO made SVMs practical for large-scale problems and remains the basis for many SVM implementations (e.g., LIBSVM).
The dual solution gives us $\boldsymbol{\alpha}^*$, but we also need the bias term $b$ for predictions.
Recovering $b$ from Support Vectors:
For any support vector $i$ with $0 < \alpha_i < C$ (for soft-margin; any $\alpha_i > 0$ for hard-margin):
$$y_i(\mathbf{w}^T\mathbf{x}_i + b) = 1$$
Solving for $b$:
$$b = y_i - \mathbf{w}^T\mathbf{x}_i = y_i - \sum_j \alpha_j y_j \mathbf{x}_j^T\mathbf{x}_i = y_i - \sum_j \alpha_j y_j k(\mathbf{x}_j, \mathbf{x}_i)$$
In practice, average over all support vectors for numerical stability:
$$b = \frac{1}{|S|} \sum_{i \in S} \left( y_i - \sum_j \alpha_j y_j k(\mathbf{x}_j, \mathbf{x}_i) \right)$$
where $S$ is the set of support vector indices.
For soft-margin SVMs, points with α = C are on the wrong side of the margin or misclassified. They don't satisfy the equation y(w'x + b) = 1, so we can't use them to compute b. Only 'free' support vectors (0 < α < C) are exactly on the margin and provide reliable b estimates.
Making Predictions:
For a new point $\mathbf{x}$:
$$f(\mathbf{x}) = \sum_{i \in \text{SV}} \alpha_i y_i k(\mathbf{x}_i, \mathbf{x}) + b$$
$$\hat{y} = \text{sign}(f(\mathbf{x}))$$
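A minimal sketch of these two steps, assuming an `alpha` vector, labels `y`, training data `X`, and the RBF kernel from the earlier sketches (the helper names, `gamma`, and `C` values are illustrative):

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """RBF kernel matrix between rows of A and rows of B (illustrative helper)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def compute_bias(alpha, y, K, C=10.0, tol=1e-6):
    """Average b over 'free' support vectors with 0 < alpha_i < C.

    Assumes at least one free support vector exists.
    """
    free = np.where((alpha > tol) & (alpha < C - tol))[0]
    # b = y_i - sum_j alpha_j y_j k(x_j, x_i) for each free support vector i
    b_vals = y[free] - (alpha * y) @ K[:, free]
    return b_vals.mean()

def predict(X_new, X_train, alpha, y, b, gamma=0.5, tol=1e-6):
    """f(x) = sum_{i in SV} alpha_i y_i k(x_i, x) + b, then take the sign."""
    sv = alpha > tol                                  # only support vectors contribute
    K_new = rbf_kernel(X_train[sv], X_new, gamma)     # shape (n_SV, n_new)
    f = (alpha[sv] * y[sv]) @ K_new + b
    return np.sign(f)

# Example wiring (using alpha, y, X from the dual sketch above):
# K = rbf_kernel(X, X)
# b = compute_bias(alpha, y, K)
# y_hat = predict(X_new, X, alpha, y, b)
```

Averaging over the free support vectors, as in the formula above, smooths out numerical noise that any single constraint might carry.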
Computational Efficiency:

Prediction involves only the support vectors: every term with $\alpha_i = 0$ vanishes from the sum, so the cost per prediction scales with the number of support vectors rather than with the size of the training set.
For real-world data that isn't perfectly separable, soft-margin SVMs allow controlled violations.
The Soft-Margin Primal:
$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i$$
$$\text{s.t.} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0$$
Interpreting the Slack Variables $\xi_i$:
| Value of $\xi_i$ | Interpretation |
|---|---|
| $\xi_i = 0$ | Point is correctly classified with full margin |
| $0 < \xi_i < 1$ | Point is correctly classified but within margin |
| $\xi_i = 1$ | Point is exactly on the decision boundary |
| $\xi_i > 1$ | Point is misclassified |
The Role of $C$:
In regularized risk frameworks, the objective is often written as: Loss + λ·Regularizer. The SVM formulation (1/2)||w||² + C·Σξᵢ is equivalent with λ = 1/(2C). Larger C means less regularization (fitting training data more closely), while smaller C means more regularization (smoother decision boundary).
| α Value | ξ Value | Constraint Status | Location |
|---|---|---|---|
| $\alpha_i = 0$ | $\xi_i = 0$ | Inactive | Outside margin, correct |
| $0 < \alpha_i < C$ | $\xi_i = 0$ | Active, margin | Exactly on margin |
| $\alpha_i = C$ | $\xi_i > 0$ | Active, violated | Within margin or misclassified |
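The table above can be verified numerically. This sketch (assuming scikit-learn; the overlapping toy data is synthetic and illustrative) recovers the slacks as $\xi_i = \max(0, 1 - y_i f(\mathbf{x}_i))$ and checks that points with $\xi_i > 0$ sit at the bound $\alpha_i = C$, while free support vectors have $\xi_i \approx 0$:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Hypothetical overlapping two-class data (illustration only).
X = np.vstack([rng.normal(+1.0, 1.0, size=(30, 2)),
               rng.normal(-1.0, 1.0, size=(30, 2))])
y = np.hstack([np.ones(30), -np.ones(30)])

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

f = clf.decision_function(X)
xi = np.maximum(0, 1 - y * f)                            # slack variables xi_i

alpha = np.zeros(len(y))
alpha[clf.support_] = np.abs(clf.dual_coef_.ravel())     # |y_i alpha_i| = alpha_i

at_bound = np.isclose(alpha, C)
free = (alpha > 1e-6) & ~at_bound

print("points with xi > 0:    ", np.where(xi > 1e-3)[0])
print("points with alpha = C: ", np.where(at_bound)[0])
print("max |xi| over free SVs:", xi[free].max() if free.any() else 0.0)  # ≈ 0
```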
The optimization perspective on SVMs connects directly to generalization bounds from statistical learning theory.
Margin-Based Generalization Bound:
For a linear classifier with margin $\gamma$ on training data:
$$\text{Generalization Error} \leq O\left(\frac{R^2}{\gamma^2 n} + \sqrt{\frac{\log(1/\delta)}{n}}\right)$$
where $R$ is the radius of the smallest ball containing the data and the bound holds with probability $1-\delta$.
Key Insights:
Kernel methods map to potentially infinite-dimensional spaces, yet often don't overfit. The margin-based bound explains this: what matters isn't the dimension of the feature space, but the margin in that space. If data is well-separated after the kernel mapping, generalization can be good regardless of dimensionality.
The Representer Theorem:
A deep result connecting SVMs to regularization:
For any problem of the form $\min_f \sum_i L(y_i, f(\mathbf{x}_i)) + \lambda \|f\|_{\mathcal{H}}^2$ where $\mathcal{H}$ is a reproducing kernel Hilbert space, the solution has the form:
$$f^*(\mathbf{x}) = \sum_i \alpha_i k(\mathbf{x}_i, \mathbf{x})$$
This theorem says the optimal solution is a weighted combination of kernel evaluations at training points—exactly the form produced by the SVM dual. The dual formulation naturally finds the optimal representation in kernel space.
Understanding the optimization theory helps make better practical decisions.
Hyperparameter Selection:
| Hyperparameter | Trade-off | Selection Method |
|---|---|---|
| $C$ | Bias vs. variance | Cross-validation, grid search |
| $\gamma$ (RBF) | Smoothness vs. flexibility | Cross-validation, heuristics (e.g., $1/d$) |
| $d$ (polynomial) | Expressiveness vs. overfitting | Problem-dependent |
Feature Scaling:
SVMs are sensitive to feature scales because the optimization works directly with inner products and distances: the margin $1/\|\mathbf{w}\|$, the inner products in linear and polynomial kernels, and the distances $\|\mathbf{x} - \mathbf{x}'\|^2$ in the RBF kernel all change when individual features are rescaled, so a feature with a large numeric range can dominate the solution.
Always standardize features (zero mean, unit variance) before SVM training.
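A common way to put the scaling advice and the hyperparameter table into practice is a scaler-plus-SVM pipeline tuned by cross-validation (a sketch assuming scikit-learn; the synthetic data and parameter grid are illustrative, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data standing in for a real problem (illustration only).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),   # zero mean, unit variance before the SVM
    ("svm", SVC(kernel="rbf")),
])

param_grid = {
    "svm__C": [0.1, 1, 10, 100],             # regularization trade-off
    "svm__gamma": ["scale", 0.01, 0.1, 1],   # RBF width
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print("best params:", search.best_params_)
print("cross-validated accuracy:", round(search.best_score_, 3))
```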
Computational Complexity:
| Operation | Time Complexity | Notes |
|---|---|---|
| SMO training | O(n² × iterations) | Iterations depend on data difficulty |
| Kernel matrix | O(n²) | Dominates for small n |
| Prediction (per point) | $O(n_{\text{SV}})$ kernel evaluations | Scales with the number of support vectors, not the training set size |
| Storage | $O(n_{\text{SV}})$ | Only the support vectors are needed at prediction time |
When to Use SVMs:
✅ Small to medium datasets (hundreds to tens of thousands of samples)
✅ High-dimensional data (especially with kernels)
✅ Need for interpretability (support vectors)
✅ Binary classification (extensions exist for multiclass)

❌ Very large datasets (neural networks scale better)
❌ When probability estimates are critical (SVM outputs are distances, not probabilities)
We've seen how every concept from constrained optimization manifests in SVMs:
| Concept | SVM Manifestation |
|---|---|
| Lagrange multipliers | α values become support vector weights |
| KKT conditions | Explain support vector sparsity |
| Complementary slackness | Points either contribute (α > 0) or don't (α = 0) |
| Duality | Enables kernel trick, reveals inner product structure |
| Strong duality | Primal and dual optima match (convex problem) |
| Dual feasibility (α ≥ 0) | Combined with box constraints (α ≤ C) for soft-margin |
Module Complete:
You've now mastered constrained optimization for machine learning. From Lagrange multipliers to KKT conditions, from duality theory to practical SVM algorithms, you understand how constraints shape solutions and how the dual perspective reveals hidden structure.
This knowledge is foundational: whenever you encounter regularization (constraint on model complexity), fairness constraints, resource limitations, or any optimization with restrictions, the tools of this module apply. Constrained optimization isn't just a technique—it's a way of thinking about learning systems that must satisfy real-world requirements.
Congratulations! You've completed the Constrained Optimization module. You understand Lagrange multipliers, KKT conditions, duality theory, and their crown application in SVMs. These concepts underpin much of modern machine learning optimization and provide the mathematical maturity to understand advanced topics like semidefinite programming, ADMM, and convex relaxations.