We've developed a powerful framework: Lagrange multipliers for equality constraints, KKT conditions for inequalities, duality theory for alternative formulations. Now we see this machinery in its most celebrated application: Support Vector Machines.
SVMs are not just a classification algorithm—they're a masterclass in how constrained optimization illuminates machine learning. Every concept we've learned manifests in SVMs, from Lagrange multipliers and KKT conditions to duality and the kernel trick.
By the end of this page, you will understand how support vector machines work from the optimization perspective, why the kernel trick is possible, how the SMO algorithm solves the dual, and the deeper insights that constrained optimization provides about SVM behavior and generalization.
Let's consolidate the complete SVM formulation with our optimization lens.
The Maximum Margin Principle:
Given linearly separable training data $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$ with $y_i \in \{-1, +1\}$, we seek the hyperplane that separates the two classes with the largest possible margin.
Why Maximize Margin?
Margin maximization isn't arbitrary; it has deep theoretical justification, which we make precise in the generalization bounds later on this page.
The margin for correctly classified points is $\frac{y_i(\mathbf{w}^T\mathbf{x}_i + b)}{\|\mathbf{w}\|}$.
The minimum margin over all points is maximized when we solve:
$$\max_{\mathbf{w}, b} \frac{1}{\|\mathbf{w}\|} \quad \text{s.t.} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1$$
The constraint y_i(w'x_i + b) ≥ 1 (rather than ≥ some ε) is a convenient normalization. Since w and b can be scaled arbitrarily, we fix the scale by requiring the margin to be exactly 1/||w||. This transforms the max-margin problem into the canonical form with ||w||² minimization.
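To see the normalization concretely, here is a small sketch with made-up numbers (the data, the starting hyperplane, and all values are hypothetical, chosen only for illustration): take any separating hyperplane, rescale $(\mathbf{w}, b)$ so the closest point has functional margin exactly 1, and the geometric margin becomes $1/\|\mathbf{w}\|$.

```python
import numpy as np

# Hypothetical toy data: two separable points per class (illustration only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1, 1, -1, -1])

# Any separating hyperplane, e.g. w = (1, 1), b = -0.5 (an arbitrary choice).
w = np.array([1.0, 1.0])
b = -0.5

margins = y * (X @ w + b)   # functional margins y_i (w^T x_i + b)
scale = margins.min()       # smallest functional margin

# Rescale so the closest point satisfies y_i (w^T x_i + b) = 1 exactly.
w_canon, b_canon = w / scale, b / scale

print("canonical min margin:", (y * (X @ w_canon + b_canon)).min())  # -> 1.0
print("geometric margin 1/||w||:", 1.0 / np.linalg.norm(w_canon))
```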
Standard Form:
$$\min_{\mathbf{w}, b} \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{s.t.} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1, \; i = 1, \ldots, n$$
The Complete Framework:
| Component | Role |
|---|---|
| Objective $\frac{1}{2}\|\mathbf{w}\|^2$ | Maximize margin (equivalent to minimizing the norm) |
| Constraints $y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1$ | Ensure correct classification with margin |
| Lagrangian | Encode constraints via multipliers $\alpha_i$ |
| Dual | Reveal inner product structure for kernels |
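To make the standard form concrete, here is a minimal sketch that hands the primal directly to a generic convex solver (assuming the `cvxpy` package is available; the toy data is made up for illustration). The dual variables attached to the margin constraints correspond to the $\alpha_i$ that appear throughout this page.

```python
import numpy as np
import cvxpy as cp

# Hypothetical linearly separable toy data (illustration only).
X = np.array([[2, 2], [2, 3], [3, 2], [-1, -1], [-2, -1], [-1, -2]], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)
n, d = X.shape

w = cp.Variable(d)
b = cp.Variable()

# min (1/2)||w||^2  s.t.  y_i (w^T x_i + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w* =", w.value)
print("b* =", b.value)
print("margin width 2/||w|| =", 2 / np.linalg.norm(w.value))
# Dual variables of the margin constraints play the role of the alpha_i's:
print("alpha =", constraints[0].dual_value)
```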
The KKT conditions completely characterize which training points influence the SVM solution.
KKT Conditions for SVM:
Stationarity: $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$
Primal Feasibility: $y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1$
Dual Feasibility: $\alpha_i \geq 0$
Complementary Slackness: $\alpha_i(y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1) = 0$
The Support Vector Dichotomy:
From complementary slackness, for each point $i$, either $\alpha_i = 0$ (the point has no influence on the solution) or $y_i(\mathbf{w}^T\mathbf{x}_i + b) = 1$ (the point lies exactly on the margin and is a support vector).
This sparsity is profound: the entire decision boundary is determined by a small subset of training points. Remove any non-support vector, retrain, and you get the exact same classifier. This makes SVMs robust, interpretable, and memory-efficient for prediction.
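As a quick numerical check of this sparsity (a sketch assuming scikit-learn; the toy data is hypothetical), fit a linear SVM with a very large $C$ to approximate the hard margin and inspect which points receive nonzero multipliers:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data (illustration only).
X = np.array([[2, 2], [2, 3], [3, 2], [-1, -1], [-2, -1], [-1, -2]], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)   # large C approximates the hard margin
clf.fit(X, y)

print("support vector indices:", clf.support_)            # only a few points
print("alpha_i * y_i for those points:", clf.dual_coef_)   # nonzero multipliers
# Complementary slackness: support vectors sit on the margin, i.e. y_i f(x_i) ≈ 1
margins = y[clf.support_] * clf.decision_function(X[clf.support_])
print("y_i f(x_i) on support vectors:", margins)
# All other points have alpha_i = 0 and y_i f(x_i) > 1.
```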
The dual formulation reveals that data enters only through inner products. This observation unlocks nonlinear classification through kernels.
In the SVM Dual:
$$\max_{\boldsymbol{\alpha}} \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \underbrace{\mathbf{x}_i^T \mathbf{x}_j}_{\text{inner product}}$$
In Prediction:
$$f(\mathbf{x}) = \sum_i \alpha_i y_i \underbrace{\mathbf{x}_i^T \mathbf{x}}_{\text{inner product}} + b$$
The Kernel Substitution:
Replace every inner product $\mathbf{x}_i^T \mathbf{x}_j$ with a kernel function $k(\mathbf{x}_i, \mathbf{x}_j)$:
$$\max_{\boldsymbol{\alpha}} \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j k(\mathbf{x}_i, \mathbf{x}_j)$$
$$f(\mathbf{x}) = \sum_i \alpha_i y_i k(\mathbf{x}_i, \mathbf{x}) + b$$
A function k(x, x') is a valid kernel if it can be written as k(x, x') = φ(x)ᵀφ(x') for some feature mapping φ. Equivalently, k is valid if the kernel matrix K with K_ij = k(x_i, x_j) is positive semi-definite for all data. This is Mercer's theorem.
| Kernel | Definition | Feature Space |
|---|---|---|
| Linear | $k(\mathbf{x}, \mathbf{x}') = \mathbf{x}^T\mathbf{x}'$ | Original space |
| Polynomial | $k(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^T\mathbf{x}' + c)^d$ | Polynomial features up to degree $d$ |
| RBF (Gaussian) | $k(\mathbf{x}, \mathbf{x}') = \exp(-\gamma\|\mathbf{x} - \mathbf{x}'\|^2)$ | Infinite-dimensional |
| Sigmoid | $k(\mathbf{x}, \mathbf{x}') = \tanh(\alpha\mathbf{x}^T\mathbf{x}' + c)$ | Neural network-like |
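To make Mercer's condition tangible, the following sketch (the vectors are chosen arbitrarily for illustration) checks that the degree-2 polynomial kernel really is an inner product $\phi(\mathbf{x})^T\phi(\mathbf{x}')$ under an explicit feature map $\phi$:

```python
import numpy as np

def poly2_kernel(x, z, c=1.0):
    """Degree-2 polynomial kernel k(x, z) = (x^T z + c)^2."""
    return (x @ z + c) ** 2

def phi(x, c=1.0):
    """Explicit feature map for the degree-2 polynomial kernel on 2D inputs."""
    x1, x2 = x
    return np.array([c,
                     np.sqrt(2 * c) * x1, np.sqrt(2 * c) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])    # illustrative vectors
z = np.array([3.0, -1.0])

print("kernel value    :", poly2_kernel(x, z))   # (1*3 + 2*(-1) + 1)^2 = 4
print("phi(x)^T phi(z) :", phi(x) @ phi(z))      # same value
```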
Why This Is Magical:
The RBF kernel corresponds to mapping data into an infinite-dimensional feature space. Computing $\phi(\mathbf{x})$ explicitly is impossible, but computing $k(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^T\phi(\mathbf{x}')$ is cheap (just the kernel formula).
We're effectively running a linear classifier in infinite dimensions, but with computational cost proportional to the number of training samples—not the feature dimension.
This is only possible because duality expresses everything through inner products.
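Before turning to SMO, here is a minimal sketch of the brute-force alternative (the toy data, kernel parameter, and choice of solver are assumptions for illustration): hand the kernelized dual to a general-purpose constrained optimizer. This works for tiny problems, but it is exactly the approach that becomes too slow at scale.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical toy data (illustration only).
X = np.array([[2, 2], [2, 3], [3, 2], [-1, -1], [-2, -1], [-1, -2]], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)
n = len(y)

# RBF kernel matrix
gamma = 0.5
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * sq_dists)

Q = (y[:, None] * y[None, :]) * K    # Q_ij = y_i y_j k(x_i, x_j)

# Dual: maximize sum(alpha) - 0.5 alpha^T Q alpha  <=>  minimize the negative
def neg_dual(alpha):
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

C = 10.0
res = minimize(neg_dual, np.zeros(n), method="SLSQP",
               bounds=[(0, C)] * n,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x

print("alpha:", np.round(alpha, 4))
print("sum alpha_i y_i:", alpha @ y)              # ≈ 0
print("support vectors:", np.where(alpha > 1e-6)[0])
```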
The SVM dual is a quadratic program (QP). For large datasets, standard QP solvers are too slow. Sequential Minimal Optimization (SMO) is a breakthrough algorithm designed specifically for SVMs.
The Key Insight:
SMO decomposes the large QP into the smallest possible subproblems. Due to the constraint $\sum_i \alpha_i y_i = 0$, we can't update a single $\alpha_i$ while maintaining feasibility. The minimum useful update involves two variables at a time.
SMO Procedure:
1. Pick a pair of multipliers $(\alpha_i, \alpha_j)$ that violates the KKT conditions (selection heuristics guide this choice).
2. Solve the resulting two-variable subproblem analytically, holding all other multipliers fixed.
3. Repeat until every KKT condition holds within a tolerance.
The constraint Σᵢ αᵢyᵢ = 0 creates a coupling between all αᵢ. If you change one α, you must compensate elsewhere. Two is the minimum: change αᵢ and αⱼ such that αᵢyᵢ + αⱼyⱼ stays constant. This 2-variable subproblem has an elegant closed-form solution.
```python
import numpy as np

def smo_update_pair(alpha, y, K, i, j, C):
    """
    SMO update for two variables alpha[i] and alpha[j].

    The 2-variable subproblem has a beautiful closed-form solution.

    Parameters:
    - alpha: current alpha values
    - y: labels (+1 or -1)
    - K: kernel matrix K[i,j] = k(x_i, x_j)
    - i, j: indices of alphas to update
    - C: regularization bound (for soft-margin; use inf for hard-margin)

    Returns: updated alpha[i], alpha[j]
    """
    # Current decision function values (simplified; b handled separately)
    def f(idx):
        return np.sum(alpha * y * K[idx, :])

    E_i = f(i) - y[i]   # Error on point i
    E_j = f(j) - y[j]   # Error on point j

    # Compute bounds for alpha[j]
    # From constraint: alpha[i]*y[i] + alpha[j]*y[j] = constant
    if y[i] != y[j]:
        L = max(0, alpha[j] - alpha[i])
        H = min(C, C + alpha[j] - alpha[i])
    else:
        L = max(0, alpha[i] + alpha[j] - C)
        H = min(C, alpha[i] + alpha[j])

    if L >= H:
        return alpha[i], alpha[j]   # No update possible

    # Second derivative of the objective w.r.t. alpha[j]
    eta = 2 * K[i, j] - K[i, i] - K[j, j]
    if eta >= 0:
        return alpha[i], alpha[j]   # Unusual case; skip

    # Unconstrained update for alpha[j]
    alpha_j_new = alpha[j] - y[j] * (E_i - E_j) / eta

    # Clip to bounds
    alpha_j_new = np.clip(alpha_j_new, L, H)

    # Update alpha[i] to maintain the constraint
    alpha_i_new = alpha[i] + y[i] * y[j] * (alpha[j] - alpha_j_new)

    return alpha_i_new, alpha_j_new


def simple_smo_demo():
    """
    Demonstrate the SMO update on a toy example.
    """
    print("SMO Update Demo")
    print("=" * 60)

    # Simple dataset
    X = np.array([
        [2, 2], [2, 3], [3, 2],        # Positive class
        [-1, -1], [-2, -1], [-1, -2]   # Negative class
    ])
    y = np.array([1, 1, 1, -1, -1, -1])
    n = len(y)

    # Linear kernel
    K = X @ X.T

    # Initialize alphas
    alpha = np.zeros(n)
    C = float('inf')   # Hard margin

    print("Initial state:")
    print(f"  alpha = {alpha}")
    print(f"  Σαᵢyᵢ = {np.sum(alpha * y)}")

    # Perform a few SMO updates
    pairs_to_update = [(0, 3), (1, 4), (2, 5)]

    for iteration, (i, j) in enumerate(pairs_to_update):
        print(f"\nIteration {iteration + 1}: updating α[{i}] and α[{j}]")
        alpha[i], alpha[j] = smo_update_pair(alpha, y, K, i, j, C)
        print(f"  α[{i}] = {alpha[i]:.4f}, α[{j}] = {alpha[j]:.4f}")
        print(f"  Σαᵢyᵢ = {np.sum(alpha * y):.4f} (should be 0)")

    print(f"\nFinal alpha: {alpha}")
    print(f"Support vectors (α > 0): {np.where(alpha > 1e-6)[0]}")

    return alpha


simple_smo_demo()
```

SMO Efficiency:
SMO made SVMs practical for large-scale problems and remains the basis for many SVM implementations (e.g., LIBSVM).
The dual solution gives us $\boldsymbol{\alpha}^*$, but we also need the bias term $b$ for predictions.
Recovering $b$ from Support Vectors:
For any support vector $i$ with $0 < \alpha_i < C$ (for soft-margin; any $\alpha_i > 0$ for hard-margin):
$$y_i(\mathbf{w}^T\mathbf{x}_i + b) = 1$$
Solving for $b$:
$$b = y_i - \mathbf{w}^T\mathbf{x}_i = y_i - \sum_j \alpha_j y_j \mathbf{x}_j^T\mathbf{x}_i = y_i - \sum_j \alpha_j y_j k(\mathbf{x}_j, \mathbf{x}_i)$$
In practice, average over all support vectors for numerical stability:
$$b = \frac{1}{|S|} \sum_{i \in S} \left( y_i - \sum_j \alpha_j y_j k(\mathbf{x}_j, \mathbf{x}_i) \right)$$
where $S$ is the set of support vector indices.
For soft-margin SVMs, points with α = C are on the wrong side of the margin or misclassified. They don't satisfy the equation y(w'x + b) = 1, so we can't use them to compute b. Only 'free' support vectors (0 < α < C) are exactly on the margin and provide reliable b estimates.
Making Predictions:
For a new point $\mathbf{x}$:
$$f(\mathbf{x}) = \sum_{i \in \text{SV}} \alpha_i y_i k(\mathbf{x}_i, \mathbf{x}) + b$$
$$\hat{y} = \text{sign}(f(\mathbf{x}))$$
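A minimal sketch of these two steps, assuming an `alpha` vector, labels `y`, training data `X`, and the RBF kernel from the earlier sketches (the helper names, `gamma`, and `C` values are illustrative):

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """RBF kernel matrix between rows of A and rows of B (illustrative helper)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def compute_bias(alpha, y, K, C=10.0, tol=1e-6):
    """Average b over 'free' support vectors with 0 < alpha_i < C.

    Assumes at least one free support vector exists.
    """
    free = np.where((alpha > tol) & (alpha < C - tol))[0]
    # b = y_i - sum_j alpha_j y_j k(x_j, x_i) for each free support vector i
    b_vals = y[free] - (alpha * y) @ K[:, free]
    return b_vals.mean()

def predict(X_new, X_train, alpha, y, b, gamma=0.5, tol=1e-6):
    """f(x) = sum_{i in SV} alpha_i y_i k(x_i, x) + b, then take the sign."""
    sv = alpha > tol                                  # only support vectors contribute
    K_new = rbf_kernel(X_train[sv], X_new, gamma)     # shape (n_SV, n_new)
    f = (alpha[sv] * y[sv]) @ K_new + b
    return np.sign(f)

# Example wiring (using alpha, y, X from the dual sketch above):
# K = rbf_kernel(X, X)
# b = compute_bias(alpha, y, K)
# y_hat = predict(X_new, X, alpha, y, b)
```

Averaging over the free support vectors, as in the formula above, smooths out numerical noise that any single constraint might carry.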
Computational Efficiency:

Prediction involves only the support vectors: every term with $\alpha_i = 0$ vanishes from the sum, so the cost per prediction scales with the number of support vectors rather than with the size of the training set.
For real-world data that isn't perfectly separable, soft-margin SVMs allow controlled violations.
The Soft-Margin Primal:
$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i$$
$$\text{s.t.} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0$$
Interpreting the Slack Variables $\xi_i$:
| Value of $\xi_i$ | Interpretation |
|---|---|
| $\xi_i = 0$ | Point is correctly classified with full margin |
| $0 < \xi_i < 1$ | Point is correctly classified but within margin |
| $\xi_i = 1$ | Point is exactly on the decision boundary |
| $\xi_i > 1$ | Point is misclassified |
The Role of $C$:
In regularized risk frameworks, the objective is often written as: Loss + λ·Regularizer. The SVM formulation (1/2)||w||² + C·Σξᵢ is equivalent with λ = 1/(2C). Larger C means less regularization (fitting training data more closely), while smaller C means more regularization (smoother decision boundary).
| α Value | ξ Value | Constraint Status | Location |
|---|---|---|---|
| $\alpha_i = 0$ | $\xi_i = 0$ | Inactive | Outside margin, correct |
| $0 < \alpha_i < C$ | $\xi_i = 0$ | Active, margin | Exactly on margin |
| $\alpha_i = C$ | $\xi_i > 0$ | Active, violated | Within margin or misclassified |
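The table above can be verified numerically. This sketch (assuming scikit-learn; the overlapping toy data is synthetic and illustrative) recovers the slacks as $\xi_i = \max(0, 1 - y_i f(\mathbf{x}_i))$ and checks that points with $\xi_i > 0$ sit at the bound $\alpha_i = C$, while free support vectors have $\xi_i \approx 0$:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Hypothetical overlapping two-class data (illustration only).
X = np.vstack([rng.normal(+1.0, 1.0, size=(30, 2)),
               rng.normal(-1.0, 1.0, size=(30, 2))])
y = np.hstack([np.ones(30), -np.ones(30)])

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

f = clf.decision_function(X)
xi = np.maximum(0, 1 - y * f)                            # slack variables xi_i

alpha = np.zeros(len(y))
alpha[clf.support_] = np.abs(clf.dual_coef_.ravel())     # |y_i alpha_i| = alpha_i

at_bound = np.isclose(alpha, C)
free = (alpha > 1e-6) & ~at_bound

print("points with xi > 0:    ", np.where(xi > 1e-3)[0])
print("points with alpha = C: ", np.where(at_bound)[0])
print("max |xi| over free SVs:", xi[free].max() if free.any() else 0.0)  # ≈ 0
```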
The optimization perspective on SVMs connects directly to generalization bounds from statistical learning theory.
Margin-Based Generalization Bound:
For a linear classifier with margin $\gamma$ on training data:
$$\text{Generalization Error} \leq O\left(\frac{R^2}{\gamma^2 n} + \sqrt{\frac{\log(1/\delta)}{n}}\right)$$
where $R$ is the radius of the smallest ball containing the data and the bound holds with probability $1-\delta$.
Key Insights:
Kernel methods map to potentially infinite-dimensional spaces, yet often don't overfit. The margin-based bound explains this: what matters isn't the dimension of the feature space, but the margin in that space. If data is well-separated after the kernel mapping, generalization can be good regardless of dimensionality.
The Representer Theorem:
A deep result connecting SVMs to regularization:
For any problem of the form $\min_f \sum_i L(y_i, f(\mathbf{x}_i)) + \lambda \|f\|_{\mathcal{H}}^2$ where $\mathcal{H}$ is a reproducing kernel Hilbert space, the solution has the form:
$$f^*(\mathbf{x}) = \sum_i \alpha_i k(\mathbf{x}_i, \mathbf{x})$$
This theorem says the optimal solution is a weighted combination of kernel evaluations at training points—exactly the form produced by the SVM dual. The dual formulation naturally finds the optimal representation in kernel space.
Understanding the optimization theory helps make better practical decisions.
Hyperparameter Selection:
| Hyperparameter | Trade-off | Selection Method |
|---|---|---|
| $C$ | Bias vs. variance | Cross-validation, grid search |
| $\gamma$ (RBF) | Smoothness vs. flexibility | Cross-validation, heuristics (e.g., $1/d$) |
| $d$ (polynomial) | Expressiveness vs. overfitting | Problem-dependent |
Feature Scaling:
SVMs are sensitive to feature scales because the optimization works directly with inner products and distances: the margin $1/\|\mathbf{w}\|$, the inner products in linear and polynomial kernels, and the distances $\|\mathbf{x} - \mathbf{x}'\|^2$ in the RBF kernel all change when individual features are rescaled, so a feature with a large numeric range can dominate the solution.
Always standardize features (zero mean, unit variance) before SVM training.
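A common way to put the scaling advice and the hyperparameter table into practice is a scaler-plus-SVM pipeline tuned by cross-validation (a sketch assuming scikit-learn; the synthetic data and parameter grid are illustrative, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data standing in for a real problem (illustration only).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),   # zero mean, unit variance before the SVM
    ("svm", SVC(kernel="rbf")),
])

param_grid = {
    "svm__C": [0.1, 1, 10, 100],             # regularization trade-off
    "svm__gamma": ["scale", 0.01, 0.1, 1],   # RBF width
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print("best params:", search.best_params_)
print("cross-validated accuracy:", round(search.best_score_, 3))
```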
Computational Complexity:
| Operation | Time Complexity | Notes |
|---|---|---|
| SMO training | O(n² × iterations) | Iterations depend on data difficulty |
| Kernel matrix | O(n²) | Dominates for small n |
| Prediction (per point) | $O(n_{\text{SV}})$ kernel evaluations | Scales with the number of support vectors, not the training set size |
| Storage | $O(n_{\text{SV}})$ | Only the support vectors are needed at prediction time |
When to Use SVMs:
✅ Small to medium datasets (hundreds to tens of thousands of samples)
✅ High-dimensional data (especially with kernels)
✅ Need for interpretability (support vectors)
✅ Binary classification (extensions exist for multiclass)

❌ Very large datasets (neural networks scale better)
❌ When probability estimates are critical (SVM outputs are distances, not probabilities)
We've seen how every concept from constrained optimization manifests in SVMs:
| Concept | SVM Manifestation |
|---|---|
| Lagrange multipliers | α values become support vector weights |
| KKT conditions | Explain support vector sparsity |
| Complementary slackness | Points either contribute (α > 0) or don't (α = 0) |
| Duality | Enables kernel trick, reveals inner product structure |
| Strong duality | Primal and dual optima match (convex problem) |
| Dual feasibility (α ≥ 0) | Combined with box constraints (α ≤ C) for soft-margin |
Module Complete:
You've now mastered constrained optimization for machine learning. From Lagrange multipliers to KKT conditions, from duality theory to practical SVM algorithms, you understand how constraints shape solutions and how the dual perspective reveals hidden structure.
This knowledge is foundational: whenever you encounter regularization (constraint on model complexity), fairness constraints, resource limitations, or any optimization with restrictions, the tools of this module apply. Constrained optimization isn't just a technique—it's a way of thinking about learning systems that must satisfy real-world requirements.
Congratulations! You've completed the Constrained Optimization module. You understand Lagrange multipliers, KKT conditions, duality theory, and their crown application in SVMs. These concepts underpin much of modern machine learning optimization and provide the mathematical maturity to understand advanced topics like semidefinite programming, ADMM, and convex relaxations.