In the mid-1990s, Support Vector Machines represented one of the most elegant theoretical frameworks in machine learning—yet remained largely impractical. The barrier wasn't understanding the theory; it was solving the optimization problem. With n training examples, standard quadratic programming (QP) solvers required O(n³) time and O(n²) memory, making even moderately sized datasets computationally intractable.
Then in 1998, John C. Platt at Microsoft Research published a paper that would transform SVMs from theoretical curiosities into production workhorses: Sequential Minimal Optimization (SMO). This algorithm reduced memory requirements to O(n) and made SVM training feasible on datasets that were orders of magnitude larger than previously possible.
SMO didn't just solve a computational problem—it enabled an entire generation of practical applications, from face detection to text classification to bioinformatics. Understanding SMO is essential for anyone who wants to truly master SVMs beyond the textbook formulas.
By the end of this page, you will understand: (1) why the SVM dual problem is challenging to solve directly, (2) the key insight that makes SMO work—breaking a massive QP into tiny analytical steps, (3) the complete SMO algorithm with all its components, (4) the mathematical derivations for the two-variable subproblem, and (5) how SMO handles the critical details that determine practical performance.
Before we can appreciate SMO's elegance, we must understand the optimization problem it solves. Recall the SVM dual formulation:
Maximize: $$W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$
Subject to: $$0 \leq \alpha_i \leq C \quad \forall i$$ $$\sum_{i=1}^{n} \alpha_i y_i = 0$$
where:
- $\alpha_i$ are the Lagrange multipliers (one per training example),
- $y_i \in \{-1, +1\}$ are the class labels,
- $K(x_i, x_j)$ is the kernel function, and
- $C$ is the regularization parameter bounding each $\alpha_i$.
This is a convex quadratic programming (QP) problem with $n$ variables, $n$ box constraints, and one linear equality constraint.
Because the objective function is concave (we're maximizing) and the constraints are convex, every local optimum is global, so there are no spurious local optima to worry about; the maximizer is unique whenever the kernel matrix is positive definite. The challenge is purely computational: how do we find this optimum efficiently?
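To make this concrete, here is a small NumPy sketch (helper names are illustrative) that evaluates $W(\alpha)$ and checks feasibility of a candidate $\alpha$:

```python
import numpy as np

def dual_objective(alpha, y, K):
    """W(alpha) = sum_i alpha_i - 0.5 * sum_ij alpha_i alpha_j y_i y_j K_ij."""
    Q = np.outer(y, y) * K              # Hessian of the quadratic term
    return alpha.sum() - 0.5 * alpha @ Q @ alpha

def is_feasible(alpha, y, C, tol=1e-9):
    """Check the box constraints 0 <= alpha_i <= C and sum_i alpha_i y_i = 0."""
    in_box = np.all(alpha >= -tol) and np.all(alpha <= C + tol)
    return in_box and abs(alpha @ y) < tol
```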
The naive approach uses off-the-shelf QP solvers, but this encounters several fundamental obstacles:
1. Memory Scaling: The kernel matrix $K$ has $n^2$ entries. For $n = 100,000$ training examples with 64-bit floats, storing this matrix requires: $$100,000^2 \times 8 \text{ bytes} = 80 \text{ GB}$$
2. Time Complexity: General QP solvers typically have O(n³) time complexity. Doubling your dataset increases training time by 8×.
3. Matrix Inversion: Many QP methods require inverting or factorizing the kernel matrix—operations that compound the memory and time problems.
4. The Equality Constraint: The constraint $\sum \alpha_i y_i = 0$ couples all variables together, preventing simple coordinate-wise optimization.
| Training Examples (n) | Kernel Matrix Memory | Typical QP Time | Practical Feasibility |
|---|---|---|---|
| 1,000 | 8 MB | ~1 second | Easily feasible |
| 10,000 | 800 MB | ~10 minutes | Feasible with patience |
| 100,000 | 80 GB | ~1 week | Requires specialized hardware |
| 1,000,000 | 8 TB | ~20 years | Completely impractical |
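The memory column follows directly from the $n^2 \times 8$ bytes arithmetic; a quick sanity check:

```python
for n in (1_000, 10_000, 100_000, 1_000_000):
    gb = n**2 * 8 / 1e9   # n^2 kernel entries at 8 bytes each
    print(f"n = {n:>9,}: kernel matrix needs {gb:>10,.3f} GB")
```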
This scaling behavior meant that for most practical applications in the late 1990s, SVMs were limited to datasets with at most a few thousand examples. Larger problems required either subsampling (losing information) or approximate methods (losing optimality guarantees).
What the field needed was an algorithm that:
1. uses memory that scales linearly with $n$ (no stored kernel matrix),
2. avoids matrix inversion and factorization entirely,
3. handles the coupling equality constraint $\sum \alpha_i y_i = 0$, and
4. scales in time well beyond datasets of a few thousand examples.
SMO would deliver on all four requirements.
The breakthrough insight of SMO is deceptively simple: instead of solving the full n-variable QP, repeatedly solve tiny 2-variable subproblems.
Why exactly two variables? This is the minimum number that can be optimized while respecting the equality constraint $\sum \alpha_i y_i = 0$.
Suppose we try to optimize just one variable $\alpha_1$ while holding all others fixed. The equality constraint says: $$\alpha_1 y_1 + \sum_{i=2}^{n} \alpha_i y_i = 0$$ $$\alpha_1 = -y_1 \sum_{i=2}^{n} \alpha_i y_i$$
Since the right side is fixed, $\alpha_1$ is completely determined—we have zero degrees of freedom! Single-variable optimization is impossible.
With two variables $\alpha_1$ and $\alpha_2$, we have: $$\alpha_1 y_1 + \alpha_2 y_2 = -\sum_{i=3}^{n} \alpha_i y_i = \text{constant}$$
Let's call this constant $\zeta$: $$\alpha_1 y_1 + \alpha_2 y_2 = \zeta$$
This means we can express $\alpha_1$ in terms of $\alpha_2$: $$\alpha_1 = y_1(\zeta - \alpha_2 y_2)$$
Now we effectively have one degree of freedom (choosing $\alpha_2$), and the problem becomes a one-dimensional constrained optimization—which we can solve analytically, without any iterative numerical methods!
SMO transforms an intractable n-variable QP into a sequence of trivial 1D optimizations. Each step is analytically solvable in O(1) time (excluding kernel evaluations). The algorithm's iterations handle the 'coordination' between variables that makes the global optimum emerge.
Visualizing the two-variable subproblem helps build intuition. In the $(\alpha_1, \alpha_2)$ plane:
Box Constraints: Each variable must satisfy $0 \leq \alpha_i \leq C$, forming a square $[0, C] \times [0, C]$.
Equality Constraint: The constraint $\alpha_1 y_1 + \alpha_2 y_2 = \zeta$ defines a line through this square. The slope of this line depends on $y_1$ and $y_2$: when $y_1 = y_2$, the line is $\alpha_1 + \alpha_2 = y_1\zeta$ (slope $-1$); when $y_1 \neq y_2$, it is $\alpha_1 - \alpha_2 = y_1\zeta$ (slope $+1$).
Feasible Region: The intersection of the box and the line—a line segment within the square.
Objective: A quadratic function that we want to maximize along this line segment.
The solution is simply: find the unconstrained maximum along the line, then clip it to the feasible segment if necessary.
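As a minimal numeric sketch of this maximize-then-clip step, assume the objective along the line is concave with curvature $\eta > 0$ and gradient `grad` at the current point (in SMO, `grad` $= y_2(E_1 - E_2)$; the helper name is illustrative):

```python
import numpy as np

def clipped_step(alpha2_old, grad, eta, L, H):
    """Move to the unconstrained maximizer of the 1D concave quadratic,
    then clip the result to the feasible segment [L, H]."""
    return float(np.clip(alpha2_old + grad / eta, L, H))

# The unconstrained maximum (0.5 + 0.8/1.0 = 1.3) lies beyond H, so we clip:
print(clipped_step(0.5, grad=0.8, eta=1.0, L=0.0, H=1.0))  # -> 1.0
```

The code below visualizes the same geometry: the box, the constraint line, and the feasible segment $[L, H]$.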
```python
import numpy as np
import matplotlib.pyplot as plt

def visualize_smo_subproblem(C, y1, y2, zeta, objective_fn,
                             title="SMO Two-Variable Subproblem"):
    """
    Visualize the geometry of an SMO two-variable subproblem.

    The feasible region is the intersection of:
    1. Box constraints: [0, C] x [0, C]
    2. Linear equality: α₁y₁ + α₂y₂ = ζ
    """
    fig, ax = plt.subplots(1, 1, figsize=(8, 8))

    # Draw the box boundary
    box = plt.Rectangle((0, 0), C, C, fill=False, edgecolor='blue', linewidth=2)
    ax.add_patch(box)

    # Calculate the line α₁y₁ + α₂y₂ = ζ
    # α₁ = (ζ - α₂y₂) * y₁ (since y₁² = 1)
    alpha2_range = np.linspace(-0.5*C, 1.5*C, 100)
    alpha1_values = y1 * (zeta - alpha2_range * y2)

    # Clip to plotting range
    valid_mask = (alpha1_values >= -0.5*C) & (alpha1_values <= 1.5*C)
    ax.plot(alpha2_range[valid_mask], alpha1_values[valid_mask],
            'r-', linewidth=2, label=f'α₁y₁ + α₂y₂ = {zeta}')

    # Find the feasible segment (intersection with box)
    # We need α₂ such that both 0 ≤ α₁ ≤ C and 0 ≤ α₂ ≤ C
    alpha2_candidates = []

    # From α₁ = 0: ζ - α₂y₂ = 0 → α₂ = ζ/y₂ if valid
    if abs(y2) > 0:
        a2_at_a1_zero = zeta / y2
        if 0 <= a2_at_a1_zero <= C:
            alpha2_candidates.append(a2_at_a1_zero)

    # From α₁ = C: (ζ - α₂y₂) * y₁ = C → α₂ = (ζ - C*y₁)/y₂
    if abs(y2) > 0:
        a2_at_a1_C = (zeta - C * y1) / y2
        if 0 <= a2_at_a1_C <= C:
            alpha2_candidates.append(a2_at_a1_C)

    # From α₂ = 0: α₁ = ζ * y₁, check if in [0, C]
    a1_at_a2_zero = zeta * y1
    if 0 <= a1_at_a2_zero <= C:
        alpha2_candidates.append(0)

    # From α₂ = C: α₁ = (ζ - C*y₂) * y₁, check if in [0, C]
    a1_at_a2_C = (zeta - C * y2) * y1
    if 0 <= a1_at_a2_C <= C:
        alpha2_candidates.append(C)

    if len(alpha2_candidates) >= 2:
        L = min(alpha2_candidates)
        H = max(alpha2_candidates)

        # Highlight the feasible segment
        a2_seg = np.linspace(L, H, 50)
        a1_seg = y1 * (zeta - a2_seg * y2)
        ax.plot(a2_seg, a1_seg, 'g-', linewidth=4, label='Feasible segment [L, H]')

        # Mark endpoints
        ax.scatter([L, H], [y1*(zeta - L*y2), y1*(zeta - H*y2)],
                   color='green', s=100, zorder=5)
        ax.annotate(f'L={L:.2f}', (L, y1*(zeta - L*y2)), fontsize=10)
        ax.annotate(f'H={H:.2f}', (H, y1*(zeta - H*y2)), fontsize=10)

    ax.set_xlim(-0.2*C, 1.3*C)
    ax.set_ylim(-0.2*C, 1.3*C)
    ax.set_xlabel('α₂', fontsize=12)
    ax.set_ylabel('α₁', fontsize=12)
    ax.set_title(title, fontsize=14)
    ax.legend()
    ax.grid(True, alpha=0.3)
    ax.set_aspect('equal')
    return fig

# Example: visualize for different y₁, y₂ combinations
# Case 1: Same class (y₁ = y₂ = 1)
# Case 2: Different class (y₁ = 1, y₂ = -1)
C = 1.0
visualize_smo_subproblem(C, 1, 1, 0.8, None, "Same Class Labels (y₁=y₂=1)")
plt.show()
```

Now we derive the analytical solution for the two-variable subproblem. This is the mathematical heart of SMO.
We want to optimize $\alpha_1$ and $\alpha_2$ while holding all other $\alpha_i$ (for $i \geq 3$) fixed. Let's denote:
- $\alpha_1^{old}, \alpha_2^{old}$: the current values, and $\alpha_1^{new}, \alpha_2^{new}$: the updated values,
- $K_{ij} = K(x_i, x_j)$: kernel evaluations, and
- $E_i = f(x_i) - y_i$: the prediction error on example $i$.
The equality constraint must hold both before and after: $$\alpha_1^{old} y_1 + \alpha_2^{old} y_2 = \alpha_1^{new} y_1 + \alpha_2^{new} y_2 = \zeta$$
This gives us: $$\alpha_1^{new} = \alpha_1^{old} + y_1 y_2 (\alpha_2^{old} - \alpha_2^{new})$$
Before optimizing, we must compute the feasible range $[L, H]$ for $\alpha_2^{new}$. This depends on whether $y_1 = y_2$ or not.
Case 1: $y_1 \neq y_2$ (different classes)
When the labels differ, $y_1 y_2 = -1$. We have: $$\alpha_1^{new} = \alpha_1^{old} - (\alpha_2^{new} - \alpha_2^{old}) = \alpha_1^{old} - \alpha_2^{new} + \alpha_2^{old}$$
For the box constraints $0 \leq \alpha_1^{new} \leq C$ and $0 \leq \alpha_2^{new} \leq C$: $$L = \max(0, \alpha_2^{old} - \alpha_1^{old})$$ $$H = \min(C, C + \alpha_2^{old} - \alpha_1^{old})$$
Case 2: $y_1 = y_2$ (same class)
When labels are the same, $y_1 y_2 = 1$. We have: $$\alpha_1^{new} = \alpha_1^{old} + (\alpha_2^{old} - \alpha_2^{new})$$
The box constraints give: $$L = \max(0, \alpha_1^{old} + \alpha_2^{old} - C)$$ $$H = \min(C, \alpha_1^{old} + \alpha_2^{old})$$
L and H represent the α₂-coordinates where the constraint line intersects the boundary of the [0,C] × [0,C] box. If the line enters the box at α₂ = L and exits at α₂ = H, then any feasible solution must have L ≤ α₂ ≤ H.
Now we find the optimal $\alpha_2^{new}$ ignoring the bound constraints (we'll clip afterward).
Substituting $\alpha_1$ in terms of $\alpha_2$ into the objective function and differentiating, we get:
$$\alpha_2^{new,unclipped} = \alpha_2^{old} + \frac{y_2(E_1 - E_2)}{\eta}$$
where:
- $E_i = f(x_i) - y_i$ is the prediction error on example $i$, and
- $\eta = K_{11} + K_{22} - 2K_{12}$.
The quantity $\eta$ measures the curvature of the objective function:
$\eta > 0$: The objective is strictly concave along the line (the normal case), and we have a unique maximum. This is nearly always the case because: $$\eta = \|\phi(x_1) - \phi(x_2)\|^2 \geq 0$$ with equality only when $\phi(x_1) = \phi(x_2)$ in feature space.
$\eta \leq 0$: Unusual case (can occur with some non-Mercer kernels or numerical issues). The objective is linear or convex, so the optimum is at a boundary. Platt's original paper handles this by evaluating the objective at both endpoints and choosing the better one.
```python
import numpy as np

def compute_L_H(alpha1, alpha2, y1, y2, C):
    """
    Compute the feasible bounds [L, H] for α₂^new.

    Parameters:
        alpha1, alpha2: Current values of the two Lagrange multipliers
        y1, y2: Class labels {-1, +1}
        C: Regularization parameter (upper bound)

    Returns:
        L, H: Lower and upper bounds for α₂^new
    """
    if y1 != y2:
        # Different class labels
        L = max(0.0, alpha2 - alpha1)
        H = min(C, C + alpha2 - alpha1)
    else:
        # Same class labels
        L = max(0.0, alpha1 + alpha2 - C)
        H = min(C, alpha1 + alpha2)
    return L, H

def compute_eta(K11, K12, K22):
    """
    Compute η = K(x₁,x₁) + K(x₂,x₂) - 2K(x₁,x₂).

    This measures the curvature of the objective function along
    the constraint line. For Mercer kernels, η ≥ 0.
    """
    return K11 + K22 - 2 * K12

def optimize_two_variables(alpha1_old, alpha2_old, y1, y2, E1, E2,
                           K11, K12, K22, C, tol=1e-8):
    """
    Solve the two-variable subproblem analytically.

    Parameters:
        alpha1_old, alpha2_old: Current values
        y1, y2: Class labels
        E1, E2: Prediction errors (f(xᵢ) - yᵢ)
        K11, K12, K22: Kernel evaluations K(xᵢ, xⱼ)
        C: Regularization parameter
        tol: Numerical tolerance

    Returns:
        alpha1_new, alpha2_new: Optimized values
        changed: Whether significant change occurred
    """
    # Compute bounds
    L, H = compute_L_H(alpha1_old, alpha2_old, y1, y2, C)

    # If bounds coincide, no optimization is possible
    if abs(L - H) < tol:
        return alpha1_old, alpha2_old, False

    # Compute η
    eta = compute_eta(K11, K12, K22)

    if eta > 0:
        # Normal case: solve analytically, then clip to [L, H]
        alpha2_new_unclipped = alpha2_old + y2 * (E1 - E2) / eta
        if alpha2_new_unclipped > H:
            alpha2_new = H
        elif alpha2_new_unclipped < L:
            alpha2_new = L
        else:
            alpha2_new = alpha2_new_unclipped
    else:
        # Edge case: η ≤ 0 (can occur with non-Mercer kernels or
        # numerical issues). The full algorithm evaluates the dual
        # objective at both endpoints and keeps the better one; here
        # we follow the gradient y₂(E₁ - E₂) to the nearer endpoint.
        if y2 * (E1 - E2) > 0:
            alpha2_new = H
        else:
            alpha2_new = L

    # Check if the change is significant
    if abs(alpha2_new - alpha2_old) < tol * (alpha2_new + alpha2_old + tol):
        return alpha1_old, alpha2_old, False

    # Compute α₁^new from the equality constraint
    alpha1_new = alpha1_old + y1 * y2 * (alpha2_old - alpha2_new)

    return alpha1_new, alpha2_new, True

# Example usage
if __name__ == "__main__":
    # Example: two support vectors from opposite classes
    alpha1_old, alpha2_old = 0.5, 0.3
    y1, y2 = 1, -1                   # Different classes
    E1, E2 = -0.2, 0.1               # Prediction errors
    K11, K12, K22 = 1.0, 0.8, 1.0    # Kernel values (RBF-like)
    C = 1.0

    alpha1_new, alpha2_new, changed = optimize_two_variables(
        alpha1_old, alpha2_old, y1, y2, E1, E2, K11, K12, K22, C
    )

    print(f"Original:  α₁={alpha1_old:.4f}, α₂={alpha2_old:.4f}")
    print(f"Optimized: α₁={alpha1_new:.4f}, α₂={alpha2_new:.4f}")
    print(f"Changed: {changed}")

    # Verify the equality constraint is preserved
    constraint_old = alpha1_old * y1 + alpha2_old * y2
    constraint_new = alpha1_new * y1 + alpha2_new * y2
    print(f"Constraint before: {constraint_old:.6f}")
    print(f"Constraint after:  {constraint_new:.6f}")
```

With the two-variable subproblem solved, we can now present the complete SMO algorithm. The key components are the outer-loop heuristics for choosing which examples to examine, the analytical two-variable solver, the threshold update, and the error cache.
The outer loop alternates between two strategies:
Strategy A: Full Pass Examine all training examples, looking for any that violates KKT conditions significantly. For each violator, try to find a partner and optimize the pair.
Strategy B: Working Set Pass
Examine only the current "working set"—examples where $0 < \alpha_i < C$ (non-bound examples). These are most likely to need adjustment.
Platt's insight: Use Strategy A to find new support vectors, then use Strategy B to fine-tune them. This dramatically reduces the number of expensive kernel evaluations.
```python
import numpy as np

class SMO:
    """
    Complete SMO implementation for SVM training.

    This implementation follows the structure of Platt's original algorithm,
    with some simplifications for clarity (e.g., the error cache is fully
    recomputed rather than incrementally updated).
    """

    def __init__(self, kernel_fn, C, tol=1e-3, max_passes=100):
        """
        Initialize SMO solver.

        Parameters:
            kernel_fn: Callable K(x_i, x_j) -> float
            C: Regularization parameter
            tol: Tolerance for KKT violation checking
            max_passes: Maximum passes through data without change
        """
        self.kernel_fn = kernel_fn
        self.C = C
        self.tol = tol
        self.max_passes = max_passes

        # Will be set during training
        self.X = None          # Training data
        self.y = None          # Labels
        self.alphas = None     # Lagrange multipliers
        self.b = 0.0           # Threshold/bias
        self.E_cache = None    # Error cache for efficiency

    def fit(self, X, y):
        """
        Train the SVM using SMO.

        Parameters:
            X: Training features, shape (n_samples, n_features)
            y: Labels, shape (n_samples,), values in {-1, +1}
        """
        n_samples = X.shape[0]
        self.X = X
        self.y = y.astype(float)
        self.alphas = np.zeros(n_samples)
        self.b = 0.0

        # Initialize error cache: E_i = f(x_i) - y_i = -y_i (since f ≡ 0 initially)
        self.E_cache = -self.y

        passes = 0
        while passes < self.max_passes:
            num_changed = 0

            # Alternate between examining all examples and only non-bound examples
            if passes % 2 == 0:
                # Full pass: examine all examples
                for i in range(n_samples):
                    num_changed += self._examine_example(i)
            else:
                # Working set pass: only non-bound examples
                for i in range(n_samples):
                    if 0 < self.alphas[i] < self.C:
                        num_changed += self._examine_example(i)

            if num_changed == 0:
                passes += 1
            else:
                passes = 0  # Reset if progress was made

        return self

    def _examine_example(self, i2):
        """
        Examine example i2 and try to find a partner i1 for optimization.

        Returns:
            1 if optimization was performed, 0 otherwise
        """
        y2 = self.y[i2]
        alpha2 = self.alphas[i2]
        E2 = self._get_error(i2)
        r2 = E2 * y2  # r = y*E = y*f(x) - 1

        # Check if this example violates the KKT conditions
        if not self._violates_kkt(alpha2, r2):
            return 0

        # Try to find the best i1 (second choice heuristic)
        i1 = self._select_partner(i2, E2)
        if i1 >= 0 and self._optimize_pair(i1, i2):
            return 1

        # If that didn't work, loop over non-bound examples from a random start
        n = len(self.y)
        start = np.random.randint(n)
        for i in range(n):
            i1 = (start + i) % n
            if 0 < self.alphas[i1] < self.C:
                if self._optimize_pair(i1, i2):
                    return 1

        # Still no luck? Try all examples
        start = np.random.randint(n)
        for i in range(n):
            i1 = (start + i) % n
            if self._optimize_pair(i1, i2):
                return 1

        return 0  # Failed to make progress

    def _violates_kkt(self, alpha, r):
        """
        Check if the KKT conditions are violated.

        KKT conditions for SVM:
        - α = 0     → y*f(x) ≥ 1 (non-support vector, correctly classified)
        - 0 < α < C → y*f(x) = 1 (support vector on margin)
        - α = C     → y*f(x) ≤ 1 (support vector, possibly misclassified)

        r = y * E = y * (f(x) - y) = y*f(x) - 1, so y*f(x) = r + 1.
        """
        return ((r < -self.tol and alpha < self.C) or
                (r > self.tol and alpha > 0))

    def _select_partner(self, i2, E2):
        """
        Select i1 using the second choice heuristic:
        choose i1 to maximize |E1 - E2|.
        """
        if E2 > 0:
            i1 = np.argmin(self.E_cache)  # Most negative error maximizes |E1 - E2|
        else:
            i1 = np.argmax(self.E_cache)  # Most positive error maximizes |E1 - E2|
        return i1 if i1 != i2 else -1

    def _optimize_pair(self, i1, i2):
        """
        Attempt to jointly optimize alpha[i1] and alpha[i2].

        Returns:
            True if optimization was performed
        """
        if i1 == i2:
            return False

        alpha1_old = self.alphas[i1]
        alpha2_old = self.alphas[i2]
        y1 = self.y[i1]
        y2 = self.y[i2]
        E1 = self._get_error(i1)
        E2 = self._get_error(i2)

        # Compute bounds
        if y1 != y2:
            L = max(0, alpha2_old - alpha1_old)
            H = min(self.C, self.C + alpha2_old - alpha1_old)
        else:
            L = max(0, alpha1_old + alpha2_old - self.C)
            H = min(self.C, alpha1_old + alpha2_old)

        if L >= H:
            return False

        # Compute kernel values
        K11 = self.kernel_fn(self.X[i1], self.X[i1])
        K12 = self.kernel_fn(self.X[i1], self.X[i2])
        K22 = self.kernel_fn(self.X[i2], self.X[i2])

        # Compute eta
        eta = K11 + K22 - 2 * K12

        if eta > 0:
            # Compute new alpha2, clipped to [L, H]
            alpha2_new = alpha2_old + y2 * (E1 - E2) / eta
            alpha2_new = np.clip(alpha2_new, L, H)
        else:
            # Edge case: would require evaluating the objective at both
            # endpoints (simplified here: skip this pair)
            return False

        # Check for sufficient change
        if abs(alpha2_new - alpha2_old) < self.tol * (alpha2_new + alpha2_old + self.tol):
            return False

        # Compute new alpha1 from the equality constraint
        alpha1_new = alpha1_old + y1 * y2 * (alpha2_old - alpha2_new)

        # Update threshold b
        self._update_threshold(i1, i2, alpha1_new, alpha2_new,
                               alpha1_old, alpha2_old, E1, E2, K11, K12, K22)

        # Update alphas
        self.alphas[i1] = alpha1_new
        self.alphas[i2] = alpha2_new

        # Update error cache
        self._update_error_cache()

        return True

    def _update_threshold(self, i1, i2, a1_new, a2_new, a1_old, a2_old,
                          E1, E2, K11, K12, K22):
        """Update the bias/threshold b (with f(x) = Σ αᵢyᵢK(xᵢ,x) + b)."""
        y1, y2 = self.y[i1], self.y[i2]
        b1 = self.b - E1 - y1 * (a1_new - a1_old) * K11 - y2 * (a2_new - a2_old) * K12
        b2 = self.b - E2 - y1 * (a1_new - a1_old) * K12 - y2 * (a2_new - a2_old) * K22

        if 0 < a1_new < self.C:
            self.b = b1
        elif 0 < a2_new < self.C:
            self.b = b2
        else:
            self.b = (b1 + b2) / 2

    def _get_error(self, i):
        """Get the cached error for example i."""
        return self.E_cache[i]

    def _update_error_cache(self):
        """Recompute all errors after an alpha update (simple O(n·s) version)."""
        for i in range(len(self.y)):
            self.E_cache[i] = self._decision_function(self.X[i]) - self.y[i]

    def _decision_function(self, x):
        """Compute f(x) = Σ α_i y_i K(x_i, x) + b."""
        result = 0.0
        for i in range(len(self.y)):
            if self.alphas[i] > 0:  # Only support vectors contribute
                result += self.alphas[i] * self.y[i] * self.kernel_fn(self.X[i], x)
        return result + self.b

    def predict(self, X):
        """Predict class labels for samples in X."""
        return np.array([np.sign(self._decision_function(x)) for x in X])

    def n_support_vectors(self):
        """Return the number of support vectors."""
        return int(np.sum(self.alphas > 0))
```

After optimizing $\alpha_1$ and $\alpha_2$, we must update the bias term $b$ to maintain the KKT conditions. This seemingly minor detail significantly impacts numerical stability.
For support vectors (examples where $0 < \alpha_i < C$), the KKT conditions require: $$y_i f(x_i) = 1 \quad \Rightarrow \quad f(x_i) = y_i$$
Since $f(x_i) = \sum_j \alpha_j y_j K(x_j, x_i) + b$, we can solve for $b$: $$b = y_i - \sum_j \alpha_j y_j K(x_j, x_i)$$
After optimizing $\alpha_1$ and $\alpha_2$, we compute two candidate values:
$$b_1 = b^{old} - E_1 - y_1(\alpha_1^{new} - \alpha_1^{old})K_{11} - y_2(\alpha_2^{new} - \alpha_2^{old})K_{12}$$
$$b_2 = b^{old} - E_2 - y_1(\alpha_1^{new} - \alpha_1^{old})K_{12} - y_2(\alpha_2^{new} - \alpha_2^{old})K_{22}$$
The selection rule depends on which alphas are non-bound: if $0 < \alpha_1^{new} < C$, use $b_1$; otherwise, if $0 < \alpha_2^{new} < C$, use $b_2$; if both are at bounds, take the average $(b_1 + b_2)/2$.
The threshold update is sensitive to numerical precision. When both alphas are at bounds (0 or C), any b in the interval [b₁, b₂] (or [b₂, b₁]) is valid. Using the average provides robustness against floating-point errors accumulating over many iterations.
SMO maintains a cache of prediction errors $E_i = f(x_i) - y_i$ for all training examples. This is crucial for:
- checking KKT violations without recomputing $f(x_i)$ from scratch, and
- the second choice heuristic, which selects a partner maximizing $|E_1 - E_2|$.
After each optimization step, errors must be updated. However, full recomputation for all examples would be O(n × number of support vectors). Platt's implementation uses incremental updates:
$$E_i^{new} = E_i^{old} + y_1(\alpha_1^{new} - \alpha_1^{old})K_{1i} + y_2(\alpha_2^{new} - \alpha_2^{old})K_{2i} + (b^{new} - b^{old})$$
This requires only O(n) kernel evaluations per optimization step, keeping the algorithm efficient.
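A sketch of this incremental update, assuming the kernel rows $K(x_1,\cdot)$ and $K(x_2,\cdot)$ over the training set are available as arrays (variable names are illustrative):

```python
import numpy as np

def update_errors(E, y1, y2, d_alpha1, d_alpha2, K1_row, K2_row, d_b):
    """Incrementally update the full error cache after one SMO step.

    E:                  current errors E_i, shape (n,)
    d_alpha1, d_alpha2: alpha_new - alpha_old for the two optimized multipliers
    K1_row, K2_row:     kernel rows K(x1, x_i) and K(x2, x_i), shape (n,)
    d_b:                b_new - b_old
    """
    return E + y1 * d_alpha1 * K1_row + y2 * d_alpha2 * K2_row + d_b
```

This is the optimization the class above omits for clarity; swapping it in replaces the full recomputation with a single O(n) vector update per step.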
SMO terminates when all training examples satisfy the Karush-Kuhn-Tucker (KKT) conditions within a specified tolerance. Understanding these conditions is essential for tuning convergence behavior.
For the soft-margin SVM, the KKT conditions state:
$$\alpha_i = 0 \quad \Rightarrow \quad y_i f(x_i) \geq 1$$ $$0 < \alpha_i < C \quad \Rightarrow \quad y_i f(x_i) = 1$$ $$\alpha_i = C \quad \Rightarrow \quad y_i f(x_i) \leq 1$$
Using the error $E_i = f(x_i) - y_i$ and noting that $y_i^2 = 1$: $$r_i = y_i E_i = y_i f(x_i) - 1$$
So the conditions become:
- $\alpha_i = 0 \quad \Rightarrow \quad r_i \geq 0$
- $0 < \alpha_i < C \quad \Rightarrow \quad r_i = 0$
- $\alpha_i = C \quad \Rightarrow \quad r_i \leq 0$
In practice, we can't require exact equality, so SMO uses a tolerance parameter $\epsilon$ (typically 10⁻³). An example violates KKT if it falls into one of the conditions in the table below (a vectorized check follows the table):
| α Value | Interpretation | Violation Condition | Action Needed |
|---|---|---|---|
| α = 0 | Non-support vector | y·f(x) < 1 - ε | Increase α (not using this example enough) |
| 0 < α < C | Margin support vector | \|y·f(x) - 1\| > ε | Adjust α (example not exactly on margin) |
| α = C | Bound support vector | y·f(x) > 1 + ε | Decrease α (penalizing this example too much) |
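The table translates directly into a vectorized check; a sketch, assuming `f_values` holds $f(x_i)$ for every training example:

```python
import numpy as np

def kkt_violation_mask(alpha, y, f_values, C, eps=1e-3):
    """Boolean mask of examples violating the KKT conditions within eps."""
    margin = y * f_values               # y_i * f(x_i)
    at_zero = alpha <= 0                # should satisfy margin >= 1
    at_C = alpha >= C                   # should satisfy margin <= 1
    free = ~at_zero & ~at_C             # should satisfy margin == 1
    return ((at_zero & (margin < 1 - eps)) |
            (free & (np.abs(margin - 1) > eps)) |
            (at_C & (margin > 1 + eps)))
```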
SMO's convergence rate depends on several factors:
1. Data Separability: Near-separable data converges quickly because most alphas stay at 0. Highly overlapping classes require more iterations.
2. C Parameter: The value of C shifts how many multipliers end up at the bounds (α = 0 or α = C). Bound examples are faster to process than free support vectors (0 < α < C), so problems with many free support vectors converge more slowly.
3. Tolerance: Tighter tolerance (smaller ε) requires more iterations but yields a more accurate solution.
4. Working Set Selection: Better heuristics for choosing optimization pairs can dramatically reduce iteration count (covered in the next page).
Instead of strictly counting KKT violations, production implementations often monitor:
- the largest KKT violation over all examples,
- the change in the dual objective between iterations, and
- the primal-dual (duality) gap.
These provide more informative progress indicators than simple iteration counts.
For very large datasets, waiting for full convergence may be impractical. Many practitioners use early stopping when the primal-dual gap falls below a practical threshold (e.g., 1% of the optimal value). The resulting solution is slightly suboptimal but often indistinguishable in test accuracy.
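As a sketch of this kind of monitoring, restricted to the linear kernel where $w$ can be formed explicitly (a kernelized version would use $\|w\|^2 = \alpha^T Q \alpha$ and cached $f(x_i)$ values instead):

```python
import numpy as np

def duality_gap(X, y, alpha, b, C):
    """Primal objective minus dual objective for a linear SVM (zero at optimum)."""
    w = (alpha * y) @ X                              # w = Σ αᵢ yᵢ xᵢ
    hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))   # per-example slack
    primal = 0.5 * (w @ w) + C * hinge.sum()
    Q = np.outer(y, y) * (X @ X.T)                   # linear-kernel Hessian
    dual = alpha.sum() - 0.5 * alpha @ Q @ alpha
    return primal - dual

# Illustrative early-stopping rule: halt once the gap is ~1% of the dual value.
```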
SMO represented a paradigm shift in SVM training. To appreciate its impact, let's compare it with alternative approaches that were used before and those developed concurrently.
Interior Point Methods: These general-purpose QP solvers work by following a "central path" through the feasible region. While they have excellent theoretical convergence properties (polynomial time), they suffer from:
- O(n²) memory to store the dense kernel matrix, and
- O(n³) work for the matrix factorizations required at each step.
Active Set Methods: These incrementally build the set of active constraints (support vectors). They were state-of-the-art before SMO but still required:
- storing kernel sub-matrices for the current active set, and
- numerically solving a linear system or QP subproblem at every step.
SMO belongs to a family of "decomposition" or "chunking" methods that solve the QP by working on subsets of variables:
Original Chunking (Vapnik, 1982): Solve the QP on a manageable "chunk" of the data, keep the resulting support vectors, add the examples that most violate the KKT conditions, and repeat. This works when the set of support vectors is small, but memory still scales with the square of the number of support vectors.
Since SMO's publication, several improvements and alternatives have emerged:
LIBSVM (Chang & Lin): The most widely-used SVM library. Uses SMO with sophisticated working set selection (WSS3) and shrinking heuristics. We'll cover these optimizations in detail.
SVMlight (Joachims): Uses larger working sets (beyond 2 variables) with iterative optimization. Better for certain problem types.
Coordinate Descent: Dual coordinate descent methods (as in LIBLINEAR) are preferred for linear SVMs due to their simplicity and speed.
Stochastic Methods: For extremely large datasets, stochastic gradient descent on the primal problem (Pegasos, SGD-SVM) scales better than any dual method, though with some loss in solution quality.
The choice of SVM solver depends on your problem:
| Scenario | Recommended Approach | Rationale |
|---|---|---|
| Linear kernel, n < 100K | LIBLINEAR (Dual CD) | Fastest for linear problems |
| Nonlinear kernel, n < 50K | LIBSVM (SMO) | Gold standard for kernel SVMs |
| Nonlinear kernel, n > 50K | Approximate methods | Random Fourier Features, Nyström |
| Very large scale (n > 1M) | SGD/Pegasos | Only primal methods scale |
| Need exact solution | Interior Point | Polynomial-time guaranteed |
We've completed a deep exploration of the SMO algorithm—the computational breakthrough that made SVMs practical. Let's consolidate the key insights:
1. Two variables form the smallest subproblem compatible with the equality constraint $\sum \alpha_i y_i = 0$.
2. Each two-variable subproblem has a closed-form solution: an unconstrained maximum, clipped to the feasible segment $[L, H]$.
3. Memory drops from O(n²) to O(n) because the kernel matrix is never stored; entries are computed on demand.
4. The practical details (error caching, threshold updates, selection heuristics) determine real-world performance.
The basic SMO algorithm we've covered is the foundation, but production implementations include many refinements. In the next page, we'll explore Working Set Selection—the heuristics that determine which pair of variables to optimize at each step. Good working set selection can speed up SMO by 10× or more, transforming it from a clever algorithm into a practical powerhouse.
We'll examine the maximum violating pair (MVP) heuristic, the second-order working set selection in LIBSVM, and the shrinking techniques that further accelerate convergence.
You now understand the SMO algorithm at a deep level—not just how to use it, but why it works. This foundation prepares you for the optimization refinements covered in subsequent pages, and gives you the insight to debug and tune SVM training in practice.