The SMO algorithm's correctness is guaranteed for any valid pair selection—even random choices would eventually converge. But in practice, the choice of which pair to optimize can affect total runtime by orders of magnitude.
Consider a dataset with 100,000 training examples. Random pair selection might require millions of iterations to converge. Intelligent selection—using heuristics that identify the pairs most likely to make progress—can reduce this to tens of thousands. The difference between a computation finishing in minutes versus days often comes down to sophisticated working set selection.
This page explores the evolution of working set selection strategies, from Platt's original heuristics to the second-order methods used in modern implementations like LIBSVM. You'll understand not just what these heuristics do, but why they work—enabling you to tune and debug SVM training effectively.
By the end of this page, you will understand: (1) why naive selection strategies fail, (2) Platt's original first-choice and second-choice heuristics, (3) the maximum violating pair (MVP) strategy, (4) second-order working set selection (WSS2/WSS3), (5) shrinking—the technique that prunes the working set, and (6) caching strategies for kernel evaluations.
At each SMO iteration, we select two variables (αᵢ, αⱼ) to jointly optimize. With n training examples, there are $\binom{n}{2} = \frac{n(n-1)}{2}$ possible pairs. For n = 10,000, that's nearly 50 million options.
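Checking the arithmetic for n = 10,000:

$$\binom{10{,}000}{2} = \frac{10{,}000 \times 9{,}999}{2} = 49{,}995{,}000 \approx 5 \times 10^7.$$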
Not all pairs are equally useful. A pair in which both variables already satisfy the KKT conditions (or both sit pinned at a bound they cannot move past) yields no progress at all, while a pair with a large KKT violation and a large error difference |E₁ - E₂| can decrease the objective substantially in a single step.
Empirical studies show dramatic differences in iteration counts:
| Selection Strategy | Iterations to Converge | Kernel Evaluations | Training Time |
|---|---|---|---|
| Random selection | ~2,500,000 | ~80 billion | Hours |
| Platt's heuristics | ~150,000 | ~5 billion | Minutes |
| MVP (first-order) | ~45,000 | ~1.5 billion | ~1 minute |
| WSS3 (second-order) | ~18,000 | ~600 million | ~30 seconds |
The 140× reduction in iterations between random and WSS3 represents the accumulated insights of two decades of algorithm development. Let's trace this evolution.
Better selection strategies require more computation per iteration, so there is a tradeoff. The key insight is that kernel evaluations dominate SVM training cost: a strategy that doubles per-iteration selection overhead but halves the iteration count can roughly halve the total number of kernel evaluations—usually a net win.
SMO uses a working set of size 2. Some algorithms (like SVMlight) use larger working sets of 10-50 variables, solving the resulting QP with an off-the-shelf solver. Larger working sets make more progress per iteration but require more complex QP solvers. SMO's working set of 2 allows analytical solutions, trading iteration count for simplicity.
John Platt's 1998 paper introduced two heuristics that made SMO practical: a first-choice heuristic for selecting the first variable and a second-choice heuristic for selecting the second.
The first variable is chosen by cycling through examples and looking for KKT violations.
Strategy: alternate between a full pass over all examples and repeated passes over only the non-bound examples (0 < αᵢ < C), selecting any example that violates the KKT conditions within tolerance. Once no non-bound violators remain, do another full pass; the algorithm terminates when a full pass finds no violators at all.
Rationale: Non-bound examples are the most "active"—they directly determine the decision boundary. An example at a bound (αᵢ = 0 or αᵢ = C) has effectively been decided: it's either not a support vector or it's being maximally penalized.
By focusing on non-bound examples first, SMO makes maximum progress on the examples that matter most. Full passes are done periodically to discover new support vectors (examples that should move away from αᵢ = 0).
Given the first variable i₁ with error E₁, the second variable is chosen to maximize expected progress:
The Maximum Step Heuristic:
Choose i₂ to maximize |E₁ - E₂|.
Recall that the step size for α₂ is: $$\Delta\alpha_2 = \frac{y_2(E_1 - E_2)}{\eta}$$
For a fixed η (which we don't know without computing kernel values), maximizing |E₁ - E₂| maximizes the step size.
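As a quick worked example with illustrative numbers (not from any particular dataset): suppose y₂ = +1, E₁ = 0.5, and η = 2. A candidate with E₂ = -0.7 gives

$$\Delta\alpha_2 = \frac{y_2(E_1 - E_2)}{\eta} = \frac{0.5 - (-0.7)}{2} = 0.6,$$

while a candidate with E₂ = 0.1 gives only Δα₂ = 0.2. The heuristic prefers the first candidate because its error gap is three times larger.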
```python
import numpy as np


class PlattSMO:
    """
    SMO with Platt's original heuristics for working set selection.
    """

    def __init__(self, kernel_fn, C, tol=1e-3, eps=1e-5):
        self.kernel_fn = kernel_fn
        self.C = C
        self.tol = tol  # Tolerance for KKT checking
        self.eps = eps  # Minimum change threshold

    def first_choice_heuristic(self, alphas, errors, y, C):
        """
        First-choice heuristic: select the first variable for optimization.

        Priority:
          1. Non-bound examples (0 < alpha < C) that violate KKT
          2. All examples that violate KKT

        Returns: index of the first variable, or -1 if none found
        """
        n = len(alphas)

        # First pass: check non-bound examples only
        non_bound_indices = np.where((alphas > 0) & (alphas < C))[0]
        for i in non_bound_indices:
            if self._violates_kkt(alphas[i], y[i] * errors[i]):
                return i

        # Second pass: check all examples
        for i in range(n):
            if self._violates_kkt(alphas[i], y[i] * errors[i]):
                return i

        return -1  # No KKT violation found

    def second_choice_heuristic(self, i1, E1, errors, alphas, y, X, C):
        """
        Second-choice heuristic: select the second variable for optimization.

        Strategy: choose i2 to maximize |E1 - E2|.
        If no suitable non-bound example exists, fall back to random selection.
        """
        n = len(alphas)
        i2 = -1
        max_delta_E = 0

        # First, look among non-bound examples for max |E1 - E2|
        non_bound_indices = np.where((alphas > 0) & (alphas < C))[0]
        for i in non_bound_indices:
            if i == i1:
                continue
            delta_E = abs(E1 - errors[i])
            if delta_E > max_delta_E:
                max_delta_E = delta_E
                i2 = i

        if i2 >= 0:
            return i2

        # No suitable non-bound example found; fall back to a
        # hierarchical random search:
        # 1. Try non-bound examples (random order)
        np.random.shuffle(non_bound_indices)
        for i in non_bound_indices:
            if i != i1:
                return i

        # 2. Try all remaining examples (random order)
        all_indices = np.arange(n)
        np.random.shuffle(all_indices)
        for i in all_indices:
            if i != i1:
                return i

        return -1  # Never reached if n > 1

    def _violates_kkt(self, alpha, r):
        """
        Check if the KKT conditions are violated.
        r = y * E = y * (f(x) - y) = y * f(x) - 1
        """
        return ((r < -self.tol and alpha < self.C) or
                (r > self.tol and alpha > 0))

    def select_working_set(self, alphas, errors, y, X, C):
        """
        Select the pair (i1, i2) to optimize using Platt's heuristics.

        Returns: (i1, i2), or (-1, -1) if converged
        """
        # First choice: find a KKT violator
        i1 = self.first_choice_heuristic(alphas, errors, y, C)
        if i1 < 0:
            return -1, -1  # Converged
        E1 = errors[i1]

        # Second choice: maximize |E1 - E2|
        i2 = self.second_choice_heuristic(i1, E1, errors, alphas, y, X, C)
        return i1, i2


# Example demonstrating the heuristic
def demonstrate_heuristic():
    """
    Show how second-choice picks the variable with maximum error difference.
    """
    # Simulated state: some alphas, errors
    alphas = np.array([0.0, 0.5, 0.0, 0.3, 1.0, 0.7, 0.0, 0.2, 1.0, 0.4])
    errors = np.array([-0.1, 0.3, 0.0, -0.5, 0.2, 0.8, -0.2, 0.1, 0.0, -0.4])

    # Suppose first choice selected i1=3 with E1=-0.5
    i1 = 3
    E1 = errors[i1]

    # Non-bound examples: indices where 0 < alpha < C
    C = 1.0
    non_bound = np.where((alphas > 0) & (alphas < C))[0]

    print(f"Non-bound examples: {non_bound}")
    print(f"First choice: i1={i1}, E1={E1}")

    # Find max |E1 - E2| among non-bound examples
    for i in non_bound:
        if i != i1:
            print(f"  i={i}: E[{i}]={errors[i]:.2f}, "
                  f"|E1-E2|={abs(E1 - errors[i]):.2f}")

    # The maximum is E1=-0.5 vs E5=0.8, giving |E1-E2|=1.3
    print("\nSecond choice selects i2=5 (largest |E1-E2|=1.3)")


demonstrate_heuristic()
```

While groundbreaking, Platt's heuristics have shortcomings:
Ignores η: The step size depends on |E₁ - E₂|/η, but Platt's heuristic ignores η. Two examples with similar errors but very different η values are treated the same.
Error Cache Staleness: Errors change as optimization proceeds, but are only updated after full passes. The cached errors may not reflect current model state.
No Lookahead: The heuristic optimizes locally without considering whether the chosen pair will make global progress toward the solution.
These limitations motivated the development of more sophisticated selection strategies.
The Maximum Violating Pair (MVP) strategy, also known as first-order working set selection, provides a principled approach based on optimality conditions.
Recall that at the optimum, the KKT conditions require (folding the bias $b$ into $f$): $$\nabla_i W \leq 0 \quad \text{if } \alpha_i = 0$$ $$\nabla_i W = 0 \quad \text{if } 0 < \alpha_i < C$$ $$\nabla_i W \geq 0 \quad \text{if } \alpha_i = C$$
where $\nabla_i W = 1 - y_i \sum_j \alpha_j y_j K_{ij}$ is the gradient component.
Define:
$$I_{up} = \{t : \alpha_t < C,\ y_t = +1\} \cup \{t : \alpha_t > 0,\ y_t = -1\}$$
$$I_{low} = \{t : \alpha_t < C,\ y_t = -1\} \cup \{t : \alpha_t > 0,\ y_t = +1\}$$
These sets represent the variables that are still free to move: for $t \in I_{up}$ the product $y_t \alpha_t$ can increase without leaving the box $[0, C]$, and for $t \in I_{low}$ it can decrease.
Select (i, j) where: $$i = \arg\max_{t \in I_{up}} (-y_t \nabla_t W)$$ $$j = \arg\min_{t \in I_{low}} (-y_t \nabla_t W)$$
Convergence Check: The algorithm has converged when: $$\max_{t \in I_{up}} (-y_t \nabla_t W) - \min_{t \in I_{low}} (-y_t \nabla_t W) \leq \epsilon$$
This gap measures how close we are to satisfying KKT conditions.
The quantity $-y_i \nabla_i W$ is closely related to the error $E_i$. Substituting $\nabla_i W = 1 - y_i f(x_i)$: $$-y_i \nabla_i W = -y_i\left(1 - y_i f(x_i)\right) = -y_i + y_i^2 f(x_i) = f(x_i) - y_i = E_i$$ (with some abuse of notation regarding whether $b$ is included). MVP effectively selects the most positive and most negative errors, similar to Platt's heuristic but with a formal optimality guarantee.
```python
import numpy as np


class MVPWorkingSet:
    """
    Maximum Violating Pair (MVP) working set selection.

    Also known as first-order working set selection because it uses
    gradient information (first derivative of the objective).
    """

    def __init__(self, tol=1e-3):
        self.tol = tol

    def compute_gradient(self, alpha, y, K, b):
        """
        Compute the selection quantity derived from the dual gradient.

        grad_i W = 1 - y_i * sum_j alpha_j y_j K_ij = 1 - y_i f(x_i)
        (ignoring b for simplicity)

        The quantity used for selection is -y_i grad_i W = f(x_i) - y_i = E_i.
        """
        f = K @ (alpha * y) + b  # Decision function values
        return f - y             # This is just the error E_i

    def get_index_sets(self, alpha, y, C):
        """
        Compute the I_up and I_low index sets.

        I_up:  indices where y_t * alpha_t can increase
        I_low: indices where y_t * alpha_t can decrease
        """
        # I_up: (alpha < C and y = +1) or (alpha > 0 and y = -1)
        I_up = np.where(
            ((alpha < C) & (y == 1)) | ((alpha > 0) & (y == -1))
        )[0]

        # I_low: (alpha < C and y = -1) or (alpha > 0 and y = +1)
        I_low = np.where(
            ((alpha < C) & (y == -1)) | ((alpha > 0) & (y == 1))
        )[0]

        return I_up, I_low

    def select_working_set(self, alpha, y, K, b, C):
        """
        Select (i, j) using the MVP strategy.

        Returns:
            i, j: indices of the selected pair (-1, -1 if converged)
            gap:  the optimality gap (max - min)
        """
        # Compute gradient-related quantities (effectively errors)
        G = self.compute_gradient(alpha, y, K, b)

        # Get index sets
        I_up, I_low = self.get_index_sets(alpha, y, C)

        if len(I_up) == 0 or len(I_low) == 0:
            return -1, -1, 0.0  # Converged (degenerate case)

        # Find the maximum violating pair:
        # i = argmax over I_up of G  (most positive error)
        # j = argmin over I_low of G (most negative error)
        i = I_up[np.argmax(G[I_up])]
        j = I_low[np.argmin(G[I_low])]

        gap = G[i] - G[j]
        if gap < self.tol:
            return -1, -1, gap  # Converged

        return i, j, gap

    def compute_optimality_gap(self, alpha, y, K, b, C):
        """
        Compute the optimality gap without selecting a pair.
        Used for monitoring convergence.
        """
        G = self.compute_gradient(alpha, y, K, b)
        I_up, I_low = self.get_index_sets(alpha, y, C)

        if len(I_up) == 0 or len(I_low) == 0:
            return 0.0

        return np.max(G[I_up]) - np.min(G[I_low])


# Demonstration
def demonstrate_mvp():
    """
    Show how MVP selects the maximum violating pair.
    """
    # Small example: 6 training examples
    C = 1.0

    # Current state
    alpha = np.array([0.0, 0.5, 0.8, 0.0, 1.0, 0.3])
    y = np.array([1, 1, -1, -1, 1, -1])

    # Simulated errors (would come from kernel computation)
    G = np.array([-0.2, 0.4, -0.6, 0.3, 0.1, 0.5])

    mvp = MVPWorkingSet()
    I_up, I_low = mvp.get_index_sets(alpha, y, C)

    print("Alpha values:", alpha)
    print("Labels y:", y)
    print("Errors G:", G)
    print(f"\nI_up (y*alpha can increase): {I_up}")
    print(f"I_low (y*alpha can decrease): {I_low}")

    print("\nAnalyzing candidates:")
    print("  I_up members:")
    for idx in I_up:
        print(f"    i={idx}: alpha={alpha[idx]:.1f}, y={int(y[idx]):+d}, G={G[idx]:.2f}")
    print("  I_low members:")
    for idx in I_low:
        print(f"    j={idx}: alpha={alpha[idx]:.1f}, y={int(y[idx]):+d}, G={G[idx]:.2f}")

    # Maximum from I_up, minimum from I_low
    i = I_up[np.argmax(G[I_up])]
    j = I_low[np.argmin(G[I_low])]
    gap = G[i] - G[j]

    print(f"\nSelected pair: i={i}, j={j}")
    print(f"Gap = G[{i}] - G[{j}] = {G[i]:.2f} - {G[j]:.2f} = {gap:.2f}")


demonstrate_mvp()
```

| Aspect | Platt's Heuristics | MVP |
|---|---|---|
| Selection basis | Errors & non-bound status | Gradients & index sets |
| Convergence criterion | All KKT satisfied | Optimality gap < ε |
| Theoretical guarantee | Converges but slow | Converges with rate bounds |
| Computational cost | O(n) per iteration | O(n) per iteration |
| Practical speed | Good | Better |
MVP provides both faster convergence and a clean theoretical framework. However, it still ignores second-order information (the curvature η), motivating further improvements.
LIBSVM, the most widely used SVM implementation, employs second-order working set selection that accounts for curvature. This dramatically improves convergence, especially for difficult problems.
MVP selects based on gradients alone. But the actual improvement from optimizing (i, j) depends on:
$$\Delta W = \frac{(y_i \nabla_i W - y_j \nabla_j W)^2}{2\eta_{ij}}$$
where $\eta_{ij} = K_{ii} + K_{jj} - 2K_{ij}$.
Two pairs can have identical gradient differences but vastly different η values. Since $\Delta W = b^2 / (2\eta)$ for gradient gap $b$, the pair with smaller curvature permits a larger step and a larger objective decrease. A first-order method, blind to η, might choose a pair with unfavorable curvature.
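Concretely, using the same illustrative numbers as the demo code below: both candidates have gradient gap $b = 1.2$, but curvatures $\eta = 0.5$ and $\eta = 2.0$. The predicted improvements are

$$\Delta W_1 = \frac{1.2^2}{2 \times 0.5} = 1.44, \qquad \Delta W_2 = \frac{1.2^2}{2 \times 2.0} = 0.36.$$

A first-order rule cannot tell the candidates apart; the second-order rule correctly prefers the first, a 4× larger improvement.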
Fan, Chen, and Lin proposed selecting $i$ using MVP (first-order), then selecting $j$ using second-order information:
Selection of i (WSS2): $$i = \arg\max_{t \in I_{up}} (-y_t \nabla_t W)$$
Selection of j (WSS2): $$j = \arg\min_{t \in I_{low}} \frac{-\left(y_i \nabla_i W - y_t \nabla_t W\right)^2}{\eta_{it}}$$ restricted to candidates $t$ with $-y_t \nabla_t W < -y_i \nabla_i W$, so that the pair can actually make progress.
Minimizing this negative quantity maximizes the predicted objective decrease $(y_i \nabla_i W - y_t \nabla_t W)^2 / (2\eta_{it})$, so $j$ is the candidate expected to make the most actual progress.
WSS2 requires computing ηᵢₜ = Kᵢᵢ + Kₜₜ - 2Kᵢₜ for all candidates t. Since Kₜₜ is fixed (diagonal elements), only Kᵢₜ must be computed. This adds O(|I_low|) kernel evaluations per iteration but usually pays for itself in reduced iteration count.
```python
import numpy as np


class SecondOrderWSS:
    """
    LIBSVM-style second-order working set selection.

    Uses first-order selection for i and second-order selection for j.
    This is the core selection scheme used in LIBSVM (WSS2/WSS3).
    """

    def __init__(self, kernel_fn, tol=1e-3, tau=1e-12):
        """
        Parameters:
            kernel_fn: kernel function K(x_i, x_j)
            tol:       convergence tolerance
            tau:       small constant to prevent division by zero
        """
        self.kernel_fn = kernel_fn
        self.tol = tol
        self.tau = tau  # For numerical stability in eta

    def select_working_set(self, alpha, y, X, Q_diag, grad, C):
        """
        Select (i, j) using the second-order heuristic (WSS2).

        Parameters:
            alpha:  current Lagrange multipliers
            y:      labels
            X:      training data
            Q_diag: diagonal of the Q matrix (precomputed)
            grad:   selection values G = -y * gradW (the errors E)
            C:      regularization parameter

        Returns:
            i, j: selected indices (-1, -1 if converged)
        """
        # Get index sets
        I_up = np.where(
            ((alpha < C) & (y == 1)) | ((alpha > 0) & (y == -1))
        )[0]
        I_low = np.where(
            ((alpha < C) & (y == -1)) | ((alpha > 0) & (y == 1))
        )[0]

        if len(I_up) == 0 or len(I_low) == 0:
            return -1, -1

        # First-order selection of i: maximize G[t] over I_up
        i = I_up[np.argmax(grad[I_up])]
        G_i = grad[i]

        # Check if we can still make progress
        G_min_low = np.min(grad[I_low])
        if G_i - G_min_low < self.tol:
            return -1, -1  # Converged

        # Second-order selection of j: minimize -b^2/a where
        #   a = eta_it = Q_ii + Q_tt - 2*Q_it
        #   b = G_i - G_t
        Q_ii = Q_diag[i]
        best_j = -1
        best_obj = float('inf')

        for t in I_low:
            b = G_i - grad[t]
            if b <= 0:
                continue  # Would not improve the objective

            # Compute Q_it (requires one kernel evaluation)
            Q_it = y[i] * y[t] * self.kernel_fn(X[i], X[t])

            # Compute eta = Q_ii + Q_tt - 2*Q_it
            a = Q_ii + Q_diag[t] - 2 * Q_it
            a = max(a, self.tau)  # Ensure positive for stability

            # We want the largest decrease b^2/(2a),
            # i.e. the most negative value of -b^2/a
            obj = -(b * b) / a
            if obj < best_obj:
                best_obj = obj
                best_j = t

        return i, best_j

    def select_working_set_wss3(self, alpha, y, X, Q_diag, grad, C):
        """
        WSS3: use second-order selection for BOTH i and j.

        Can select better pairs than WSS2, but it examines all pairs in
        I_up x I_low, costing O(n^2) kernel evaluations per iteration.
        """
        I_up = np.where(
            ((alpha < C) & (y == 1)) | ((alpha > 0) & (y == -1))
        )[0]
        I_low = np.where(
            ((alpha < C) & (y == -1)) | ((alpha > 0) & (y == 1))
        )[0]

        if len(I_up) == 0 or len(I_low) == 0:
            return -1, -1

        best_i, best_j = -1, -1
        best_obj = float('inf')

        # Consider all pairs (i, j) from I_up x I_low
        for i in I_up:
            for j in I_low:
                b = grad[i] - grad[j]
                if b <= 0:
                    continue

                Q_ij = y[i] * y[j] * self.kernel_fn(X[i], X[j])
                a = Q_diag[i] + Q_diag[j] - 2 * Q_ij
                a = max(a, self.tau)

                obj = -(b * b) / a
                if obj < best_obj:
                    best_obj = obj
                    best_i = i
                    best_j = j

        return best_i, best_j


# Comparison demonstration
def compare_first_second_order():
    """
    Demonstrate the difference between first- and second-order selection.

    Scenario: two candidate j values with the same gradient difference
    but different curvature (eta).
    """
    print("Example: choosing between two candidates with same gradient gap")
    print("=" * 60)

    # First candidate: low curvature (flat objective)
    G_i = 0.8
    G_j1 = -0.4
    eta_j1 = 0.5  # Low curvature

    # Second candidate: same gradient gap but high curvature
    G_j2 = -0.4
    eta_j2 = 2.0  # High curvature

    # First-order would be indifferent (same |G_i - G_j|)
    gap_j1 = G_i - G_j1
    gap_j2 = G_i - G_j2
    print("\nFirst-order (gradient gap):")
    print(f"  Candidate j1: gap = {gap_j1:.2f}")
    print(f"  Candidate j2: gap = {gap_j2:.2f}")
    print("  First-order is INDIFFERENT")

    # Second-order prefers low curvature (larger actual step)
    obj_j1 = (gap_j1 ** 2) / (2 * eta_j1)
    obj_j2 = (gap_j2 ** 2) / (2 * eta_j2)
    print("\nSecond-order (objective improvement = gap^2 / 2*eta):")
    print(f"  Candidate j1: dObj = {obj_j1:.3f} (eta={eta_j1})")
    print(f"  Candidate j2: dObj = {obj_j2:.3f} (eta={eta_j2})")
    print(f"  Second-order prefers j1 ({obj_j1:.3f} > {obj_j2:.3f})")

    print(f"\n  Ratio of improvements: j1/j2 = {obj_j1/obj_j2:.1f}x better!")


compare_first_second_order()
```

Second-order selection can provide significant speedups:
| Dataset | First-Order Iterations | Second-Order Iterations | Speedup |
|---|---|---|---|
| adult (32K) | 45,000 | 18,000 | 2.5× |
| web (49K) | 68,000 | 24,000 | 2.8× |
| ijcnn (49K) | 22,000 | 8,000 | 2.75× |
| covtype (581K) | 850,000 | 280,000 | 3.0× |
The speedup tends to be larger for difficult problems, such as those with highly nonlinear kernels, ill-conditioned kernel matrices, or large values of C. For simple problems or linear kernels, the overhead of second-order selection may not pay off.
Two additional techniques are essential for practical SMO performance: shrinking and kernel caching. These optimizations can reduce training time by an order of magnitude.
During optimization, many variables quickly reach their bounds (αᵢ = 0 or αᵢ = C) and stay there. Shrinking removes these "inactive" variables from consideration:
The Idea: If αᵢ has been at a bound for several consecutive iterations, temporarily remove it from the working set candidate pool.
When to Remove: Variable $i$ is shrunk if it sits at a bound and its gradient indicates it will stay there: $$\alpha_i = 0 \text{ and } G_i < m_{up} - \epsilon, \quad \text{or} \quad \alpha_i = C \text{ and } G_i > M_{low} + \epsilon$$
where $m_{up} = \max_{t \in I_{up}} G_t$ and $M_{low} = \min_{t \in I_{low}} G_t$ are computed over the current active set.
Reconstruction: After convergence on the shrunk problem, all variables are temporarily un-shrunk to verify no violations were missed. This "reconstruction" step ensures correctness.
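To make the shrink/reconstruct cycle concrete, here is a minimal structural sketch of the outer SMO loop, assuming hypothetical callbacks `select_pair`, `take_step`, and `compute_grad` that stand in for the selection, analytic two-variable update, and gradient maintenance discussed on this page. The pruning test mirrors the `ShrinkingManager.shrink` logic in the implementation further below.

```python
import numpy as np

def train_with_shrinking(select_pair, take_step, compute_grad,
                         y, C, tol=1e-3, shrink_every=1000, max_iter=100_000):
    """Sketch: SMO outer loop with shrinking and final reconstruction.

    select_pair(alpha, grad, active) -> (i, j), or (-1, -1) if no violator;
    take_step(alpha, i, j) -> updated alpha; compute_grad(alpha, active) -> G.
    All three are hypothetical stand-ins, not part of any specific library.
    """
    n = len(y)
    alpha = np.zeros(n)
    active = np.ones(n, dtype=bool)  # All variables start active

    def prune(grad):
        # Drop bound variables whose gradient says they will stay put.
        up = active & (((alpha < C) & (y == 1)) | ((alpha > 0) & (y == -1)))
        low = active & (((alpha < C) & (y == -1)) | ((alpha > 0) & (y == 1)))
        if up.any() and low.any():
            m_up, M_low = grad[up].max(), grad[low].min()
            active[(alpha == 0) & (grad < m_up - tol)] = False
            active[(alpha == C) & (grad > M_low + tol)] = False

    for it in range(1, max_iter + 1):
        grad = compute_grad(alpha, active)
        i, j = select_pair(alpha, grad, active)
        if i < 0:
            # Converged on the shrunk problem: reconstruct and re-verify.
            active[:] = True
            grad = compute_grad(alpha, active)
            i, j = select_pair(alpha, grad, active)
            if i < 0:
                break  # No violators anywhere: truly converged
        alpha = take_step(alpha, i, j)  # Analytic two-variable update
        if it % shrink_every == 0:
            prune(grad)
    return alpha
```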
Kernel evaluations are typically the dominant cost in SVM training. For an RBF kernel: $$K(x_i, x_j) = \exp\left(-\gamma \|x_i - x_j\|^2\right)$$
Each evaluation requires O(d) operations for d-dimensional features. With millions of evaluations needed, caching becomes essential.
The Cache: Maintain an LRU (Least Recently Used) cache of kernel matrix columns. When an element Kᵢⱼ is requested, serve it from the cached column i if present; on a miss, compute the entire column, store it, and evict the least recently used column if the cache is full.
Why Full Columns? SMO typically needs multiple entries from the same column (Kᵢⱼ for varying j with fixed i). Computing and caching an entire column amortizes the evaluation cost.
Cache Sizing: Typical systems allocate a few hundred MB to several GB for the kernel cache. The optimal size depends on the number of training examples, the available memory, and how concentrated the access pattern is (how often the same columns are revisited). Empirical rule: a cache of roughly 100-400 MB works well for most problems up to 100K examples.
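To make the sizing rule concrete: one cached float64 column for n examples occupies 8n bytes, so at n = 100,000 each column is 0.8 MB and a 400 MB cache holds $400 / 0.8 = 500$ columns at a time, far fewer than n. That gap is exactly why the eviction policy matters.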
```python
import numpy as np
from collections import OrderedDict


class KernelCache:
    """
    LRU cache for kernel matrix columns.

    Stores full columns to amortize computation cost when multiple
    elements from the same column are needed.
    """

    def __init__(self, kernel_fn, X, y, max_size_mb=400):
        """
        Parameters:
            kernel_fn:   kernel function K(x_i, x_j)
            X:           training data, shape (n, d)
            y:           labels, shape (n,)
            max_size_mb: maximum cache size in megabytes
        """
        self.kernel_fn = kernel_fn
        self.X = X
        self.y = y
        self.n = X.shape[0]

        # Calculate how many columns we can store
        bytes_per_column = self.n * 8  # 64-bit floats
        self.max_columns = int(max_size_mb * 1e6 / bytes_per_column)
        self.max_columns = max(1, self.max_columns)  # At least 1

        # LRU cache using OrderedDict
        self.cache = OrderedDict()

        # Statistics
        self.hits = 0
        self.misses = 0

    def get_column(self, i):
        """
        Get the i-th column of the kernel matrix Q, where
        Q_ij = y_i * y_j * K(x_i, x_j).

        Returns the column as a numpy array of length n.
        """
        if i in self.cache:
            # Cache hit: move to end (most recently used)
            self.cache.move_to_end(i)
            self.hits += 1
            return self.cache[i]

        # Cache miss: compute the column
        self.misses += 1
        column = self._compute_column(i)

        # Evict oldest columns if necessary
        while len(self.cache) >= self.max_columns:
            self.cache.popitem(last=False)

        self.cache[i] = column
        return column

    def get_element(self, i, j):
        """
        Get a single element Q_ij. This caches the entire column i.
        """
        return self.get_column(i)[j]

    def get_diagonal(self):
        """
        Get the diagonal elements Q_ii. These are needed repeatedly,
        so they are computed once and memoized.
        """
        if not hasattr(self, '_diagonal'):
            self._diagonal = np.array([
                self.y[i] ** 2 * self.kernel_fn(self.X[i], self.X[i])
                for i in range(self.n)
            ])
        return self._diagonal

    def _compute_column(self, i):
        """Compute full column i of the Q matrix."""
        column = np.zeros(self.n)
        for j in range(self.n):
            column[j] = self.y[i] * self.y[j] * self.kernel_fn(self.X[i], self.X[j])
        return column

    def invalidate(self, indices):
        """
        Remove columns for the given indices from the cache.
        Called when alphas change significantly.
        """
        for i in indices:
            if i in self.cache:
                del self.cache[i]

    def stats(self):
        """Return cache hit statistics."""
        total = self.hits + self.misses
        hit_rate = self.hits / total if total > 0 else 0
        return {
            'hits': self.hits,
            'misses': self.misses,
            'hit_rate': hit_rate,
            'columns_cached': len(self.cache),
            'max_columns': self.max_columns,
        }


class ShrinkingManager:
    """
    Manages shrinking of inactive variables in SMO.
    """

    def __init__(self, n, tol=1e-3, shrink_threshold=1000):
        """
        Parameters:
            n:                number of training examples
            tol:              KKT tolerance
            shrink_threshold: iterations between shrink attempts
        """
        self.n = n
        self.tol = tol
        self.shrink_threshold = shrink_threshold

        # Track which examples are active
        self.active = np.ones(n, dtype=bool)
        self.active_count = n

        # Iteration counter
        self.iteration = 0

    def should_shrink(self):
        """Check if we should attempt shrinking."""
        return self.iteration > 0 and self.iteration % self.shrink_threshold == 0

    def shrink(self, alpha, y, grad, C):
        """
        Shrink inactive variables from the active set.

        Parameters:
            alpha: current Lagrange multipliers
            y:     labels
            grad:  gradient values (G = -y * gradW, i.e. the errors)
            C:     regularization parameter

        Returns: number of newly shrunk variables
        """
        active_mask = self.active

        # Index sets restricted to the currently active examples:
        # up:  (alpha < C and y = +1) or (alpha > 0 and y = -1)
        # low: (alpha < C and y = -1) or (alpha > 0 and y = +1)
        up_mask = active_mask & (((alpha < C) & (y == 1)) | ((alpha > 0) & (y == -1)))
        low_mask = active_mask & (((alpha < C) & (y == -1)) | ((alpha > 0) & (y == 1)))

        if np.sum(up_mask) == 0 or np.sum(low_mask) == 0:
            return 0

        G_max = np.max(grad[up_mask])   # m_up over the active set
        G_min = np.min(grad[low_mask])  # M_low over the active set

        # Shrink examples that are at bounds and won't move
        shrink_count = 0
        for i in range(self.n):
            if not self.active[i]:
                continue
            if alpha[i] == 0:
                # Will stay at 0 if G[i] is well below G_max
                if grad[i] < G_max - self.tol:
                    self.active[i] = False
                    shrink_count += 1
            elif alpha[i] == C:
                # Will stay at C if G[i] is well above G_min
                if grad[i] > G_min + self.tol:
                    self.active[i] = False
                    shrink_count += 1

        self.active_count = np.sum(self.active)
        return shrink_count

    def unshrink_all(self):
        """
        Unshrink all variables for final verification.
        Called before termination (the reconstruction step).
        """
        self.active = np.ones(self.n, dtype=bool)
        self.active_count = self.n

    def get_active_indices(self):
        """Return indices of active variables."""
        return np.where(self.active)[0]

    def increment_iteration(self):
        """Increment the iteration counter."""
        self.iteration += 1


# Demonstration
def demonstrate_shrinking():
    """Show shrinking in action."""
    np.random.seed(42)
    n = 100

    # Simulate: most alphas at 0, some at bounds, a few free
    alpha = np.zeros(n)
    alpha[10:20] = 0.5  # Free support vectors
    alpha[20:30] = 1.0  # Bound support vectors (alpha = C)

    y = np.random.choice([-1, 1], n)
    grad = np.random.randn(n) * 0.5
    grad[20:30] += 2.0  # Bound SVs have high gradient (won't move)
    C = 1.0

    manager = ShrinkingManager(n, tol=0.1, shrink_threshold=1)
    manager.iteration = 1  # Trigger shrinking

    print(f"Initial active examples: {manager.active_count}")
    shrunk = manager.shrink(alpha, y, grad, C)
    print(f"After shrinking: {manager.active_count} active ({shrunk} shrunk)")
    print(f"Active indices: {manager.get_active_indices()[:20]}...")  # First 20


demonstrate_shrinking()
```

Working set selection transforms SMO from a correct but slow algorithm into a practical optimization workhorse, tracing an evolution from simple heuristics to sophisticated second-order methods.
We've covered the mechanisms of SMO—how variables are selected and optimized. In the next page, we'll examine Convergence—the theoretical guarantees that ensure SMO terminates with the correct solution, and the practical factors that affect how quickly this happens.
Understanding convergence is essential for choosing stopping tolerances, diagnosing training runs that stall, and knowing when to trust the solution a solver returns.
You now understand the working set selection strategies that make SMO practical. From Platt's intuitive heuristics to LIBSVM's second-order methods, you can trace the algorithmic innovations that enable SVMs to train on datasets of meaningful size.