The SMO algorithm's correctness is guaranteed for any valid pair selection—even random choices would eventually converge. But in practice, the choice of which pair to optimize can affect total runtime by orders of magnitude.
Consider a dataset with 100,000 training examples. Random pair selection might require millions of iterations to converge. Intelligent selection—using heuristics that identify the pairs most likely to make progress—can reduce this to tens of thousands. The difference between a computation finishing in minutes versus days often comes down to sophisticated working set selection.
This page explores the evolution of working set selection strategies, from Platt's original heuristics to the second-order methods used in modern implementations like LIBSVM. You'll understand not just what these heuristics do, but why they work—enabling you to tune and debug SVM training effectively.
By the end of this page, you will understand: (1) why naive selection strategies fail, (2) Platt's original first-choice and second-choice heuristics, (3) the maximum violating pair (MVP) strategy, (4) second-order working set selection (WSS2/WSS3), (5) shrinking—the technique that prunes the working set, and (6) caching strategies for kernel evaluations.
At each SMO iteration, we select two variables (αᵢ, αⱼ) to jointly optimize. With n training examples, there are $\binom{n}{2} = \frac{n(n-1)}{2}$ possible pairs. For n = 10,000, that's nearly 50 million options.
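Checking the arithmetic for n = 10,000:

$$\binom{10{,}000}{2} = \frac{10{,}000 \times 9{,}999}{2} = 49{,}995{,}000 \approx 5 \times 10^7.$$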
Not all pairs are equally useful. A pair in which both variables already satisfy the KKT conditions (or both sit pinned at a bound they cannot move past) yields no progress at all, while a pair with a large KKT violation and a large error difference |E₁ - E₂| can decrease the objective substantially in a single step.
Empirical studies show dramatic differences in iteration counts:
| Selection Strategy | Iterations to Converge | Kernel Evaluations | Training Time |
|---|---|---|---|
| Random selection | ~2,500,000 | ~80 billion | Hours |
| Platt's heuristics | ~150,000 | ~5 billion | Minutes |
| MVP (first-order) | ~45,000 | ~1.5 billion | ~1 minute |
| WSS3 (second-order) | ~18,000 | ~600 million | ~30 seconds |
The 140× reduction in iterations between random and WSS3 represents the accumulated insights of two decades of algorithm development. Let's trace this evolution.
Better selection strategies require more computation per iteration, so there is a tradeoff. The key insight is that kernel evaluations dominate SVM training cost: a strategy that doubles per-iteration selection overhead but halves the iteration count can roughly halve the total number of kernel evaluations—usually a net win.
SMO uses a working set of size 2. Some algorithms (like SVMlight) use larger working sets of 10-50 variables, solving the resulting QP with an off-the-shelf solver. Larger working sets make more progress per iteration but require more complex QP solvers. SMO's working set of 2 allows analytical solutions, trading iteration count for simplicity.
John Platt's 1998 paper introduced two heuristics that made SMO practical: a first-choice heuristic for selecting the first variable and a second-choice heuristic for selecting the second.
The first variable is chosen by cycling through examples and looking for KKT violations.
Strategy: alternate between a full pass over all examples and repeated passes over only the non-bound examples (0 < αᵢ < C), selecting any example that violates the KKT conditions within tolerance. Once no non-bound violators remain, do another full pass; the algorithm terminates when a full pass finds no violators at all.
Rationale: Non-bound examples are the most "active"—they directly determine the decision boundary. An example at a bound (αᵢ = 0 or αᵢ = C) has effectively been decided: it's either not a support vector or it's being maximally penalized.
By focusing on non-bound examples first, SMO makes maximum progress on the examples that matter most. Full passes are done periodically to discover new support vectors (examples that should move away from αᵢ = 0).
Given the first variable i₁ with error E₁, the second variable is chosen to maximize expected progress:
The Maximum Step Heuristic:
Choose i₂ to maximize |E₁ - E₂|.
Recall that the step size for α₂ is: $$\Delta\alpha_2 = \frac{y_2(E_1 - E_2)}{\eta}$$
For a fixed η (which we don't know without computing kernel values), maximizing |E₁ - E₂| maximizes the step size.
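As a quick worked example with illustrative numbers (not from any particular dataset): suppose y₂ = +1, E₁ = 0.5, and η = 2. A candidate with E₂ = -0.7 gives

$$\Delta\alpha_2 = \frac{y_2(E_1 - E_2)}{\eta} = \frac{0.5 - (-0.7)}{2} = 0.6,$$

while a candidate with E₂ = 0.1 gives only Δα₂ = 0.2. The heuristic prefers the first candidate because its error gap is three times larger.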
```python
import numpy as np


class PlattSMO:
    """
    SMO with Platt's original heuristics for working set selection.
    """

    def __init__(self, kernel_fn, C, tol=1e-3, eps=1e-5):
        self.kernel_fn = kernel_fn
        self.C = C
        self.tol = tol  # Tolerance for KKT checking
        self.eps = eps  # Minimum change threshold

    def first_choice_heuristic(self, alphas, errors, y, C):
        """
        First-choice heuristic: select the first variable for optimization.

        Priority:
          1. Non-bound examples (0 < alpha < C) that violate KKT
          2. All examples that violate KKT

        Returns: index of the first variable, or -1 if none found
        """
        n = len(alphas)

        # First pass: check non-bound examples only
        non_bound_indices = np.where((alphas > 0) & (alphas < C))[0]
        for i in non_bound_indices:
            if self._violates_kkt(alphas[i], y[i] * errors[i]):
                return i

        # Second pass: check all examples
        for i in range(n):
            if self._violates_kkt(alphas[i], y[i] * errors[i]):
                return i

        return -1  # No KKT violation found

    def second_choice_heuristic(self, i1, E1, errors, alphas, y, X, C):
        """
        Second-choice heuristic: select the second variable for optimization.

        Strategy: choose i2 to maximize |E1 - E2|.
        If no suitable non-bound example exists, fall back to random selection.
        """
        n = len(alphas)
        i2 = -1
        max_delta_E = 0

        # First, look among non-bound examples for max |E1 - E2|
        non_bound_indices = np.where((alphas > 0) & (alphas < C))[0]
        for i in non_bound_indices:
            if i == i1:
                continue
            delta_E = abs(E1 - errors[i])
            if delta_E > max_delta_E:
                max_delta_E = delta_E
                i2 = i

        if i2 >= 0:
            return i2

        # No suitable non-bound example found; fall back to a
        # hierarchical random search:
        # 1. Try non-bound examples (random order)
        np.random.shuffle(non_bound_indices)
        for i in non_bound_indices:
            if i != i1:
                return i

        # 2. Try all remaining examples (random order)
        all_indices = np.arange(n)
        np.random.shuffle(all_indices)
        for i in all_indices:
            if i != i1:
                return i

        return -1  # Never reached if n > 1

    def _violates_kkt(self, alpha, r):
        """
        Check if the KKT conditions are violated.
        r = y * E = y * (f(x) - y) = y * f(x) - 1
        """
        return ((r < -self.tol and alpha < self.C) or
                (r > self.tol and alpha > 0))

    def select_working_set(self, alphas, errors, y, X, C):
        """
        Select the pair (i1, i2) to optimize using Platt's heuristics.

        Returns: (i1, i2), or (-1, -1) if converged
        """
        # First choice: find a KKT violator
        i1 = self.first_choice_heuristic(alphas, errors, y, C)
        if i1 < 0:
            return -1, -1  # Converged
        E1 = errors[i1]

        # Second choice: maximize |E1 - E2|
        i2 = self.second_choice_heuristic(i1, E1, errors, alphas, y, X, C)
        return i1, i2


# Example demonstrating the heuristic
def demonstrate_heuristic():
    """
    Show how second-choice picks the variable with maximum error difference.
    """
    # Simulated state: some alphas, errors
    alphas = np.array([0.0, 0.5, 0.0, 0.3, 1.0, 0.7, 0.0, 0.2, 1.0, 0.4])
    errors = np.array([-0.1, 0.3, 0.0, -0.5, 0.2, 0.8, -0.2, 0.1, 0.0, -0.4])

    # Suppose first choice selected i1=3 with E1=-0.5
    i1 = 3
    E1 = errors[i1]

    # Non-bound examples: indices where 0 < alpha < C
    C = 1.0
    non_bound = np.where((alphas > 0) & (alphas < C))[0]

    print(f"Non-bound examples: {non_bound}")
    print(f"First choice: i1={i1}, E1={E1}")

    # Find max |E1 - E2| among non-bound examples
    for i in non_bound:
        if i != i1:
            print(f"  i={i}: E[{i}]={errors[i]:.2f}, "
                  f"|E1-E2|={abs(E1 - errors[i]):.2f}")

    # The maximum is E1=-0.5 vs E5=0.8, giving |E1-E2|=1.3
    print("\nSecond choice selects i2=5 (largest |E1-E2|=1.3)")


demonstrate_heuristic()
```

While groundbreaking, Platt's heuristics have shortcomings:
Ignores η: The step size depends on |E₁ - E₂|/η, but Platt's heuristic ignores η. Two examples with similar errors but very different η values are treated the same.
Error Cache Staleness: Errors change as optimization proceeds, but are only updated after full passes. The cached errors may not reflect current model state.
No Lookahead: The heuristic optimizes locally without considering whether the chosen pair will make global progress toward the solution.
These limitations motivated the development of more sophisticated selection strategies.
The Maximum Violating Pair (MVP) strategy, also known as first-order working set selection, provides a principled approach based on optimality conditions.
Recall that at the optimum, the KKT conditions require (folding the bias $b$ into $f$): $$\nabla_i W \leq 0 \quad \text{if } \alpha_i = 0$$ $$\nabla_i W = 0 \quad \text{if } 0 < \alpha_i < C$$ $$\nabla_i W \geq 0 \quad \text{if } \alpha_i = C$$
where $\nabla_i W = 1 - y_i \sum_j \alpha_j y_j K_{ij}$ is the gradient component.
Define:
$$I_{up} = \{t : \alpha_t < C,\ y_t = +1\} \cup \{t : \alpha_t > 0,\ y_t = -1\}$$
$$I_{low} = \{t : \alpha_t < C,\ y_t = -1\} \cup \{t : \alpha_t > 0,\ y_t = +1\}$$
These sets represent the variables that are still free to move: for $t \in I_{up}$ the product $y_t \alpha_t$ can increase without leaving the box $[0, C]$, and for $t \in I_{low}$ it can decrease.
Select (i, j) where: $$i = \arg\max_{t \in I_{up}} (-y_t \nabla_t W)$$ $$j = \arg\min_{t \in I_{low}} (-y_t \nabla_t W)$$
Convergence Check: The algorithm has converged when: $$\max_{t \in I_{up}} (-y_t \nabla_t W) - \min_{t \in I_{low}} (-y_t \nabla_t W) \leq \epsilon$$
This gap measures how close we are to satisfying KKT conditions.
The quantity $-y_i \nabla_i W$ is closely related to the error $E_i$. Substituting $\nabla_i W = 1 - y_i f(x_i)$: $$-y_i \nabla_i W = -y_i\left(1 - y_i f(x_i)\right) = -y_i + y_i^2 f(x_i) = f(x_i) - y_i = E_i$$ (with some abuse of notation regarding whether $b$ is included). MVP effectively selects the most positive and most negative errors, similar to Platt's heuristic but with a formal optimality guarantee.
```python
import numpy as np


class MVPWorkingSet:
    """
    Maximum Violating Pair (MVP) working set selection.

    Also known as first-order working set selection because it uses
    gradient information (first derivative of the objective).
    """

    def __init__(self, tol=1e-3):
        self.tol = tol

    def compute_gradient(self, alpha, y, K, b):
        """
        Compute the selection quantity derived from the dual gradient.

        grad_i W = 1 - y_i * sum_j alpha_j y_j K_ij = 1 - y_i f(x_i)
        (ignoring b for simplicity)

        The quantity used for selection is -y_i grad_i W = f(x_i) - y_i = E_i.
        """
        f = K @ (alpha * y) + b  # Decision function values
        return f - y             # This is just the error E_i

    def get_index_sets(self, alpha, y, C):
        """
        Compute the I_up and I_low index sets.

        I_up:  indices where y_t * alpha_t can increase
        I_low: indices where y_t * alpha_t can decrease
        """
        # I_up: (alpha < C and y = +1) or (alpha > 0 and y = -1)
        I_up = np.where(
            ((alpha < C) & (y == 1)) | ((alpha > 0) & (y == -1))
        )[0]

        # I_low: (alpha < C and y = -1) or (alpha > 0 and y = +1)
        I_low = np.where(
            ((alpha < C) & (y == -1)) | ((alpha > 0) & (y == 1))
        )[0]

        return I_up, I_low

    def select_working_set(self, alpha, y, K, b, C):
        """
        Select (i, j) using the MVP strategy.

        Returns:
            i, j: indices of the selected pair (-1, -1 if converged)
            gap:  the optimality gap (max - min)
        """
        # Compute gradient-related quantities (effectively errors)
        G = self.compute_gradient(alpha, y, K, b)

        # Get index sets
        I_up, I_low = self.get_index_sets(alpha, y, C)

        if len(I_up) == 0 or len(I_low) == 0:
            return -1, -1, 0.0  # Converged (degenerate case)

        # Find the maximum violating pair:
        # i = argmax over I_up of G  (most positive error)
        # j = argmin over I_low of G (most negative error)
        i = I_up[np.argmax(G[I_up])]
        j = I_low[np.argmin(G[I_low])]

        gap = G[i] - G[j]
        if gap < self.tol:
            return -1, -1, gap  # Converged

        return i, j, gap

    def compute_optimality_gap(self, alpha, y, K, b, C):
        """
        Compute the optimality gap without selecting a pair.
        Used for monitoring convergence.
        """
        G = self.compute_gradient(alpha, y, K, b)
        I_up, I_low = self.get_index_sets(alpha, y, C)

        if len(I_up) == 0 or len(I_low) == 0:
            return 0.0

        return np.max(G[I_up]) - np.min(G[I_low])


# Demonstration
def demonstrate_mvp():
    """
    Show how MVP selects the maximum violating pair.
    """
    # Small example: 6 training examples
    C = 1.0

    # Current state
    alpha = np.array([0.0, 0.5, 0.8, 0.0, 1.0, 0.3])
    y = np.array([1, 1, -1, -1, 1, -1])

    # Simulated errors (would come from kernel computation)
    G = np.array([-0.2, 0.4, -0.6, 0.3, 0.1, 0.5])

    mvp = MVPWorkingSet()
    I_up, I_low = mvp.get_index_sets(alpha, y, C)

    print("Alpha values:", alpha)
    print("Labels y:", y)
    print("Errors G:", G)
    print(f"\nI_up (y*alpha can increase): {I_up}")
    print(f"I_low (y*alpha can decrease): {I_low}")

    print("\nAnalyzing candidates:")
    print("  I_up members:")
    for idx in I_up:
        print(f"    i={idx}: alpha={alpha[idx]:.1f}, y={int(y[idx]):+d}, G={G[idx]:.2f}")
    print("  I_low members:")
    for idx in I_low:
        print(f"    j={idx}: alpha={alpha[idx]:.1f}, y={int(y[idx]):+d}, G={G[idx]:.2f}")

    # Maximum from I_up, minimum from I_low
    i = I_up[np.argmax(G[I_up])]
    j = I_low[np.argmin(G[I_low])]
    gap = G[i] - G[j]

    print(f"\nSelected pair: i={i}, j={j}")
    print(f"Gap = G[{i}] - G[{j}] = {G[i]:.2f} - {G[j]:.2f} = {gap:.2f}")


demonstrate_mvp()
```

| Aspect | Platt's Heuristics | MVP |
|---|---|---|
| Selection basis | Errors & non-bound status | Gradients & index sets |
| Convergence criterion | All KKT satisfied | Optimality gap < ε |
| Theoretical guarantee | Converges but slow | Converges with rate bounds |
| Computational cost | O(n) per iteration | O(n) per iteration |
| Practical speed | Good | Better |
MVP provides both faster convergence and a clean theoretical framework. However, it still ignores second-order information (the curvature η), motivating further improvements.
LIBSVM, the most widely used SVM implementation, employs second-order working set selection that accounts for curvature. This dramatically improves convergence, especially for difficult problems.
MVP selects based on gradients alone. But the actual improvement from optimizing (i, j) depends on:
$$\Delta W = \frac{(y_i \nabla_i W - y_j \nabla_j W)^2}{2\eta_{ij}}$$
where $\eta_{ij} = K_{ii} + K_{jj} - 2K_{ij}$.
Two pairs can have identical gradient differences but vastly different η values. Since $\Delta W = b^2 / (2\eta)$ for gradient gap $b$, the pair with smaller curvature permits a larger step and a larger objective decrease. A first-order method, blind to η, might choose a pair with unfavorable curvature.
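Concretely, using the same illustrative numbers as the demo code below: both candidates have gradient gap $b = 1.2$, but curvatures $\eta = 0.5$ and $\eta = 2.0$. The predicted improvements are

$$\Delta W_1 = \frac{1.2^2}{2 \times 0.5} = 1.44, \qquad \Delta W_2 = \frac{1.2^2}{2 \times 2.0} = 0.36.$$

A first-order rule cannot tell the candidates apart; the second-order rule correctly prefers the first, a 4× larger improvement.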
Fan, Chen, and Lin proposed selecting $i$ using MVP (first-order), then selecting $j$ using second-order information:
Selection of i (WSS2): $$i = \arg\max_{t \in I_{up}} (-y_t \nabla_t W)$$
Selection of j (WSS2): $$j = \arg\min_{t \in I_{low}} \frac{-\left(y_i \nabla_i W - y_t \nabla_t W\right)^2}{\eta_{it}}$$ restricted to candidates $t$ with $-y_t \nabla_t W < -y_i \nabla_i W$, so that the pair can actually make progress.
Minimizing this negative quantity maximizes the predicted objective decrease $(y_i \nabla_i W - y_t \nabla_t W)^2 / (2\eta_{it})$, so $j$ is the candidate expected to make the most actual progress.
WSS2 requires computing ηᵢₜ = Kᵢᵢ + Kₜₜ - 2Kᵢₜ for all candidates t. Since Kₜₜ is fixed (diagonal elements), only Kᵢₜ must be computed. This adds O(|I_low|) kernel evaluations per iteration but usually pays for itself in reduced iteration count.
```python
import numpy as np


class SecondOrderWSS:
    """
    LIBSVM-style second-order working set selection.

    Uses first-order selection for i and second-order selection for j.
    This is the core selection scheme used in LIBSVM (WSS2/WSS3).
    """

    def __init__(self, kernel_fn, tol=1e-3, tau=1e-12):
        """
        Parameters:
            kernel_fn: kernel function K(x_i, x_j)
            tol:       convergence tolerance
            tau:       small constant to prevent division by zero
        """
        self.kernel_fn = kernel_fn
        self.tol = tol
        self.tau = tau  # For numerical stability in eta

    def select_working_set(self, alpha, y, X, Q_diag, grad, C):
        """
        Select (i, j) using the second-order heuristic (WSS2).

        Parameters:
            alpha:  current Lagrange multipliers
            y:      labels
            X:      training data
            Q_diag: diagonal of the Q matrix (precomputed)
            grad:   selection values G = -y * gradW (the errors E)
            C:      regularization parameter

        Returns:
            i, j: selected indices (-1, -1 if converged)
        """
        # Get index sets
        I_up = np.where(
            ((alpha < C) & (y == 1)) | ((alpha > 0) & (y == -1))
        )[0]
        I_low = np.where(
            ((alpha < C) & (y == -1)) | ((alpha > 0) & (y == 1))
        )[0]

        if len(I_up) == 0 or len(I_low) == 0:
            return -1, -1

        # First-order selection of i: maximize G[t] over I_up
        i = I_up[np.argmax(grad[I_up])]
        G_i = grad[i]

        # Check if we can still make progress
        G_min_low = np.min(grad[I_low])
        if G_i - G_min_low < self.tol:
            return -1, -1  # Converged

        # Second-order selection of j: minimize -b^2/a where
        #   a = eta_it = Q_ii + Q_tt - 2*Q_it
        #   b = G_i - G_t
        Q_ii = Q_diag[i]
        best_j = -1
        best_obj = float('inf')

        for t in I_low:
            b = G_i - grad[t]
            if b <= 0:
                continue  # Would not improve the objective

            # Compute Q_it (requires one kernel evaluation)
            Q_it = y[i] * y[t] * self.kernel_fn(X[i], X[t])

            # Compute eta = Q_ii + Q_tt - 2*Q_it
            a = Q_ii + Q_diag[t] - 2 * Q_it
            a = max(a, self.tau)  # Ensure positive for stability

            # We want the largest decrease b^2/(2a),
            # i.e. the most negative value of -b^2/a
            obj = -(b * b) / a
            if obj < best_obj:
                best_obj = obj
                best_j = t

        return i, best_j

    def select_working_set_wss3(self, alpha, y, X, Q_diag, grad, C):
        """
        WSS3: use second-order selection for BOTH i and j.

        Can select better pairs than WSS2, but it examines all pairs in
        I_up x I_low, costing O(n^2) kernel evaluations per iteration.
        """
        I_up = np.where(
            ((alpha < C) & (y == 1)) | ((alpha > 0) & (y == -1))
        )[0]
        I_low = np.where(
            ((alpha < C) & (y == -1)) | ((alpha > 0) & (y == 1))
        )[0]

        if len(I_up) == 0 or len(I_low) == 0:
            return -1, -1

        best_i, best_j = -1, -1
        best_obj = float('inf')

        # Consider all pairs (i, j) from I_up x I_low
        for i in I_up:
            for j in I_low:
                b = grad[i] - grad[j]
                if b <= 0:
                    continue

                Q_ij = y[i] * y[j] * self.kernel_fn(X[i], X[j])
                a = Q_diag[i] + Q_diag[j] - 2 * Q_ij
                a = max(a, self.tau)

                obj = -(b * b) / a
                if obj < best_obj:
                    best_obj = obj
                    best_i = i
                    best_j = j

        return best_i, best_j


# Comparison demonstration
def compare_first_second_order():
    """
    Demonstrate the difference between first- and second-order selection.

    Scenario: two candidate j values with the same gradient difference
    but different curvature (eta).
    """
    print("Example: choosing between two candidates with same gradient gap")
    print("=" * 60)

    # First candidate: low curvature (flat objective)
    G_i = 0.8
    G_j1 = -0.4
    eta_j1 = 0.5  # Low curvature

    # Second candidate: same gradient gap but high curvature
    G_j2 = -0.4
    eta_j2 = 2.0  # High curvature

    # First-order would be indifferent (same |G_i - G_j|)
    gap_j1 = G_i - G_j1
    gap_j2 = G_i - G_j2
    print("\nFirst-order (gradient gap):")
    print(f"  Candidate j1: gap = {gap_j1:.2f}")
    print(f"  Candidate j2: gap = {gap_j2:.2f}")
    print("  First-order is INDIFFERENT")

    # Second-order prefers low curvature (larger actual step)
    obj_j1 = (gap_j1 ** 2) / (2 * eta_j1)
    obj_j2 = (gap_j2 ** 2) / (2 * eta_j2)
    print("\nSecond-order (objective improvement = gap^2 / 2*eta):")
    print(f"  Candidate j1: dObj = {obj_j1:.3f} (eta={eta_j1})")
    print(f"  Candidate j2: dObj = {obj_j2:.3f} (eta={eta_j2})")
    print(f"  Second-order prefers j1 ({obj_j1:.3f} > {obj_j2:.3f})")

    print(f"\n  Ratio of improvements: j1/j2 = {obj_j1/obj_j2:.1f}x better!")


compare_first_second_order()
```

Second-order selection can provide significant speedups:
| Dataset | First-Order Iterations | Second-Order Iterations | Speedup |
|---|---|---|---|
| adult (32K) | 45,000 | 18,000 | 2.5× |
| web (49K) | 68,000 | 24,000 | 2.8× |
| ijcnn (49K) | 22,000 | 8,000 | 2.75× |
| covtype (581K) | 850,000 | 280,000 | 3.0× |
The speedup tends to be larger for difficult problems, such as those with highly nonlinear kernels, ill-conditioned kernel matrices, or large values of C. For simple problems or linear kernels, the overhead of second-order selection may not pay off.
Two additional techniques are essential for practical SMO performance: shrinking and kernel caching. These optimizations can reduce training time by an order of magnitude.
During optimization, many variables quickly reach their bounds (αᵢ = 0 or αᵢ = C) and stay there. Shrinking removes these "inactive" variables from consideration:
The Idea: If αᵢ has been at a bound for several consecutive iterations, temporarily remove it from the working set candidate pool.
When to Remove: Variable $i$ is shrunk if it sits at a bound and its gradient indicates it will stay there: $$\alpha_i = 0 \text{ and } G_i < m_{up} - \epsilon, \quad \text{or} \quad \alpha_i = C \text{ and } G_i > M_{low} + \epsilon$$
where $m_{up} = \max_{t \in I_{up}} G_t$ and $M_{low} = \min_{t \in I_{low}} G_t$ are computed over the current active set.
Reconstruction: After convergence on the shrunk problem, all variables are temporarily un-shrunk to verify no violations were missed. This "reconstruction" step ensures correctness.
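To make the shrink/reconstruct cycle concrete, here is a minimal structural sketch of the outer SMO loop, assuming hypothetical callbacks `select_pair`, `take_step`, and `compute_grad` that stand in for the selection, analytic two-variable update, and gradient maintenance discussed on this page. The pruning test mirrors the `ShrinkingManager.shrink` logic in the implementation further below.

```python
import numpy as np

def train_with_shrinking(select_pair, take_step, compute_grad,
                         y, C, tol=1e-3, shrink_every=1000, max_iter=100_000):
    """Sketch: SMO outer loop with shrinking and final reconstruction.

    select_pair(alpha, grad, active) -> (i, j), or (-1, -1) if no violator;
    take_step(alpha, i, j) -> updated alpha; compute_grad(alpha, active) -> G.
    All three are hypothetical stand-ins, not part of any specific library.
    """
    n = len(y)
    alpha = np.zeros(n)
    active = np.ones(n, dtype=bool)  # All variables start active

    def prune(grad):
        # Drop bound variables whose gradient says they will stay put.
        up = active & (((alpha < C) & (y == 1)) | ((alpha > 0) & (y == -1)))
        low = active & (((alpha < C) & (y == -1)) | ((alpha > 0) & (y == 1)))
        if up.any() and low.any():
            m_up, M_low = grad[up].max(), grad[low].min()
            active[(alpha == 0) & (grad < m_up - tol)] = False
            active[(alpha == C) & (grad > M_low + tol)] = False

    for it in range(1, max_iter + 1):
        grad = compute_grad(alpha, active)
        i, j = select_pair(alpha, grad, active)
        if i < 0:
            # Converged on the shrunk problem: reconstruct and re-verify.
            active[:] = True
            grad = compute_grad(alpha, active)
            i, j = select_pair(alpha, grad, active)
            if i < 0:
                break  # No violators anywhere: truly converged
        alpha = take_step(alpha, i, j)  # Analytic two-variable update
        if it % shrink_every == 0:
            prune(grad)
    return alpha
```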
Kernel evaluations are typically the dominant cost in SVM training. For an RBF kernel: $$K(x_i, x_j) = \exp\left(-\gamma \|x_i - x_j\|^2\right)$$
Each evaluation requires O(d) operations for d-dimensional features. With millions of evaluations needed, caching becomes essential.
The Cache: Maintain an LRU (Least Recently Used) cache of kernel matrix columns. When an element Kᵢⱼ is requested, serve it from the cached column i if present; on a miss, compute the entire column, store it, and evict the least recently used column if the cache is full.
Why Full Columns? SMO typically needs multiple entries from the same column (Kᵢⱼ for varying j with fixed i). Computing and caching an entire column amortizes the evaluation cost.
Cache Sizing: Typical systems allocate a few hundred MB to several GB for the kernel cache. The optimal size depends on the number of training examples, the available memory, and how concentrated the access pattern is (how often the same columns are revisited). Empirical rule: a cache of roughly 100-400 MB works well for most problems up to 100K examples.
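To make the sizing rule concrete: one cached float64 column for n examples occupies 8n bytes, so at n = 100,000 each column is 0.8 MB and a 400 MB cache holds $400 / 0.8 = 500$ columns at a time, far fewer than n. That gap is exactly why the eviction policy matters.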
```python
import numpy as np
from collections import OrderedDict


class KernelCache:
    """
    LRU cache for kernel matrix columns.

    Stores full columns to amortize computation cost when multiple
    elements from the same column are needed.
    """

    def __init__(self, kernel_fn, X, y, max_size_mb=400):
        """
        Parameters:
            kernel_fn:   kernel function K(x_i, x_j)
            X:           training data, shape (n, d)
            y:           labels, shape (n,)
            max_size_mb: maximum cache size in megabytes
        """
        self.kernel_fn = kernel_fn
        self.X = X
        self.y = y
        self.n = X.shape[0]

        # Calculate how many columns we can store
        bytes_per_column = self.n * 8  # 64-bit floats
        self.max_columns = int(max_size_mb * 1e6 / bytes_per_column)
        self.max_columns = max(1, self.max_columns)  # At least 1

        # LRU cache using OrderedDict
        self.cache = OrderedDict()

        # Statistics
        self.hits = 0
        self.misses = 0

    def get_column(self, i):
        """
        Get the i-th column of the kernel matrix Q, where
        Q_ij = y_i * y_j * K(x_i, x_j).

        Returns the column as a numpy array of length n.
        """
        if i in self.cache:
            # Cache hit: move to end (most recently used)
            self.cache.move_to_end(i)
            self.hits += 1
            return self.cache[i]

        # Cache miss: compute the column
        self.misses += 1
        column = self._compute_column(i)

        # Evict oldest columns if necessary
        while len(self.cache) >= self.max_columns:
            self.cache.popitem(last=False)

        self.cache[i] = column
        return column

    def get_element(self, i, j):
        """
        Get a single element Q_ij. This caches the entire column i.
        """
        return self.get_column(i)[j]

    def get_diagonal(self):
        """
        Get the diagonal elements Q_ii. These are needed repeatedly,
        so they are computed once and memoized.
        """
        if not hasattr(self, '_diagonal'):
            self._diagonal = np.array([
                self.y[i] ** 2 * self.kernel_fn(self.X[i], self.X[i])
                for i in range(self.n)
            ])
        return self._diagonal

    def _compute_column(self, i):
        """Compute full column i of the Q matrix."""
        column = np.zeros(self.n)
        for j in range(self.n):
            column[j] = self.y[i] * self.y[j] * self.kernel_fn(self.X[i], self.X[j])
        return column

    def invalidate(self, indices):
        """
        Remove columns for the given indices from the cache.
        Called when alphas change significantly.
        """
        for i in indices:
            if i in self.cache:
                del self.cache[i]

    def stats(self):
        """Return cache hit statistics."""
        total = self.hits + self.misses
        hit_rate = self.hits / total if total > 0 else 0
        return {
            'hits': self.hits,
            'misses': self.misses,
            'hit_rate': hit_rate,
            'columns_cached': len(self.cache),
            'max_columns': self.max_columns,
        }


class ShrinkingManager:
    """
    Manages shrinking of inactive variables in SMO.
    """

    def __init__(self, n, tol=1e-3, shrink_threshold=1000):
        """
        Parameters:
            n:                number of training examples
            tol:              KKT tolerance
            shrink_threshold: iterations between shrink attempts
        """
        self.n = n
        self.tol = tol
        self.shrink_threshold = shrink_threshold

        # Track which examples are active
        self.active = np.ones(n, dtype=bool)
        self.active_count = n

        # Iteration counter
        self.iteration = 0

    def should_shrink(self):
        """Check if we should attempt shrinking."""
        return self.iteration > 0 and self.iteration % self.shrink_threshold == 0

    def shrink(self, alpha, y, grad, C):
        """
        Shrink inactive variables from the active set.

        Parameters:
            alpha: current Lagrange multipliers
            y:     labels
            grad:  gradient values (G = -y * gradW, i.e. the errors)
            C:     regularization parameter

        Returns: number of newly shrunk variables
        """
        active_mask = self.active

        # Index sets restricted to the currently active examples:
        # up:  (alpha < C and y = +1) or (alpha > 0 and y = -1)
        # low: (alpha < C and y = -1) or (alpha > 0 and y = +1)
        up_mask = active_mask & (((alpha < C) & (y == 1)) | ((alpha > 0) & (y == -1)))
        low_mask = active_mask & (((alpha < C) & (y == -1)) | ((alpha > 0) & (y == 1)))

        if np.sum(up_mask) == 0 or np.sum(low_mask) == 0:
            return 0

        G_max = np.max(grad[up_mask])   # m_up over the active set
        G_min = np.min(grad[low_mask])  # M_low over the active set

        # Shrink examples that are at bounds and won't move
        shrink_count = 0
        for i in range(self.n):
            if not self.active[i]:
                continue
            if alpha[i] == 0:
                # Will stay at 0 if G[i] is well below G_max
                if grad[i] < G_max - self.tol:
                    self.active[i] = False
                    shrink_count += 1
            elif alpha[i] == C:
                # Will stay at C if G[i] is well above G_min
                if grad[i] > G_min + self.tol:
                    self.active[i] = False
                    shrink_count += 1

        self.active_count = np.sum(self.active)
        return shrink_count

    def unshrink_all(self):
        """
        Unshrink all variables for final verification.
        Called before termination (the reconstruction step).
        """
        self.active = np.ones(self.n, dtype=bool)
        self.active_count = self.n

    def get_active_indices(self):
        """Return indices of active variables."""
        return np.where(self.active)[0]

    def increment_iteration(self):
        """Increment the iteration counter."""
        self.iteration += 1


# Demonstration
def demonstrate_shrinking():
    """Show shrinking in action."""
    np.random.seed(42)
    n = 100

    # Simulate: most alphas at 0, some at bounds, a few free
    alpha = np.zeros(n)
    alpha[10:20] = 0.5  # Free support vectors
    alpha[20:30] = 1.0  # Bound support vectors (alpha = C)

    y = np.random.choice([-1, 1], n)
    grad = np.random.randn(n) * 0.5
    grad[20:30] += 2.0  # Bound SVs have high gradient (won't move)
    C = 1.0

    manager = ShrinkingManager(n, tol=0.1, shrink_threshold=1)
    manager.iteration = 1  # Trigger shrinking

    print(f"Initial active examples: {manager.active_count}")
    shrunk = manager.shrink(alpha, y, grad, C)
    print(f"After shrinking: {manager.active_count} active ({shrunk} shrunk)")
    print(f"Active indices: {manager.get_active_indices()[:20]}...")  # First 20


demonstrate_shrinking()
```

Working set selection transforms SMO from a correct but slow algorithm into a practical optimization workhorse, tracing an evolution from simple heuristics to sophisticated second-order methods.
We've covered the mechanisms of SMO—how variables are selected and optimized. In the next page, we'll examine Convergence—the theoretical guarantees that ensure SMO terminates with the correct solution, and the practical factors that affect how quickly this happens.
Understanding convergence is essential for choosing stopping tolerances, diagnosing training runs that stall, and knowing when to trust the solution a solver returns.
You now understand the working set selection strategies that make SMO practical. From Platt's intuitive heuristics to LIBSVM's second-order methods, you can trace the algorithmic innovations that enable SVMs to train on datasets of meaningful size.