An optimization algorithm is only as good as its convergence properties. For SMO, we need to answer several critical questions:
These questions move from theoretical (guaranteed correctness) to practical (engineering performance). A practitioner needs both: confidence that the algorithm will eventually succeed, and knowledge to make it succeed quickly.
This page provides deep coverage of convergence theory and practice, equipping you to understand, tune, and troubleshoot SVM training.
By the end of this page, you will understand: (1) proof that SMO converges to the global optimum, (2) convergence rate characterization, (3) stopping criteria and their implications, (4) factors that accelerate or slow convergence, (5) convergence monitoring and diagnostics, and (6) handling pathological cases.
The most fundamental property of SMO is that it always converges to the global optimal solution, assuming proper working set selection. This guarantee relies on several mathematical properties of the SVM problem.
The SVM dual objective is: $$W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$
In matrix notation: $$W(\alpha) = \mathbf{1}^T \alpha - \frac{1}{2} \alpha^T Q \alpha$$
where $Q_{ij} = y_i y_j K(x_i, x_j)$.
For a valid (Mercer) kernel, $Q$ is positive semi-definite, making $-W(\alpha)$ a convex function. For convex problems, every local optimum is a global optimum, so any maximizer of $W$ attains the global maximum (and if $Q$ is strictly positive definite, that maximizer is unique).
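This convexity claim is easy to check numerically. The sketch below uses random toy data and an RBF bandwidth of γ = 0.5 (both assumptions made here for illustration) and confirms that $Q = (yy^T) \circ K$ has no negative eigenvalues:

```python
import numpy as np

# Toy sanity check: Q = (y yᵀ) ∘ K is positive semi-definite for an RBF
# kernel, so -W(α) is convex. Data and γ are illustrative assumptions.
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 2))
y = rng.choice([-1.0, 1.0], size=30)

# RBF kernel matrix K(x_i, x_j) = exp(-γ ||x_i - x_j||²) with γ = 0.5
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
K = np.exp(-0.5 * sq_dists)

Q = np.outer(y, y) * K
eigvals = np.linalg.eigvalsh(Q)
print(f"smallest eigenvalue of Q: {eigvals.min():.2e}")
```

Because $Q = DKD$ with $D = \mathrm{diag}(y)$, it is congruent to $K$ and inherits its positive semi-definiteness; the smallest eigenvalue printed should be nonnegative up to round-off.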
Why This Matters:
At each SMO iteration, we optimize two variables while holding others fixed. The key theorem states:
Theorem (Sufficient Decrease): If $(\alpha_i, \alpha_j)$ violate the KKT conditions and are jointly optimized, the objective $W(\alpha)$ strictly increases (equivalently, the minimization objective $-W(\alpha)$ strictly decreases), unless both variables are clipped at their bounds and cannot move.
The strict increase in W is crucial. It means we're always making progress—the objective never decreases or stays flat (except when we're already at the optimum). This prevents cycling and guarantees eventual convergence.
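A minimal numerical illustration of this theorem, using a hypothetical four-point dataset, a linear kernel, and the standard Platt two-variable update (all choices here are assumptions for the sketch, not part of any particular library):

```python
import numpy as np

# One SMO pair update on a KKT-violating pair strictly increases W(α).
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0, 1.0, -1.0])
K = X @ X.T                      # linear kernel
Q = np.outer(y, y) * K
C = 1.0

def W(alpha):
    return alpha.sum() - 0.5 * alpha @ Q @ alpha

alpha = np.zeros(4)              # feasible start; KKT is violated here
i, j = 0, 1                      # a pair with y_i != y_j

# Prediction errors E_k = f(x_k) - y_k with f(x) = Σ_m α_m y_m K(x_m, x)
E = (alpha * y) @ K - y

# Standard two-variable update: optimize α_j along the constraint line
eta = K[i, i] + K[j, j] - 2 * K[i, j]
lo, hi = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
aj_new = np.clip(alpha[j] + y[j] * (E[i] - E[j]) / eta, lo, hi)
ai_new = alpha[i] + y[i] * y[j] * (alpha[j] - aj_new)

W_before = W(alpha)
alpha[i], alpha[j] = ai_new, aj_new
W_after = W(alpha)
print(f"W before = {W_before:.4f}, W after = {W_after:.4f}")
```

The update keeps the equality constraint $\sum_i y_i \alpha_i = 0$ and the box constraints satisfied while strictly increasing $W$, which is exactly the progress argument the convergence proof relies on.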
Theoretically, SMO has two convergence modes:
1. Finite Convergence (Ideal): With exact arithmetic, SMO would identify the exact set of support vectors in finitely many steps, then terminate. Each step makes discrete progress toward discovering whether each αᵢ is at 0, at C, or strictly between.
2. Asymptotic Convergence (Reality): With floating-point arithmetic, we never reach the exact optimal solution. Instead, alphas converge asymptotically, getting arbitrarily close but never exactly there. We must use a tolerance ε to decide when we're "close enough."
Theorem (Keerthi et al., 2001): Under the following conditions, SMO converges in a finite number of iterations:
The proof relies on showing that:
```python
import numpy as np
import matplotlib.pyplot as plt


def analyze_convergence_theory():
    """Demonstrate key concepts in SMO convergence."""
    # Example: dual objective function (2D case for visualization)
    # W(α) = α₁ + α₂ - 0.5 * (α₁²Q₁₁ + 2α₁α₂Q₁₂ + α₂²Q₂₂)

    # Kernel matrix (positive definite for this example)
    Q = np.array([[1.0, 0.3],
                  [0.3, 1.0]])

    def objective(alpha):
        """Dual objective W(α)."""
        return np.sum(alpha) - 0.5 * alpha @ Q @ alpha

    def gradient(alpha):
        """Gradient ∇W(α) = 1 - Qα."""
        return np.ones(2) - Q @ alpha

    # Simulate SMO optimization trajectory
    C = 2.0                       # Upper bound
    alpha = np.array([0.0, 0.0])  # Start at origin
    trajectory = [alpha.copy()]
    objectives = [objective(alpha)]

    for iteration in range(50):
        grad = gradient(alpha)

        # KKT check: is any α at a value where it should still move?
        kkt_violated = False
        for i in range(2):
            if alpha[i] == 0 and grad[i] > 1e-6:
                kkt_violated = True
            elif alpha[i] == C and grad[i] < -1e-6:
                kkt_violated = True
            elif 0 < alpha[i] < C and abs(grad[i]) > 1e-6:
                kkt_violated = True

        if not kkt_violated:
            break

        # Simple coordinate ascent (one variable at a time)
        i = iteration % 2

        # Optimal step along direction i, clipped to [0, C]
        step = grad[i] / Q[i, i]
        alpha[i] = np.clip(alpha[i] + step, 0, C)

        trajectory.append(alpha.copy())
        objectives.append(objective(alpha))

    trajectory = np.array(trajectory)
    objectives = np.array(objectives)

    # Compute optimal solution analytically
    # For the unbounded case: α* = Q⁻¹ · 1
    alpha_opt_unbounded = np.linalg.solve(Q, np.ones(2))
    alpha_opt = np.clip(alpha_opt_unbounded, 0, C)
    W_opt = objective(alpha_opt)

    print("SMO Convergence Analysis")
    print("=" * 50)
    print(f"Kernel matrix Q:\n{Q}")
    print(f"\nOptimal solution: α* = {alpha_opt}")
    print(f"Optimal objective: W* = {W_opt:.4f}")
    print(f"\nTrajectory ({len(trajectory)} points):")
    for i, (a, w) in enumerate(zip(trajectory[:10], objectives[:10])):
        gap = W_opt - w
        print(f"  Iter {i:2d}: α = [{a[0]:.4f}, {a[1]:.4f}], "
              f"W = {w:.4f}, gap = {gap:.4f}")

    return trajectory, objectives, W_opt


def convergence_rate_visualization():
    """Visualize convergence rates: linear vs. sublinear."""
    iterations = np.arange(1, 101)

    # Linear convergence: error decays geometrically, O(ρᵏ)
    rho = 0.9
    linear_error = rho ** iterations

    # Sublinear convergence: error decays as 1/k
    sublinear_error = 1 / iterations

    # Very slow convergence: error decays as 1/√k
    slow_error = 1 / np.sqrt(iterations)

    plt.figure(figsize=(10, 6))
    plt.semilogy(iterations, linear_error, 'b-', linewidth=2, label='Linear: O(ρᵏ)')
    plt.semilogy(iterations, sublinear_error, 'g-', linewidth=2, label='Sublinear: O(1/k)')
    plt.semilogy(iterations, slow_error, 'r-', linewidth=2, label='Slow: O(1/√k)')
    plt.xlabel('Iteration k', fontsize=12)
    plt.ylabel('Optimality Gap (log scale)', fontsize=12)
    plt.title('Convergence Rate Comparison', fontsize=14)
    plt.legend(fontsize=11)
    plt.grid(True, alpha=0.3)
    plt.xlim([0, 100])
    plt.ylim([1e-6, 1])
    return plt.gcf()


# Run analysis
trajectory, objectives, W_opt = analyze_convergence_theory()
```

Knowing that SMO converges is not enough; we also need to understand how fast. The convergence rate determines whether training takes minutes or days.
Optimization algorithms are characterized by their convergence rate:
Linear Convergence: $$W^* - W_k \leq \rho^k (W^* - W_0)$$
Error decreases by a constant factor ρ < 1 each iteration, so after k iterations the error is O(ρᵏ). This is fast: with ρ = 0.8, roughly 10 iterations cut the error by 10×.
Sublinear Convergence: $$W^* - W_k \leq \frac{c}{k^p}$$ for some constant $c > 0$ (not the regularization parameter $C$).
Error decreases polynomially. Common rates are O(1/k) or O(1/√k). This is slower: at an O(1/k) rate, each additional 10× error reduction requires 10× as many total iterations.
SMO's convergence rate depends heavily on the problem and implementation:
First-Order Methods (MVP):
Second-Order Methods (WSS):
With Shrinking:
| Factor | Effect on Rate | Explanation |
|---|---|---|
| Condition number of Q | High κ(Q) → slower | Ill-conditioned kernels create elongated level sets |
| Number of support vectors | More SVs → slower | More free variables to optimize simultaneously |
| Class overlap | More overlap → slower | Ambiguous examples require fine-tuning |
| C parameter | High C → more SVs → slower | Less regularization means more margin violations |
| Kernel bandwidth (RBF γ) | Extreme γ → slower | Very wide/narrow kernels cause ill-conditioning |
| Working set selection | WSS > MVP > Platt | Second-order makes more progress per iteration |
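The condition-number row of the table can be illustrated with a toy experiment. Exact coordinate ascent on $W(\alpha) = \mathbf{1}^T\alpha - \frac{1}{2}\alpha^T Q \alpha$ (unconstrained, so SMO's pairwise update reduces to this) needs far more sweeps when $\kappa(Q)$ is large; the two matrices below are hypothetical examples:

```python
import numpy as np

# Count coordinate-ascent sweeps until the gradient norm drops below tol.
def sweeps_to_converge(Q, tol=1e-8, max_sweeps=100_000):
    alpha = np.zeros(Q.shape[0])
    for sweep in range(1, max_sweeps + 1):
        for i in range(len(alpha)):
            grad_i = 1.0 - Q[i] @ alpha        # ∂W/∂α_i
            alpha[i] += grad_i / Q[i, i]       # exact 1-D maximization
        if np.max(np.abs(1.0 - Q @ alpha)) < tol:
            return sweep
    return max_sweeps

Q_well = np.array([[1.0, 0.1], [0.1, 1.0]])    # κ(Q) ≈ 1.2
Q_ill = np.array([[1.0, 0.99], [0.99, 1.0]])   # κ(Q) = 199
print("well-conditioned:", sweeps_to_converge(Q_well), "sweeps")
print("ill-conditioned: ", sweeps_to_converge(Q_ill), "sweeps")
```

The well-conditioned problem finishes in a handful of sweeps, while the ill-conditioned one needs hundreds: elongated level sets force the coordinate updates to zig-zag.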
A common convergence pattern in SMO is:
Fast Initial Progress: The first 80% of optimization happens quickly as the algorithm identifies which examples are clearly not support vectors (αᵢ = 0) or clearly are (αᵢ = C or free).
Slow Tail: The final 20% takes disproportionately long as the algorithm fine-tunes the exact values of free support vectors.
The slow tail occurs because:
Practical Implication: For many applications, reaching 99% of optimal accuracy in 10% of the computational budget is acceptable. Early stopping can provide substantial time savings with minimal accuracy loss.
If training seems stuck: (1) Plot the objective or optimality gap over iterations—is it still improving? (2) Check kernel condition number—extreme values suggest rescaling or kernel change. (3) Count support vectors—extremely high numbers suggest C is too large. (4) Try bigger tolerance—maybe you're past the point of diminishing returns.
Since asymptotic convergence means we never reach the exact optimum, we need practical stopping criteria. Several options exist, each with trade-offs.
The standard criterion checks the optimality gap:
$$\text{gap} = \max_{i \in I_{up}} G_i - \min_{j \in I_{low}} G_j$$
where $G_i = -y_i \nabla_i W$. Stop when:
$$\text{gap} < \epsilon$$
Typical values: ε = 10⁻³ (LIBSVM default)
Interpretation: The gap measures how far we are from satisfying complementary slackness. A gap of 10⁻³ means the largest KKT violation is about 0.1% of the scale of kernel responses.
The duality gap between primal and dual objectives provides another stopping criterion:
$$\text{duality gap} = P(w, b, \xi) - D(\alpha)$$
At the optimum, primal = dual (strong duality for convex problems). Stop when:
$$\text{duality gap} < \epsilon_{pd}$$
Advantage: Scale-independent measure of optimality. Disadvantage: Requires computing primal objective, which needs $w = \sum \alpha_i y_i \phi(x_i)$—expensive for nonlinear kernels.
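For a linear kernel the primal objective is cheap, so the duality gap can be computed directly. Here is a sketch on a tiny separable two-point problem where the exact solution is known ($\alpha^* = [0.5, 0.5]$, $w^* = (1, 0)$); the data and $C$ are illustrative assumptions:

```python
import numpy as np

X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
C = 10.0

def duality_gap(alpha, b=0.0):
    w = (alpha * y) @ X                            # w = Σ α_i y_i x_i
    xi = np.maximum(0.0, 1.0 - y * (X @ w + b))    # slacks from hinge loss
    primal = 0.5 * w @ w + C * xi.sum()
    Q = np.outer(y, y) * (X @ X.T)
    dual = alpha.sum() - 0.5 * alpha @ Q @ alpha
    return primal - dual

print(duality_gap(np.array([0.4, 0.4])))   # suboptimal α: positive gap
print(duality_gap(np.array([0.5, 0.5])))   # optimal α: gap vanishes
```

At the suboptimal point the gap is strictly positive; at the optimum strong duality makes primal and dual coincide, so the gap drops to zero.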
Simplest criterion: stop after a fixed number of iterations.
When to use: As a safety net to prevent runaway computation. Typical value: max_iter = 10⁶ or 10⁷ for large problems.
Stop when the objective barely changes:
$$\frac{|W_{k} - W_{k-100}|}{|W_k|} < \epsilon_{rel}$$
Advantage: Detects when we're in the slow tail. Disadvantage: Sensitive to scale of objective.
```python
import numpy as np


class ConvergenceMonitor:
    """
    Monitor convergence during SMO training.

    Provides multiple stopping criteria and diagnostic information.
    """

    def __init__(self, tol=1e-3, max_iter=10000, relative_tol=1e-6, patience=100):
        """
        Parameters:
            tol: KKT tolerance (optimality gap threshold)
            max_iter: Maximum number of iterations
            relative_tol: Relative objective change threshold
            patience: Iterations of no improvement before early stopping
        """
        self.tol = tol
        self.max_iter = max_iter
        self.relative_tol = relative_tol
        self.patience = patience

        # History for diagnostics
        self.gaps = []
        self.objectives = []
        self.sv_counts = []
        self.iteration = 0
        self.no_improvement_count = 0
        self.best_objective = -np.inf

    def check_convergence(self, alpha, y, grad, C, objective=None):
        """
        Check if optimization should stop.

        Parameters:
            alpha: Current Lagrange multipliers
            y: Labels
            grad: Gradient G = -y * ∇W
            C: Regularization parameter
            objective: Current dual objective (optional)

        Returns:
            (converged: bool, reason: str)
        """
        self.iteration += 1

        # Compute optimality gap (KKT-based)
        gap = self._compute_optimality_gap(alpha, y, grad, C)
        self.gaps.append(gap)

        # Track objective if provided
        if objective is not None:
            self.objectives.append(objective)
            # Guard against the initial -inf best value before comparing
            if (self.best_objective == -np.inf or
                    objective > self.best_objective
                    + self.relative_tol * abs(self.best_objective)):
                self.best_objective = objective
                self.no_improvement_count = 0
            else:
                self.no_improvement_count += 1

        # Track support vector count
        sv_count = np.sum(alpha > 0)
        self.sv_counts.append(sv_count)

        # Check stopping criteria

        # 1. KKT tolerance
        if gap < self.tol:
            return True, f"Converged: optimality gap {gap:.2e} < tolerance {self.tol}"

        # 2. Maximum iterations
        if self.iteration >= self.max_iter:
            return True, f"Max iterations ({self.max_iter}) reached, gap = {gap:.2e}"

        # 3. No improvement for too long
        if self.no_improvement_count >= self.patience:
            return True, f"Early stopping: no improvement for {self.patience} iterations"

        # 4. Relative objective change (if enough history)
        if len(self.objectives) > 100:
            old_obj = self.objectives[-100]
            new_obj = self.objectives[-1]
            rel_change = abs(new_obj - old_obj) / (abs(new_obj) + 1e-10)
            if rel_change < self.relative_tol:
                return True, f"Converged: relative change {rel_change:.2e} < {self.relative_tol}"

        return False, ""

    def _compute_optimality_gap(self, alpha, y, grad, C):
        """Compute the optimality gap."""
        # I_up: indices where α can increase
        I_up = np.where(
            ((alpha < C) & (y == 1)) | ((alpha > 0) & (y == -1))
        )[0]

        # I_low: indices where α can decrease
        I_low = np.where(
            ((alpha < C) & (y == -1)) | ((alpha > 0) & (y == 1))
        )[0]

        if len(I_up) == 0 or len(I_low) == 0:
            return 0.0

        return np.max(grad[I_up]) - np.min(grad[I_low])

    def summary(self):
        """Return convergence summary statistics."""
        return {
            'iterations': self.iteration,
            'final_gap': self.gaps[-1] if self.gaps else None,
            'final_objective': self.objectives[-1] if self.objectives else None,
            'final_sv_count': self.sv_counts[-1] if self.sv_counts else None,
            'gap_reduction': self.gaps[0] / self.gaps[-1] if len(self.gaps) > 1 else 1,
        }

    def plot_convergence(self):
        """Generate convergence plots."""
        import matplotlib.pyplot as plt

        fig, axes = plt.subplots(2, 2, figsize=(12, 10))

        # Optimality gap (log scale)
        ax = axes[0, 0]
        ax.semilogy(self.gaps, 'b-', linewidth=1)
        ax.axhline(y=self.tol, color='r', linestyle='--', label=f'tolerance={self.tol}')
        ax.set_xlabel('Iteration')
        ax.set_ylabel('Optimality Gap')
        ax.set_title('Convergence of Optimality Gap')
        ax.legend()
        ax.grid(True, alpha=0.3)

        # Objective value
        if self.objectives:
            ax = axes[0, 1]
            ax.plot(self.objectives, 'g-', linewidth=1)
            ax.set_xlabel('Iteration')
            ax.set_ylabel('Dual Objective W(α)')
            ax.set_title('Objective Function Progress')
            ax.grid(True, alpha=0.3)

        # Support vector count
        ax = axes[1, 0]
        ax.plot(self.sv_counts, 'm-', linewidth=1)
        ax.set_xlabel('Iteration')
        ax.set_ylabel('Number of Support Vectors')
        ax.set_title('Support Vector Discovery')
        ax.grid(True, alpha=0.3)

        # Gap per iteration (rate estimate)
        if len(self.gaps) > 10:
            ax = axes[1, 1]
            gaps = np.array(self.gaps)
            gap_ratio = gaps[1:] / gaps[:-1]
            ax.plot(gap_ratio, 'c-', linewidth=0.5, alpha=0.7)
            ax.axhline(y=1.0, color='r', linestyle='--')
            ax.set_xlabel('Iteration')
            ax.set_ylabel('Gap Ratio (gap[k]/gap[k-1])')
            ax.set_title('Convergence Rate Estimate')
            ax.set_ylim([0, 1.5])
            ax.grid(True, alpha=0.3)

        plt.tight_layout()
        return fig


# Example usage
def demonstrate_convergence_monitoring():
    """Simulate SMO training with convergence monitoring."""
    np.random.seed(42)
    n = 100

    monitor = ConvergenceMonitor(tol=1e-3, max_iter=500)

    # Simulate convergence: gap decreases exponentially, then slows
    for i in range(500):
        if i < 200:
            gap = 10 * (0.97 ** i)                 # Fast initial convergence
        else:
            gap = 0.01 + 0.005 * np.random.rand()  # Slow tail

        # Create fake state consistent with this gap
        alpha = np.random.rand(n) * 0.5
        y = np.random.choice([-1, 1], n)
        grad = np.random.randn(n) * gap
        C = 1.0
        objective = -np.random.rand() * gap        # Larger gap = more negative

        converged, reason = monitor.check_convergence(alpha, y, grad, C, objective)
        if converged:
            print(f"Stopped at iteration {i}: {reason}")
            break

    print("\nConvergence Summary:")
    for key, value in monitor.summary().items():
        print(f"  {key}: {value}")


demonstrate_convergence_monitoring()
```

The tolerance ε profoundly affects training time and solution quality:
| Tolerance | Typical Use Case | Training Time | Solution Quality |
|---|---|---|---|
| 10⁻¹ | Quick prototyping | Very fast | Rough approximation |
| 10⁻³ | Production default | Moderate | Good for most tasks |
| 10⁻⁵ | High-precision needs | Slow | Near-optimal |
| 10⁻⁷ | Research/comparison | Very slow | Essentially optimal |
Rule of thumb: Start with 10⁻³. If test performance is unsatisfactory and you suspect an undertrained model, try 10⁻⁴. Anything below 10⁻⁵ is rarely worth the computational cost.
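This diminishing-returns pattern is easy to see on a toy quadratic dual with linear convergence: each additional decade of precision costs a roughly constant number of extra sweeps, so tightening ε buys less and less. The matrix below is an illustrative assumption, not a real kernel:

```python
import numpy as np

# Cumulative coordinate-ascent sweeps needed to push the gradient norm
# of W(α) = 1ᵀα - ½αᵀQα below successively tighter tolerances.
Q = np.array([[1.0, 0.9], [0.9, 1.0]])   # moderately ill-conditioned
alpha = np.zeros(2)
counts = {}
sweep = 0
for tol in [1e-1, 1e-3, 1e-5, 1e-7]:
    while np.max(np.abs(1.0 - Q @ alpha)) >= tol:
        for i in range(2):
            alpha[i] += (1.0 - Q[i] @ alpha) / Q[i, i]
        sweep += 1
    counts[tol] = sweep
print(counts)
```

The printed counts grow steadily as the tolerance tightens, mirroring the table: each extra row costs noticeably more compute for a smaller quality gain.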
Understanding what affects convergence helps you tune SVM training effectively. We'll examine the key factors systematically.
1. Class Separability:
2. Class Balance:
3. Data Dimensionality:
4. Data Scaling:
RBF Kernel γ Parameter:
Polynomial Kernel Degree:
Kernel Condition Number:
The condition number κ(Q) of the kernel matrix strongly affects convergence: $$\kappa(Q) = \frac{\lambda_{max}(Q)}{\lambda_{min}(Q)}$$
C controls the trade-off between margin width and training error:
Small C (strong regularization): Wide margin, many errors allowed. Fewer support vectors, faster convergence, but potential underfitting.
Large C (weak regularization): Narrow margin, fewer errors allowed. More support vectors (potentially all), slower convergence, better training fit.
Sweet spot: Usually C ∈ [0.1, 1000], found via cross-validation. Extreme C values (< 0.01 or > 10,000) often indicate mismatched kernel or need for different preprocessing.
If your trained SVM has nearly 100% support vectors, something is wrong. Either: (1) C is too large, (2) kernel is mismatched to data, (3) classes are inseparable in this representation, or (4) preprocessing is missing. Investigate before proceeding.
When SMO training doesn't converge as expected, systematic diagnosis is essential. Here's a framework for identifying and fixing common issues.
Possible Causes:
Diagnostic:
- Plot optimality gap vs. iteration
- If gap decreasing but slowly → tolerance or kernel issue
- If gap fluctuating → numerical stability issue
- If gap stuck → subproblem degenerate (η ≤ 0 cases)
Possible Causes:
Diagnostic:
- Compare training vs. test accuracy
- Train accuracy >> test accuracy → overfitting
- Both accuracies low → underfitting
- Reduce C and check if test accuracy improves
```python
import numpy as np


class ConvergenceDiagnostics:
    """Tools for diagnosing SMO convergence issues."""

    @staticmethod
    def diagnose_kernel_conditioning(K, threshold=1e6):
        """
        Check kernel matrix conditioning.

        Parameters:
            K: Kernel matrix (n×n)
            threshold: Condition number above which we flag issues

        Returns:
            Dictionary with diagnostic information
        """
        # Compute eigenvalues
        eigenvalues = np.linalg.eigvalsh(K)
        lambda_max = np.max(eigenvalues)
        lambda_min = np.max([np.min(eigenvalues), 1e-10])  # Avoid division by zero
        condition_number = lambda_max / lambda_min

        # Check for numerical issues
        n_negative = np.sum(eigenvalues < 0)
        n_near_zero = np.sum(np.abs(eigenvalues) < 1e-10)

        is_well_conditioned = condition_number < threshold
        is_positive_semidefinite = n_negative == 0

        diagnosis = {
            'condition_number': condition_number,
            'lambda_max': lambda_max,
            'lambda_min': lambda_min,
            'n_negative_eigenvalues': n_negative,
            'n_near_zero_eigenvalues': n_near_zero,
            'well_conditioned': is_well_conditioned,
            'positive_semidefinite': is_positive_semidefinite,
        }

        if not is_positive_semidefinite:
            diagnosis['recommendation'] = "Kernel has negative eigenvalues. Use a valid Mercer kernel."
        elif not is_well_conditioned:
            diagnosis['recommendation'] = (
                f"Condition number {condition_number:.2e} is high. Try adjusting kernel parameters."
            )
        else:
            diagnosis['recommendation'] = "Kernel matrix looks healthy."

        return diagnosis

    @staticmethod
    def diagnose_support_vectors(alpha, C, threshold_ratio=0.9):
        """
        Analyze support vector distribution.

        Parameters:
            alpha: Learned Lagrange multipliers
            C: Regularization parameter
            threshold_ratio: Fraction at which to warn about too many SVs

        Returns:
            Dictionary with diagnostic information
        """
        n = len(alpha)
        n_sv = np.sum(alpha > 0)
        n_bound = np.sum(np.isclose(alpha, C))
        n_free = np.sum((alpha > 0) & (alpha < C - 1e-8))
        n_zero = np.sum(alpha == 0)

        sv_ratio = n_sv / n
        bound_ratio = n_bound / n_sv if n_sv > 0 else 0

        diagnosis = {
            'n_samples': n,
            'n_support_vectors': n_sv,
            'n_bound_svs': n_bound,
            'n_free_svs': n_free,
            'n_non_svs': n_zero,
            'sv_ratio': sv_ratio,
            'bound_sv_ratio': bound_ratio,
        }

        if sv_ratio > threshold_ratio:
            diagnosis['issue'] = 'too_many_svs'
            diagnosis['recommendation'] = (
                f"SV ratio {sv_ratio:.1%} is high. Consider reducing C or changing kernel."
            )
        elif bound_ratio > 0.9 and n_free < 10:
            diagnosis['issue'] = 'almost_all_bound'
            diagnosis['recommendation'] = "Almost all SVs at bounds. C may be too large or problem is degenerate."
        elif n_sv == 0:
            diagnosis['issue'] = 'no_svs'
            diagnosis['recommendation'] = "No support vectors! Check data and parameters."
        else:
            diagnosis['issue'] = None
            diagnosis['recommendation'] = "Support vector distribution looks reasonable."

        return diagnosis

    @staticmethod
    def diagnose_convergence_trajectory(gaps, window=100):
        """
        Analyze convergence trajectory to detect issues.

        Parameters:
            gaps: List of optimality gaps over iterations
            window: Window for trend analysis

        Returns:
            Dictionary with diagnostic information
        """
        gaps = np.array(gaps)

        if len(gaps) < window:
            return {'issue': 'insufficient_data',
                    'recommendation': 'Need more iterations for diagnosis.'}

        # Compute rate of decrease
        early_gaps = gaps[:window]
        late_gaps = gaps[-window:]
        early_decrease = (early_gaps[0] - early_gaps[-1]) / early_gaps[0]
        late_decrease = (late_gaps[0] - late_gaps[-1]) / late_gaps[0]

        # Check for oscillation
        late_std = np.std(late_gaps)
        late_mean = np.mean(late_gaps)
        oscillation_ratio = late_std / late_mean if late_mean > 0 else 0

        # Check for plateau
        is_plateau = late_decrease < 0.01
        is_oscillating = oscillation_ratio > 0.1

        # Estimate convergence rate (geometric)
        if len(gaps) > 2:
            ratios = gaps[1:] / gaps[:-1]
            avg_ratio = np.mean(ratios[-window:])
        else:
            avg_ratio = 1.0

        diagnosis = {
            'early_decrease': early_decrease,
            'late_decrease': late_decrease,
            'oscillation_ratio': oscillation_ratio,
            'avg_convergence_ratio': avg_ratio,
            'is_plateau': is_plateau,
            'is_oscillating': is_oscillating,
        }

        if is_oscillating:
            diagnosis['issue'] = 'oscillation'
            diagnosis['recommendation'] = "Gap is oscillating. Possible numerical issues or tolerance too tight."
        elif is_plateau and late_gaps[-1] > 1e-2:
            diagnosis['issue'] = 'premature_plateau'
            diagnosis['recommendation'] = "Converged to suboptimal solution. Check kernel and parameters."
        elif avg_ratio > 0.999:
            diagnosis['issue'] = 'very_slow'
            diagnosis['recommendation'] = "Very slow convergence. Consider early stopping or parameter tuning."
        else:
            diagnosis['issue'] = None
            diagnosis['recommendation'] = "Convergence trajectory looks normal."

        return diagnosis


# Demonstration
def run_diagnostics_demo():
    """Demonstrate convergence diagnostics."""
    np.random.seed(42)

    print("=== Convergence Diagnostics Demo ===\n")

    # 1. Kernel conditioning
    n = 50
    X = np.random.randn(n, 10)

    # Well-conditioned kernel (moderate gamma)
    gamma = 0.1
    K_good = np.exp(-gamma * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2))
    print("1. Well-conditioned RBF kernel (γ=0.1):")
    diag = ConvergenceDiagnostics.diagnose_kernel_conditioning(K_good)
    for k, v in diag.items():
        print(f"  {k}: {v}")

    # Ill-conditioned kernel (extremely small gamma: K is nearly rank-one)
    gamma = 1e-4
    K_bad = np.exp(-gamma * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2))
    print("\n2. Ill-conditioned RBF kernel (γ=1e-4):")
    diag = ConvergenceDiagnostics.diagnose_kernel_conditioning(K_bad)
    for k, v in diag.items():
        print(f"  {k}: {v}")

    # 2. Support vector analysis
    print("\n3. Support vector analysis:")

    # Case: too many SVs
    alpha_bad = np.random.rand(100) * 0.8
    diag = ConvergenceDiagnostics.diagnose_support_vectors(alpha_bad, C=1.0)
    print(f"  High SV case: {diag['recommendation']}")

    # Case: healthy
    alpha_good = np.zeros(100)
    alpha_good[10:25] = 0.5   # 15 free SVs
    alpha_good[25:30] = 1.0   # 5 bound SVs
    diag = ConvergenceDiagnostics.diagnose_support_vectors(alpha_good, C=1.0)
    print(f"  Healthy case: {diag['recommendation']}")


run_diagnostics_demo()
```

Convergence is where theory meets practice in SVM optimization. We've covered the full spectrum from mathematical guarantees to debugging techniques:
With convergence understood, we turn to Complexity Analysis—the formal study of how SMO's computational requirements scale with problem size. We'll derive time and space complexity bounds, analyze the impact of caching and shrinking, and understand when alternative methods become preferable.
This analysis is essential for deciding whether SVMs are appropriate for a given problem scale and for estimating training times.
You now possess a comprehensive understanding of SMO convergence—when it's guaranteed, how fast it happens, when to stop, and how to diagnose problems. This knowledge enables you to confidently train SVMs and troubleshoot when things don't go as expected.