We've now covered all the major multi-class SVM approaches: One-vs-One, One-vs-All, DAG-SVM, and Crammer-Singer. Each has distinct theoretical properties, computational characteristics, and empirical behaviors.
But theory alone doesn't build production systems. The real questions practitioners face are practical ones: Which method should I use for my problem? How do I tune it efficiently? Will it scale to my data and latency budget? Which library should I pick, and how do I keep the deployed model healthy?
This page provides the practical wisdom needed to deploy multi-class SVMs successfully. We synthesize the module's insights into actionable guidance, drawing from both research findings and production experience.
By the end of this page, you will have a decision framework for method selection, comprehensive hyperparameter tuning strategies, understanding of computational tradeoffs at scale, library selection guidance, and awareness of common pitfalls that derail multi-class SVM deployments.
Selecting the right multi-class strategy depends on your specific constraints. Here's a systematic decision framework:
Primary Decision Factors:
| Scenario | Recommended Method | Rationale |
|---|---|---|
| K ≤ 10, accuracy critical | OvO with voting | Best empirical accuracy; overhead minimal |
| K ≤ 10, fast prediction needed | OvA | Simpler; K classifiers is fine |
| 10 < K ≤ 50, balanced needs | OvO with voting or DAG-SVM | Good accuracy; DAG if latency matters |
| 50 < K ≤ 100 | DAG-SVM or OvA | DAG for OvO quality with OvA speed |
| K > 100 | OvA or specialized methods | Quadratic growth makes OvO impractical |
| K > 1000 | Hierarchical or neural approaches | SVMs struggle at extreme scale |
| Imbalanced classes, any K | OvO (any variant) | Balanced pairwise training |
| Need calibrated probabilities | OvA with Platt scaling | Most natural probability framework |
| Theoretical guarantees matter | Crammer-Singer | Principled unified optimization |
When in doubt, start with OvO voting (the LIBSVM/sklearn default). It's robust, well-tested, and handles most scenarios well. Only switch if you identify specific bottlenecks (prediction latency, class imbalance, etc.).
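If the "Need calibrated probabilities" row in the table above applies to you, one common recipe (a minimal sketch, not the only option) is to wrap an OvA linear SVM in sklearn's CalibratedClassifierCV, which applies Platt-style sigmoid scaling per class. The data variables `X_train`, `y_train`, `X_test` are assumed to exist already:

```python
# Sketch: calibrated class probabilities from an OvA linear SVM (Platt scaling).
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

calibrated_svm = make_pipeline(
    StandardScaler(),
    CalibratedClassifierCV(LinearSVC(C=1.0, max_iter=10000),
                           method='sigmoid', cv=5)  # sigmoid = Platt scaling, fit via CV
)
calibrated_svm.fit(X_train, y_train)
proba = calibrated_svm.predict_proba(X_test)  # shape: (n_samples, K)
```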
Decision Flowchart:

```
Start
 └─ Is K > 100?
     ├─ Yes → Consider alternatives (hierarchical, neural)
     └─ No  → Is real-time prediction needed?
              ├─ Yes → Use DAG-SVM or OvA
              └─ No  → Is class imbalance severe?
                       ├─ Yes → Use OvO voting
                       └─ No  → Use OvO or OvA
```
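The same logic fits in a few lines of code. The helper below is a rough sketch of the flowchart; the thresholds and returned strings are illustrative, not canonical rules:

```python
def recommend_strategy(n_classes: int,
                       realtime_prediction: bool,
                       severe_imbalance: bool) -> str:
    """Rough encoding of the decision flowchart above (illustrative thresholds)."""
    if n_classes > 100:
        return "Consider hierarchical or neural alternatives"
    if realtime_prediction:
        return "DAG-SVM or OvA"
    if severe_imbalance:
        return "OvO with voting"
    return "OvO or OvA"

# Example: 25 classes, no hard latency budget, roughly balanced classes
print(recommend_strategy(25, realtime_prediction=False, severe_imbalance=False))
```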
Secondary Considerations: beyond the primary factors above, weigh memory budget (OvO stores K(K-1)/2 models versus K for OvA), whether you need calibrated probabilities (easiest with OvA plus Platt scaling), and how easily training can be parallelized across the independent binary problems.
Hyperparameter tuning is where multi-class SVMs are won or lost. The key parameters are:
1. Regularization Parameter (C)
The C parameter controls the tradeoff between margin maximization and training error minimization: small C favors a wider margin and stronger regularization (more training errors tolerated), while large C forces the model to fit the training data more closely and risks overfitting.
Typical search range: $C \in \{10^{-3}, 10^{-2}, 10^{-1}, 1, 10, 10^2, 10^3\}$
2. Kernel Parameters
For non-linear kernels, the kernel parameters matter as much as C. With the RBF kernel, $\gamma$ sets the kernel width: small $\gamma$ gives smooth, nearly linear decision boundaries, while large $\gamma$ gives highly localized boundaries that can overfit.
Typical RBF $\gamma$ search: $\gamma \in \{10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1, 10\}$
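As a concrete sketch, both grids can be generated with NumPy; the values match the ranges above:

```python
import numpy as np

C_grid = np.logspace(-3, 3, 7)      # 10^-3 ... 10^3
gamma_grid = np.logspace(-4, 1, 6)  # 10^-4 ... 10^1
print(C_grid)      # [1.e-03 1.e-02 1.e-01 1.e+00 1.e+01 1.e+02 1.e+03]
print(gamma_grid)  # [1.e-04 1.e-03 1.e-02 1.e-01 1.e+00 1.e+01]
```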
Grid search over C and γ at this granularity (roughly 6-7 values each, a common practice) means around 40-50 combinations. With 5-fold CV and K classes' worth of binary problems per fit, training time explodes. Use coarse-to-fine search or smarter methods.
Efficient Tuning Strategies:
1. Coarse-to-Fine Grid Search:
Start with a coarse grid (powers of 10), find the best region, then refine:
2. Heuristic Initialization:
Good starting points can be derived from the data itself: sklearn's default `gamma='scale'` (equal to $1 / (d \cdot \mathrm{Var}(X))$) is a sensible RBF starting value, C = 1 is a reasonable default, and the median heuristic (setting the kernel width to the median pairwise distance between training points) often lands close to a good $\gamma$. A short sketch of these heuristics appears after this list.
3. Random Search:
For large hyperparameter spaces, random search often finds good solutions faster than grid search (Bergstra & Bengio, 2012).
4. Bayesian Optimization:
Use tools like Optuna, sklearn-optimize, or Ray Tune for intelligent hyperparameter search that learns from previous trials.
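The data-driven starting points from point 2 above can be computed directly. This is a minimal sketch; the median heuristic shown is one common variant and the subsampling size is arbitrary:

```python
import numpy as np
from scipy.spatial.distance import pdist

def heuristic_init(X: np.ndarray) -> dict:
    """Data-driven starting values for C and the RBF gamma (illustrative)."""
    n_features = X.shape[1]
    # sklearn's gamma='scale': 1 / (n_features * Var(X))
    gamma_scale = 1.0 / (n_features * X.var())
    # Median heuristic: kernel width ~ median pairwise distance between points
    median_dist = np.median(pdist(X[:2000]))  # subsample rows to keep pdist cheap
    gamma_median = 1.0 / (2.0 * median_dist ** 2)
    return {'C': 1.0, 'gamma_scale': gamma_scale, 'gamma_median': gamma_median}
```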
```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import (
    GridSearchCV, RandomizedSearchCV, cross_val_score
)
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from scipy.stats import loguniform
import time


def tune_multiclass_svm_coarse_to_fine(X, y, cv=5):
    """
    Efficient coarse-to-fine hyperparameter tuning.
    """
    # Create pipeline with scaling
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('svm', SVC())
    ])

    # Phase 1: Coarse search
    print("Phase 1: Coarse grid search...")
    coarse_params = {
        'svm__C': [0.01, 0.1, 1, 10, 100],
        'svm__gamma': ['scale', 0.001, 0.01, 0.1, 1],
        'svm__kernel': ['rbf']
    }
    coarse_search = GridSearchCV(
        pipe, coarse_params, cv=cv, scoring='accuracy', n_jobs=-1
    )
    coarse_search.fit(X, y)

    best_C = coarse_search.best_params_['svm__C']
    best_gamma = coarse_search.best_params_['svm__gamma']
    print(f"  Best coarse: C={best_C}, gamma={best_gamma}, "
          f"score={coarse_search.best_score_:.4f}")

    # Phase 2: Fine search around best coarse values
    print("Phase 2: Fine grid search...")
    if best_gamma == 'scale':
        gamma_fine = ['scale', 0.5 * (1 / X.shape[1]), 2 * (1 / X.shape[1])]
    else:
        gamma_fine = [best_gamma * 0.5, best_gamma, best_gamma * 2]

    fine_params = {
        'svm__C': [best_C * 0.5, best_C, best_C * 2],
        'svm__gamma': gamma_fine,
        'svm__kernel': ['rbf']
    }
    fine_search = GridSearchCV(
        pipe, fine_params, cv=cv, scoring='accuracy', n_jobs=-1
    )
    fine_search.fit(X, y)

    print(f"  Best fine: {fine_search.best_params_}")
    print(f"  Final score: {fine_search.best_score_:.4f}")

    return fine_search.best_estimator_, fine_search.best_params_


def tune_with_random_search(X, y, n_iter=50, cv=5):
    """
    Random search with log-uniform distributions.
    Often faster than grid search for finding good solutions.
    """
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('svm', SVC())
    ])

    param_distributions = {
        'svm__C': loguniform(1e-3, 1e3),      # Log-uniform from 0.001 to 1000
        'svm__gamma': loguniform(1e-4, 1e1),  # Log-uniform from 0.0001 to 10
        'svm__kernel': ['rbf']
    }

    random_search = RandomizedSearchCV(
        pipe, param_distributions, n_iter=n_iter, cv=cv,
        scoring='accuracy', n_jobs=-1, random_state=42
    )

    start = time.time()
    random_search.fit(X, y)
    elapsed = time.time() - start

    print(f"Random search completed in {elapsed:.1f}s")
    print(f"Best params: {random_search.best_params_}")
    print(f"Best score: {random_search.best_score_:.4f}")

    return random_search.best_estimator_, random_search.best_params_


def tune_with_optuna(X, y, n_trials=100, cv=5):
    """
    Bayesian optimization with Optuna.
    Learns from previous trials to explore promising regions.
    """
    try:
        import optuna
    except ImportError:
        print("Optuna not installed. Run: pip install optuna")
        return None

    def objective(trial):
        # Log-uniform suggestions for C and gamma
        C = trial.suggest_float('C', 1e-3, 1e3, log=True)
        gamma = trial.suggest_float('gamma', 1e-4, 1e1, log=True)

        clf = Pipeline([
            ('scaler', StandardScaler()),
            ('svm', SVC(C=C, gamma=gamma, kernel='rbf'))
        ])
        scores = cross_val_score(clf, X, y, cv=cv, scoring='accuracy')
        return scores.mean()

    # Suppress Optuna logs for cleaner output
    optuna.logging.set_verbosity(optuna.logging.WARNING)

    study = optuna.create_study(direction='maximize')
    study.optimize(objective, n_trials=n_trials, show_progress_bar=True)

    print("Best trial:")
    print(f"  Value (accuracy): {study.best_trial.value:.4f}")
    print(f"  Params: {study.best_trial.params}")

    # Train final model with best params
    best_model = Pipeline([
        ('scaler', StandardScaler()),
        ('svm', SVC(**study.best_trial.params, kernel='rbf'))
    ])
    best_model.fit(X, y)

    return best_model, study.best_trial.params
```

Understanding computational costs is essential for planning resources and meeting latency requirements.
Training Complexity Summary:
| Method | Time Complexity | Space Complexity | Parallelizable? |
|---|---|---|---|
| OvA | $O(K \cdot n^2 d)$ | $O(K \cdot d)$ | Yes (K tasks) |
| OvO | $O(K^2 \cdot (n/K)^2 d)$ | $O(K^2 \cdot d)$ | Yes (K² tasks) |
| DAG-SVM | Same as OvO | Same as OvO | Yes |
| Crammer-Singer | $O(K \cdot n^2 d)$ | $O(K \cdot d)$ | Limited |
Note: SVM training complexity depends heavily on the solver. SMO is typically $O(n^2)$ to $O(n^3)$.
For linear SVMs (common with high-d sparse data), specialized solvers like LIBLINEAR achieve O(n·d) training complexity using coordinate descent. This dramatically changes the scaling picture.
Prediction Complexity:
| Method | Classifiers Evaluated | Total Operations | Memory Access |
|---|---|---|---|
| OvA | K | $O(K \cdot d)$ | K weight vectors |
| OvO Voting | K(K-1)/2 | $O(K^2 \cdot d)$ | K(K-1)/2 vectors |
| DAG-SVM | K-1 | $O(K \cdot d)$ | K-1 vectors |
| Crammer-Singer | 1 (K classes) | $O(K \cdot d)$ | K weight vectors |
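Because the number of classifiers evaluated differs per method, it is worth measuring end-to-end prediction latency directly rather than relying on the table above. A minimal timing sketch, assuming a fitted `model` and a held-out `X_test`:

```python
import time
import numpy as np

def measure_latency(model, X_test, n_repeats=10):
    """Rough per-sample prediction latency in milliseconds (illustrative)."""
    timings = []
    for _ in range(n_repeats):
        start = time.perf_counter()
        model.predict(X_test)
        timings.append(time.perf_counter() - start)
    per_sample_ms = 1000 * np.median(timings) / len(X_test)
    print(f"Median latency: {per_sample_ms:.3f} ms/sample")
    return per_sample_ms
```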
Practical Implications:
| Dataset Size | K Classes | Method | Linear (LIBLINEAR) | RBF Kernel |
|---|---|---|---|---|
| 10K samples | 10 | OvO | ~1 second | ~10 seconds |
| 10K samples | 100 | OvO | ~10 seconds | ~2 minutes |
| 100K samples | 10 | OvO | ~10 seconds | ~10 minutes |
| 100K samples | 100 | OvO | ~2 minutes | ~2 hours |
| 1M samples | 10 | OvO | ~2 minutes | ~hours (use approx) |
| 1M samples | 100 | OvO | ~30 minutes | Impractical |
Scaling Strategies:
1. Use Linear Kernels When Possible
For high-d sparse data (text, one-hot encoded features), linear kernels often perform comparably to RBF while being 10-100× faster.
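For example, a typical text-classification pipeline keeps everything sparse and linear. This is a sketch with illustrative parameters; `train_texts`, `train_labels`, and `test_texts` are placeholders for your data:

```python
# Sketch: high-dimensional sparse text features with a linear multi-class SVM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

text_clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),  # sparse high-d features
    LinearSVC(C=1.0)                                # OvA multi-class under the hood
)
text_clf.fit(train_texts, train_labels)
pred = text_clf.predict(test_texts)
```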
2. Subsample for Hyperparameter Tuning
Tune hyperparameters on 10-20% of data, then train final model on full data:
```python
from sklearn.model_selection import train_test_split

X_tune, _, y_tune, _ = train_test_split(X, y, train_size=0.1, stratify=y)
# Tune on X_tune, y_tune
# Train final model on X, y
```
3. Use Approximate Kernel Methods
For large n with RBF kernels, approximate the kernel with explicit features so a linear solver can be used. Random Fourier Features (sklearn's RBFSampler, shown below) or the Nyström approximation both map the data into a space where LinearSVC applies:
```python
from sklearn.kernel_approximation import RBFSampler

rbf_feature = RBFSampler(gamma=0.1, n_components=500)
X_approx = rbf_feature.fit_transform(X)
# Use LinearSVC on X_approx
```
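The Nyström approximation mentioned above lives in the same module. A sketch with illustrative parameters, again assuming `X_train` and `y_train` exist:

```python
from sklearn.kernel_approximation import Nystroem
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Approximate the RBF kernel with 500 landmark points, then train a linear SVM
approx_rbf_svm = make_pipeline(
    Nystroem(kernel='rbf', gamma=0.1, n_components=500, random_state=42),
    LinearSVC(C=1.0)
)
approx_rbf_svm.fit(X_train, y_train)
```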
Choosing the right library impacts development speed, production performance, and long-term maintenance. Here's a comprehensive guide:
| Library | Multi-class Default | Strengths | Best For |
|---|---|---|---|
| sklearn (SVC) | OvO | Ease of use, great API, good docs | Prototyping, small-medium data |
| sklearn (LinearSVC) | OvA | Fast linear training (LIBLINEAR) | High-d sparse data, large n |
| LIBSVM | OvO | Reference implementation, well-tested | Research, custom integrations |
| LIBLINEAR | OvA | Very fast linear, sparse support | Text classification at scale |
| ThunderSVM | OvO | GPU acceleration | Large kernel SVM problems |
| cuML (RAPIDS) | OvO | GPU, sklearn-compatible API | GPU clusters, large scale |
| Vowpal Wabbit | Custom | Online learning, extreme scale | Billions of examples |
Start with sklearn's SVC (for kernel SVMs) or LinearSVC (for linear). They handle most cases well. Only switch for specific needs: GPU acceleration (ThunderSVM, cuML), extreme scale (Vowpal Wabbit), or custom requirements (LIBSVM/LIBLINEAR directly).
```python
# ============================================================
# Multi-class SVM with Different Libraries - Examples
# ============================================================

# 1. sklearn SVC (default OvO, kernel SVMs)
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Basic usage - OvO is automatic
model_svc = make_pipeline(
    StandardScaler(),
    SVC(C=1.0, kernel='rbf', decision_function_shape='ovr')  # Note: 'ovr' affects score shape only
)
model_svc.fit(X_train, y_train)
predictions = model_svc.predict(X_test)

# 2. sklearn LinearSVC (OvA, fast linear)
from sklearn.svm import LinearSVC

# OvA is automatic for LinearSVC
model_linear = make_pipeline(
    StandardScaler(),
    LinearSVC(C=1.0, dual=False, max_iter=10000)  # dual=False faster for n > n_features
)
model_linear.fit(X_train, y_train)

# 3. Explicit multi-class with custom strategy
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Force OvO even with LinearSVC
model_linear_ovo = OneVsOneClassifier(
    LinearSVC(C=1.0)
)

# Force OvA with SVC
model_svc_ova = OneVsRestClassifier(
    SVC(C=1.0, kernel='rbf')
)

# 4. GPU-accelerated with ThunderSVM
def train_with_thundersvm(X, y):
    """
    GPU-accelerated SVM training.
    Requires: pip install thundersvm-cpu (or thundersvm for GPU)
    """
    try:
        from thundersvm import SVC as ThunderSVC
        model = ThunderSVC(
            kernel='rbf',
            C=1.0,
            gamma='auto'
        )
        model.fit(X, y)
        return model
    except ImportError:
        print("ThunderSVM not installed")
        return None

# 5. RAPIDS cuML for GPU clusters
def train_with_cuml(X, y):
    """
    GPU-accelerated with NVIDIA RAPIDS.
    Requires: conda install -c rapidsai cuml
    """
    try:
        from cuml.svm import SVC as cuSVC
        import cudf

        # Convert to GPU arrays
        X_gpu = cudf.DataFrame(X)
        y_gpu = cudf.Series(y)

        model = cuSVC(
            kernel='rbf',
            C=1.0,
            gamma='auto'
        )
        model.fit(X_gpu, y_gpu)
        return model
    except ImportError:
        print("cuML not installed")
        return None

# 6. Comparing libraries on same data
def benchmark_libraries(X, y, test_size=0.2):
    """
    Benchmark different SVM libraries.
    """
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    import time

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42, stratify=y
    )

    # Standardize
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    results = {}

    # sklearn SVC
    start = time.time()
    svc = SVC(C=1.0, kernel='rbf')
    svc.fit(X_train_scaled, y_train)
    pred = svc.predict(X_test_scaled)
    results['sklearn_SVC'] = {
        'time': time.time() - start,
        'accuracy': accuracy_score(y_test, pred)
    }

    # sklearn LinearSVC
    start = time.time()
    linear = LinearSVC(C=1.0, max_iter=10000)
    linear.fit(X_train_scaled, y_train)
    pred = linear.predict(X_test_scaled)
    results['sklearn_LinearSVC'] = {
        'time': time.time() - start,
        'accuracy': accuracy_score(y_test, pred)
    }

    print("Library Benchmark Results:")
    print("-" * 50)
    for lib, res in results.items():
        print(f"{lib:20} | Time: {res['time']:6.2f}s | Acc: {res['accuracy']:.4f}")

    return results
```

Experienced practitioners know that multi-class SVMs have several recurring failure modes: unscaled features, ignored class imbalance, untuned hyperparameters, data leakage from preprocessing outside a pipeline, and unmeasured prediction latency. The checklist below summarizes the fixes.
Before deploying: (1) Features scaled? (2) Class weights set? (3) Hyperparameters tuned? (4) Using a pipeline? (5) Prediction latency measured? If all yes, you've avoided the most common mistakes.
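For item (2) on the checklist, the simplest fix for class imbalance is sklearn's built-in class weighting, which rescales C per class. A minimal sketch (the explicit weights in the last line are illustrative):

```python
from sklearn.svm import SVC, LinearSVC

# Reweight C per class, inversely proportional to class frequency
svc_balanced = SVC(C=1.0, kernel='rbf', class_weight='balanced')
linear_balanced = LinearSVC(C=1.0, class_weight='balanced', max_iter=10000)

# Or specify explicit weights, e.g. up-weight a rare class labeled 3
svc_custom = SVC(C=1.0, class_weight={0: 1.0, 1: 1.0, 2: 1.0, 3: 5.0})
```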
Deploying multi-class SVMs in production requires attention to serialization, monitoring, and maintenance. Here are battle-tested patterns:
1. Model Serialization:
Use joblib (not pickle) for sklearn models; it is optimized for numpy arrays:

```python
import joblib

# Save
joblib.dump(model, 'multiclass_svm_v1.joblib')

# Load
model = joblib.load('multiclass_svm_v1.joblib')
```
2. Version Your Models:
Always version models with their training metadata:
```python
import json
from datetime import datetime

metadata = {
    'version': '1.2.0',
    'trained_at': datetime.now().isoformat(),
    'n_classes': len(model.classes_),
    'training_samples': len(X_train),
    'cv_accuracy': cv_accuracy,
    'hyperparameters': best_params
}
with open('model_metadata.json', 'w') as f:
    json.dump(metadata, f)
```
3. Input Validation:
Always validate inputs before prediction:
```python
import numpy as np

def predict_safe(model, scaler, X, expected_features):
    # Validate shape
    if X.ndim == 1:
        X = X.reshape(1, -1)
    assert X.shape[1] == expected_features, f"Expected {expected_features} features"
    # Validate values
    assert not np.isnan(X).any(), "NaN values detected"
    # Scale and predict
    X_scaled = scaler.transform(X)
    return model.predict(X_scaled)
```
```python
import numpy as np
import joblib
import json
from datetime import datetime
from pathlib import Path
from typing import Tuple, Dict, Any
import logging

logger = logging.getLogger(__name__)


class ProductionMulticlassSVM:
    """
    Production-ready wrapper for multi-class SVM models.

    Handles:
    - Serialization with metadata
    - Input validation
    - Monitoring hooks
    - A/B testing support
    """

    def __init__(self, model, scaler, metadata: Dict[str, Any]):
        self.model = model
        self.scaler = scaler
        self.metadata = metadata
        self.prediction_count = 0
        self.error_count = 0

    @classmethod
    def train_and_save(
        cls,
        X: np.ndarray,
        y: np.ndarray,
        model_params: Dict,
        save_dir: str,
        version: str
    ) -> 'ProductionMulticlassSVM':
        """
        Train, evaluate, and save a production-ready model.
        """
        from sklearn.svm import SVC
        from sklearn.preprocessing import StandardScaler
        from sklearn.model_selection import cross_val_score

        save_path = Path(save_dir)
        save_path.mkdir(parents=True, exist_ok=True)

        # Scale features
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X)

        # Train model
        logger.info(f"Training SVM with params: {model_params}")
        model = SVC(**model_params)
        model.fit(X_scaled, y)

        # Evaluate
        cv_scores = cross_val_score(model, X_scaled, y, cv=5)

        # Build metadata
        metadata = {
            'version': version,
            'trained_at': datetime.now().isoformat(),
            'n_classes': len(np.unique(y)),
            'classes': list(model.classes_),
            'n_features': X.shape[1],
            'n_training_samples': len(y),
            'cv_accuracy_mean': float(cv_scores.mean()),
            'cv_accuracy_std': float(cv_scores.std()),
            'hyperparameters': model_params,
            'n_support_vectors': int(sum(model.n_support_)),
        }

        # Save everything
        joblib.dump(model, save_path / f'model_{version}.joblib')
        joblib.dump(scaler, save_path / f'scaler_{version}.joblib')
        with open(save_path / f'metadata_{version}.json', 'w') as f:
            json.dump(metadata, f, indent=2)

        logger.info(f"Model saved to {save_path}")
        return cls(model, scaler, metadata)

    @classmethod
    def load(cls, save_dir: str, version: str) -> 'ProductionMulticlassSVM':
        """
        Load a saved production model.
        """
        save_path = Path(save_dir)

        model = joblib.load(save_path / f'model_{version}.joblib')
        scaler = joblib.load(save_path / f'scaler_{version}.joblib')
        with open(save_path / f'metadata_{version}.json', 'r') as f:
            metadata = json.load(f)

        logger.info(f"Loaded model version {version} (trained {metadata['trained_at']})")
        return cls(model, scaler, metadata)

    def predict(self, X: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
        """
        Make predictions with input validation.

        Returns:
        --------
        predictions : array
            Class predictions
        confidence : array
            Decision function scores for predicted class
        """
        self.prediction_count += 1

        # Input validation
        X = self._validate_input(X)

        # Scale
        X_scaled = self.scaler.transform(X)

        # Predict
        predictions = self.model.predict(X_scaled)

        # Get confidence scores
        if hasattr(self.model, 'decision_function'):
            scores = self.model.decision_function(X_scaled)
            if scores.ndim == 1:  # Binary case
                confidence = np.abs(scores)
            else:
                # Take score for predicted class
                confidence = scores.max(axis=1)
        else:
            confidence = np.ones(len(predictions))  # No confidence available

        return predictions, confidence

    def _validate_input(self, X: np.ndarray) -> np.ndarray:
        """Validate and reshape input."""
        # Ensure 2D
        if X.ndim == 1:
            X = X.reshape(1, -1)

        # Check features
        expected = self.metadata['n_features']
        if X.shape[1] != expected:
            self.error_count += 1
            raise ValueError(f"Expected {expected} features, got {X.shape[1]}")

        # Check for NaN
        if np.isnan(X).any():
            self.error_count += 1
            raise ValueError("Input contains NaN values")

        return X

    def get_health_metrics(self) -> Dict[str, Any]:
        """Return health metrics for monitoring."""
        return {
            'model_version': self.metadata['version'],
            'prediction_count': self.prediction_count,
            'error_count': self.error_count,
            'error_rate': self.error_count / max(1, self.prediction_count),
            'training_accuracy': self.metadata['cv_accuracy_mean'],
        }
```

Monitoring Multi-class SVM in Production:
Deployed models require ongoing monitoring to detect drift, degradation, and anomalies.
Key Metrics to Track: live accuracy where ground-truth labels become available, the distribution of decision-function confidence scores, the predicted class distribution, feature value ranges, prediction latency, and the input-validation error rate.
Handling Data Drift:
Data drift occurs when the input distribution changes from training time. For multi-class SVMs, it typically shows up as shrinking decision-function margins, a shifting predicted-class distribution, or feature values outside the ranges seen during training; a simple check is sketched below.
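One lightweight drift check (a sketch, not a full monitoring system) compares the recent predicted-class distribution and mean decision-function margin against baselines recorded at training time. It assumes integer class labels 0..K-1 and precomputed `train_class_freq` and `margin_baseline` values:

```python
import numpy as np

def drift_report(model, X_recent, train_class_freq, margin_baseline):
    """Compare recent predictions against training-time baselines (illustrative)."""
    preds = model.predict(X_recent)
    scores = model.decision_function(X_recent)
    margins = scores if scores.ndim == 1 else scores.max(axis=1)

    # Total-variation distance between recent and training class frequencies
    recent_freq = np.bincount(preds, minlength=len(train_class_freq)) / len(preds)
    freq_shift = np.abs(recent_freq - train_class_freq).sum() / 2
    margin_drop = margin_baseline - margins.mean()

    return {
        'class_distribution_shift': float(freq_shift),
        'mean_margin_drop': float(margin_drop),
    }
```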
When to Retrain:
| Signal | Action |
|---|---|
| Accuracy drops below threshold | Immediate retrain |
| Confidence scores declining | Investigate, likely retrain |
| Class distribution shifts | Validate labels, consider retrain |
| New classes emerge | Retrain required (add to label set) |
| Feature values out of range | Check data pipeline, possibly retrain |
| Periodic schedule | Retrain regardless (prevents silent degradation) |
Before replacing a production model, run the new model in 'shadow mode': it receives the same inputs as the production model but its predictions aren't served. Compare shadow predictions to production predictions to catch regressions before they impact users.
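In its simplest form, shadow-mode comparison is just running both models on the same inputs and tracking how often they agree. A minimal sketch of that comparison step:

```python
import numpy as np

def shadow_compare(production_model, shadow_model, X_batch):
    """Agreement rate between production and shadow predictions (illustrative)."""
    prod_pred = production_model.predict(X_batch)
    shadow_pred = shadow_model.predict(X_batch)
    agreement = float(np.mean(prod_pred == shadow_pred))
    disagreements = np.where(prod_pred != shadow_pred)[0]  # indices to inspect manually
    return {'agreement_rate': agreement, 'n_disagreements': len(disagreements)}
```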
We've now covered comprehensive practical guidance for deploying multi-class SVMs. Combined with the theoretical foundations from previous pages, you have the complete toolkit for multi-class SVM success.
Congratulations! You've mastered Multi-class SVM—from theoretical foundations (OvO, OvA, DAG-SVM, Crammer-Singer) to production deployment. You can now confidently select, train, tune, and deploy multi-class classification systems using Support Vector Machines.
Module Recap: we compared One-vs-One, One-vs-All, DAG-SVM, and Crammer-Singer; built a decision framework for choosing among them; covered hyperparameter tuning, computational scaling, and library selection; and closed with production deployment, monitoring, and retraining practices.
With this knowledge, you're equipped to tackle any multi-class classification challenge where SVMs are applicable. The next chapter will explore ensemble methods that can further boost performance beyond single classifiers.