All decomposition methods (OvO, OvA, DAG) share a fundamental limitation: they solve surrogate problems. None directly optimizes what we truly care about—the multi-class classification margin.
What if we could formulate a single optimization problem that simultaneously considers all K classes, directly maximizing the margin between the correct class and all incorrect classes?
This is precisely what the Crammer-Singer formulation achieves. Introduced by Koby Crammer and Yoram Singer in 2001, it provides the most theoretically principled approach to multi-class SVMs. The formulation elegantly extends the binary maximum margin principle to K classes, yielding a single, coherent optimization problem with provable generalization bounds.
By the end of this page, you will understand the Crammer-Singer primal and dual formulations, the multi-class margin definition, the constrained optimization problem, solution algorithms including online and batch methods, generalization theory, and when this approach outperforms decomposition methods.
Before formulating the optimization, we must generalize the concept of "margin" from binary to multi-class classification.
Binary Margin Recap:
For binary SVM with labels $y \in \{-1, +1\}$, the margin of a point $(\mathbf{x}, y)$ is: $$\gamma = y \cdot (\mathbf{w}^\top \mathbf{x} + b)$$
Correct classification requires $\gamma > 0$; larger margins indicate more confident, robust predictions.
Multi-class Margin Definition:
For K classes, we maintain K weight vectors $\mathbf{w}_1, \ldots, \mathbf{w}_K$ (one per class). The prediction for input $\mathbf{x}$ is: $$\hat{y} = \arg\max_{k \in \{1, \ldots, K\}} \mathbf{w}_k^\top \mathbf{x}$$
(We absorb the bias into $\mathbf{w}$ by appending 1 to $\mathbf{x}$.)
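For instance, a minimal sketch of this bias-absorption trick in NumPy (the helper name `add_bias_column` is ours, purely for illustration):

```python
import numpy as np

def add_bias_column(X: np.ndarray) -> np.ndarray:
    """Append a constant 1 feature so that w_k^T x also carries the bias term."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

X = np.random.randn(5, 3)        # 5 samples, 3 features
print(add_bias_column(X).shape)  # (5, 4): the extra column plays the role of b_k
```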
The multi-class margin for a point $(\mathbf{x}, y)$ is the gap between the correct class score and the highest incorrect class score: $$\gamma = \mathbf{w}_y^\top \mathbf{x} - \max_{k \neq y} \mathbf{w}_k^\top \mathbf{x}$$
A positive margin γ > 0 means the correct class has the highest score—correct classification. A larger margin means the correct class dominates all others more decisively. A negative margin indicates misclassification.
Margin Properties:
Correct classification: $\gamma > 0 \Leftrightarrow$ prediction is correct
Confidence measure: $\gamma$ quantifies how much "room" there is before the prediction would change
Robustness: Larger $\gamma$ implies the prediction is robust to small perturbations in $\mathbf{x}$ or $\mathbf{w}$
Example (K=3 classes):
For input $\mathbf{x}$ with true label $y = 2$, suppose the class scores are $\mathbf{w}_1^\top \mathbf{x} = 0.5$, $\mathbf{w}_2^\top \mathbf{x} = 2.1$, and $\mathbf{w}_3^\top \mathbf{x} = 1.3$.
Margin: $\gamma = 2.1 - \max(0.5, 1.3) = 2.1 - 1.3 = 0.8$
The correct class (2) beats the runner-up (class 3) by 0.8. This is our margin to maximize.
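As a quick check, a minimal NumPy sketch of this margin computation using the example scores above (0-based class indices, so true class 2 has index 1):

```python
import numpy as np

scores = np.array([0.5, 2.1, 1.3])   # w_k^T x for classes 1, 2, 3
y = 1                                # true class 2 (0-based index)

competitors = np.delete(scores, y)   # scores of the incorrect classes
margin = scores[y] - competitors.max()
print(f"{margin:.1f}")               # 0.8, matching the calculation above
```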
The All-vs-All Perspective:
Crammer-Singer requires the correct class to beat every other class by at least margin 1: $$\mathbf{w}_y^\top \mathbf{x} - \mathbf{w}_k^\top \mathbf{x} \geq 1, \quad \forall k \neq y$$
This is equivalent to: $$\mathbf{w}_y^\top \mathbf{x} \geq \mathbf{w}_k^\top \mathbf{x} + 1, \quad \forall k \neq y$$
The correct class score must exceed every incorrect class score by at least 1. This is a single constraint per incorrect class, yielding K-1 constraints per training example.
Comparison to Binary SVM:
| Aspect | Binary SVM | Crammer-Singer |
|---|---|---|
| Weight vectors | 1 ($\mathbf{w}$) | K ($\mathbf{w}_1, \ldots, \mathbf{w}_K$) |
| Constraints per point | 1 | K-1 |
| Margin meaning | Signed distance to hyperplane | Gap to nearest competitor |
| Correct classification | $y \cdot f(\mathbf{x}) > 0$ | $\mathbf{w}_y^\top \mathbf{x} > \max_{k \neq y} \mathbf{w}_k^\top \mathbf{x}$ |
The Crammer-Singer primal formulation extends the binary soft-margin SVM to K classes.
The Crammer-Singer Primal Problem:
$$\min_{\mathbf{w}_1, \ldots, \mathbf{w}_K, \boldsymbol{\xi}} \frac{1}{2} \sum_{k=1}^{K} \|\mathbf{w}_k\|^2 + C \sum_{i=1}^{n} \xi_i$$
subject to: $$\mathbf{w}_{y_i}^\top \mathbf{x}_i - \mathbf{w}_k^\top \mathbf{x}_i \geq 1 - \delta_{y_i, k} - \xi_i, \quad \forall i \in [n], \forall k \in [K]$$ $$\xi_i \geq 0, \quad \forall i \in [n]$$
where $\delta_{y_i, k}$ is the Kronecker delta (1 if $y_i = k$, else 0).
The term (1 - δ) ensures that when k = y_i (correct class), the constraint becomes 0 ≥ -ξ_i, which is always satisfied for ξ_i ≥ 0. For k ≠ y_i, the constraint enforces the margin requirement with slack ξ_i.
Unpacking the Formulation:
Objective Function:
$$\frac{1}{2} \sum_{k=1}^{K} \|\mathbf{w}_k\|^2 + C \sum_{i=1}^{n} \xi_i$$
The first term jointly regularizes all K weight vectors (encouraging a large multi-class margin); the second term penalizes margin violations, with $C$ controlling the tradeoff.
Constraints:
For each training point $i$ and each class $k \neq y_i$: $$\mathbf{w}_{y_i}^\top \mathbf{x}_i - \mathbf{w}_k^\top \mathbf{x}_i \geq 1 - \xi_i$$
This says: the correct class score must beat class $k$'s score by at least $1 - \xi_i$.
Special Properties:
Single slack per point: Unlike formulations with $\xi_{ik}$ for each (point, class) pair, Crammer-Singer uses one $\xi_i$ per point. This $\xi_i$ absorbs the maximum violation across all incorrect classes.
Binding constraint determines slack: $\xi_i = \left[ \max_{k \neq y_i} \left( 1 - (\mathbf{w}_{y_i}^\top \mathbf{x}_i - \mathbf{w}_k^\top \mathbf{x}_i) \right) \right]_+$ where $[z]_+ = \max(0, z)$.
Total constraints: $n \cdot K$ constraints in all (the one for each point's correct class is trivially satisfied).
Alternative Formulation (Equivalent):
We can rewrite the constraints more symmetrically using indicator notation. Let $e_k \in \mathbb{R}^K$ be the one-hot vector for class $k$. Define the combined weight matrix $\mathbf{W} \in \mathbb{R}^{d \times K}$ with columns $\mathbf{w}_1, \ldots, \mathbf{w}_K$.
The constraint becomes: $$\mathbf{W}^\top \mathbf{x}_i \cdot (e_{y_i} - e_k) \geq 1 - \delta_{y_i, k} - \xi_i$$
In matrix form, this is elegant but the practical optimization uses the explicit constraints.
Relation to Multi-class Hinge Loss:
The Crammer-Singer formulation is equivalent to minimizing the multi-class hinge loss: $$L_{CS}(\mathbf{x}, y) = \max_{k \neq y} \left( 1 + \mathbf{w}_k^\top \mathbf{x} - \mathbf{w}_y^\top \mathbf{x} \right)_+$$
This is the largest violation among all incorrect classes, matching our single-slack-per-point formulation.
```python
import numpy as np


def crammer_singer_loss(W: np.ndarray, X: np.ndarray, y: np.ndarray) -> float:
    """
    Compute Crammer-Singer multi-class hinge loss.

    Parameters:
    -----------
    W : array of shape (n_features, n_classes)
        Weight matrix with columns as class weight vectors
    X : array of shape (n_samples, n_features)
        Input data
    y : array of shape (n_samples,)
        True labels (0-indexed)

    Returns:
    --------
    loss : float
        Average Crammer-Singer loss
    """
    n_samples, n_features = X.shape
    n_classes = W.shape[1]

    # Compute all scores: (n_samples, n_classes)
    scores = X @ W

    total_loss = 0.0
    for i in range(n_samples):
        correct_class = y[i]
        correct_score = scores[i, correct_class]

        # Find maximum margin violation among incorrect classes
        max_violation = 0.0
        for k in range(n_classes):
            if k != correct_class:
                # Margin violation: 1 + score_k - score_correct
                violation = 1 + scores[i, k] - correct_score
                max_violation = max(max_violation, violation)

        # Hinge: take max with 0
        total_loss += max(0, max_violation)

    return total_loss / n_samples


def crammer_singer_loss_vectorized(
    W: np.ndarray, X: np.ndarray, y: np.ndarray
) -> float:
    """
    Vectorized Crammer-Singer loss computation.
    """
    scores = X @ W  # (n_samples, n_classes)
    n_samples = X.shape[0]

    # Get correct class scores
    correct_scores = scores[np.arange(n_samples), y]  # (n_samples,)

    # Compute margins: 1 + score_k - score_correct for all k
    margins = 1 + scores - correct_scores[:, np.newaxis]  # (n_samples, n_classes)

    # Zero out the correct class margins (we want max over k != y)
    margins[np.arange(n_samples), y] = 0

    # Max margin violation per sample
    max_violations = np.max(margins, axis=1)

    # Hinge: max(0, max_violation)
    losses = np.maximum(0, max_violations)

    return np.mean(losses)


def crammer_singer_gradient(
    W: np.ndarray, X: np.ndarray, y: np.ndarray
) -> np.ndarray:
    """
    Compute gradient of Crammer-Singer loss w.r.t. W.

    Returns:
    --------
    gradient : array of shape (n_features, n_classes)
    """
    scores = X @ W
    n_samples, n_features = X.shape
    n_classes = W.shape[1]

    # Compute margins
    correct_scores = scores[np.arange(n_samples), y]
    margins = 1 + scores - correct_scores[:, np.newaxis]
    margins[np.arange(n_samples), y] = -np.inf  # Exclude correct class

    # Find the argmax class (the violator)
    violating_class = np.argmax(margins, axis=1)  # (n_samples,)
    max_margins = margins[np.arange(n_samples), violating_class]

    # Gradient contribution only where loss > 0
    active = max_margins > 0  # (n_samples,)

    gradient = np.zeros_like(W)
    for i in range(n_samples):
        if active[i]:
            k = violating_class[i]
            yi = y[i]
            # Gradient: increase w_k, decrease w_yi
            gradient[:, k] += X[i] / n_samples
            gradient[:, yi] -= X[i] / n_samples

    return gradient
```

Like binary SVM, Crammer-Singer has a dual formulation that enables kernelization and provides computational advantages for certain problem sizes.
Deriving the Dual:
Introduce Lagrange multipliers $\alpha_{ik} \geq 0$ for each constraint (training point $i$, class $k$). The Lagrangian is:
$$L = \frac{1}{2} \sum_{k=1}^{K} \|\mathbf{w}_k\|^2 + C \sum_{i=1}^{n} \xi_i - \sum_{i,k} \alpha_{ik} \left[ \mathbf{w}_{y_i}^\top \mathbf{x}_i - \mathbf{w}_k^\top \mathbf{x}_i - 1 + \delta_{y_i,k} + \xi_i \right]$$
KKT Conditions:
Taking derivatives and setting to zero:
$$\frac{\partial L}{\partial \mathbf{w}_k} = 0 \Rightarrow \mathbf{w}_k = \sum_{i: y_i = k} \left( \sum_{j \neq k} \alpha_{ij} \right) \mathbf{x}_i - \sum_{i: y_i \neq k} \alpha_{ik} \mathbf{x}_i$$
$$\frac{\partial L}{\partial \xi_i} = 0 \Rightarrow C = \sum_{k} \alpha_{ik}$$
Each α_ik measures how much the constraint for point i against class k is 'active'. Points with non-zero α are support vectors. The constraint Σ_k α_ik = C ensures the single-slack-per-point structure.
The Crammer-Singer Dual Problem:
After substitution and simplification, the dual becomes:
$$\max_{\boldsymbol{\alpha}} \sum_{i,k} \alpha_{ik} (1 - \delta_{y_i, k}) - \frac{1}{2} \sum_{k=1}^{K} \left\| \sum_i b_{ik} \mathbf{x}_i \right\|^2$$
where $b_{ik} = \alpha_{ik} - \delta_{y_i, k} \sum_j \alpha_{ij}$
subject to: $$\alpha_{ik} \geq 0, \quad \forall i, k$$ $$\sum_{k} \alpha_{ik} = C, \quad \forall i$$ $$\alpha_{i, y_i} = 0, \quad \forall i$$
Key Properties of the Dual:
Dimension: $(n \times K)$ variables, but each row sums to $C$ and one entry is 0, so effectively $n(K-1)$ free variables.
Sparsity: Most $\alpha_{ik}$ are zero; only support vectors have non-zero values.
Kernelization: The dual depends on $\mathbf{x}_i$ only through inner products, enabling kernel extension: $$\left\| \sum_i b_{ik} \mathbf{x}_i \right\|^2 = \sum_{i,j} b_{ik} b_{jk} \langle \mathbf{x}_i, \mathbf{x}_j \rangle = \sum_{i,j} b_{ik} b_{jk} K(\mathbf{x}_i, \mathbf{x}_j)$$
```python
import numpy as np
from scipy.optimize import minimize


class CrammerSingerDual:
    """
    Crammer-Singer SVM solved via dual formulation.
    Supports kernel functions for non-linear classification.
    """

    def __init__(self, C: float = 1.0, kernel: str = 'linear', gamma: float = 1.0):
        """
        Parameters:
        -----------
        C : float
            Regularization parameter
        kernel : str
            'linear' or 'rbf'
        gamma : float
            RBF kernel parameter (ignored for linear)
        """
        self.C = C
        self.kernel = kernel
        self.gamma = gamma
        self.alpha = None
        self.X_train = None
        self.y_train = None
        self.classes_ = None

    def _compute_kernel(self, X1: np.ndarray, X2: np.ndarray) -> np.ndarray:
        """Compute kernel matrix."""
        if self.kernel == 'linear':
            return X1 @ X2.T
        elif self.kernel == 'rbf':
            # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x·y
            sq_X1 = np.sum(X1 ** 2, axis=1, keepdims=True)
            sq_X2 = np.sum(X2 ** 2, axis=1, keepdims=True)
            dist_sq = sq_X1 + sq_X2.T - 2 * (X1 @ X2.T)
            return np.exp(-self.gamma * dist_sq)
        else:
            raise ValueError(f"Unknown kernel: {self.kernel}")

    def fit(self, X: np.ndarray, y: np.ndarray) -> 'CrammerSingerDual':
        """
        Fit the Crammer-Singer SVM.
        """
        self.X_train = X
        self.y_train = y
        self.classes_ = np.unique(y)
        self.n_classes = len(self.classes_)
        n_samples, n_features = X.shape
        K = self.n_classes

        # Precompute kernel matrix
        self.K_matrix = self._compute_kernel(X, X)

        # Initialize alpha: shape (n_samples, n_classes)
        # Each row sums to C, entry at y_i is 0
        alpha_init = np.zeros((n_samples, K))
        for i in range(n_samples):
            # Distribute C evenly among incorrect classes
            for k in range(K):
                if k != y[i]:
                    alpha_init[i, k] = self.C / (K - 1)

        # Flatten for optimizer
        alpha_flat = alpha_init.flatten()

        # Optimization with constraints
        def objective(alpha_flat):
            alpha = alpha_flat.reshape((n_samples, K))

            # First term: sum of alpha * (1 - delta)
            term1 = 0
            for i in range(n_samples):
                for k in range(K):
                    if k != y[i]:
                        term1 += alpha[i, k]

            # Second term: quadratic
            term2 = 0
            for k in range(K):
                # Compute b_ik for this class
                b_k = np.zeros(n_samples)
                for i in range(n_samples):
                    sum_alpha_i = np.sum(alpha[i])
                    b_k[i] = alpha[i, k] - (1 if y[i] == k else 0) * sum_alpha_i
                # ||sum_i b_ik x_i||^2 = b_k^T K b_k
                term2 += b_k @ self.K_matrix @ b_k

            # Dual maximization = minimize negative
            return -(term1 - 0.5 * term2)

        # This is a simplified version; production code uses specialized solvers
        print("Solving Crammer-Singer dual optimization...")
        result = minimize(
            objective,
            alpha_flat,
            method='L-BFGS-B',
            bounds=[(0, self.C) for _ in range(n_samples * K)],
            options={'maxiter': 1000}
        )

        self.alpha = result.x.reshape((n_samples, K))
        return self

    def predict(self, X: np.ndarray) -> np.ndarray:
        """
        Predict class labels.
        """
        # Compute kernel between test and training points
        K_test = self._compute_kernel(X, self.X_train)
        n_samples = X.shape[0]
        n_train = self.X_train.shape[0]
        K = self.n_classes

        # Compute w_k^T x for each class via kernel expansion
        scores = np.zeros((n_samples, K))
        for k in range(K):
            # b_ik for training points
            b_k = np.zeros(n_train)
            for i in range(n_train):
                sum_alpha_i = np.sum(self.alpha[i])
                b_k[i] = self.alpha[i, k] - (1 if self.y_train[i] == k else 0) * sum_alpha_i
            # Score for class k: sum_i b_ik K(x, x_i)
            scores[:, k] = K_test @ b_k

        return self.classes_[np.argmax(scores, axis=1)]
```

Solving the Crammer-Singer optimization is computationally challenging due to the coupled structure (single slack per point implies constraints interact). Several specialized algorithms have been developed.
1. Sequential Minimal Optimization (SMO) Extension
The SMO algorithm, which efficiently solves binary SVMs by optimizing two variables at a time, can be extended to Crammer-Singer. The key insight is that due to the $\sum_k \alpha_{ik} = C$ constraint, we must optimize at least two $\alpha_{ik}$ values simultaneously for each point $i$.
Modified SMO for Crammer-Singer: because each row of $\boldsymbol{\alpha}$ must sum to $C$, the working set always contains at least two dual variables from the same training point, and each step re-optimizes those variables jointly while the rest stay fixed. This approach, used in LIBLINEAR, achieves near-optimal efficiency for medium-sized problems.
Efficient working set selection is crucial for SMO-like algorithms. Heuristics like Maximum Violating Pair (selecting the two classes with largest KKT violation) significantly accelerate convergence.
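As a rough sketch of the idea (not LIBLINEAR's exact rule), the snippet below selects a violating pair among one point's dual variables: the class whose margin constraint is most violated is paired with a class whose $\alpha_{ik}$ can still be decreased. The violation measure `g` and the helper name are our own simplifications, chosen to match the primal constraints on this page.

```python
import numpy as np

def select_violating_pair(scores: np.ndarray, alpha_i: np.ndarray, y_i: int):
    """
    Rough maximum-violating-pair heuristic for one training point.

    scores  : w_k^T x_i for every class k, shape (K,)
    alpha_i : current dual variables alpha_{ik} for this point, shape (K,)
    y_i     : true class index
    """
    # Violation proxy per class: positive when the margin constraint is violated
    g = 1.0 - (scores[y_i] - scores)
    g[y_i] = -np.inf                 # never select the correct class to "increase"

    # Class whose constraint is most violated: candidate to increase alpha
    k_up = int(np.argmax(g))

    # Among classes whose alpha can still decrease, pick the least violated one
    g_down = np.where(alpha_i > 0, g, np.inf)
    g_down[y_i] = np.inf
    k_down = int(np.argmin(g_down))

    return k_up, k_down

# Tiny usage example with made-up numbers (4 classes, true class index 1)
scores = np.array([0.2, 1.5, 0.9, 0.4])
alpha_i = np.array([0.0, 0.0, 0.3, 0.7])
print(select_violating_pair(scores, alpha_i, y_i=1))  # (2, 3)
```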
2. Stochastic Gradient Descent (SGD)
For large-scale problems, SGD on the primal is often most practical:
Update Rule:
For each randomly sampled $(\mathbf{x}_i, y_i)$:
Find the violating class: $k^* = \arg\max_{k \neq y_i} (1 + \mathbf{w}_k^\top \mathbf{x}_i - \mathbf{w}_{y_i}^\top \mathbf{x}_i)$
If the margin is violated ($1 + \mathbf{w}_{k^*}^\top \mathbf{x}_i - \mathbf{w}_{y_i}^\top \mathbf{x}_i > 0$), update the violated pair: $\mathbf{w}_{y_i} \leftarrow \mathbf{w}_{y_i} + \eta \mathbf{x}_i$ and $\mathbf{w}_{k^*} \leftarrow \mathbf{w}_{k^*} - \eta \mathbf{x}_i$
Regularization update: $\mathbf{w}_k \leftarrow (1 - \eta \lambda) \mathbf{w}_k$ for all $k$
3. Coordinate Descent
Optimize one coordinate of $\mathbf{w}$ at a time, cycling through all $(d \times K)$ coordinates. Efficient for sparse data where coordinate updates affect few constraints.
4. PEGASOS-style Algorithm
Primal Estimated sub-GrAdient SOlver for SVM, extended to multi-class:
$$\mathbf{w}_k^{(t+1)} = \frac{1}{\lambda (t+1)} \left[ \mathbf{w}_k^{(t)} - \eta_t \nabla_k L_{CS}(\mathbf{x}_t, y_t) \right]$$
with projection to enforce bounded norm.
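For concreteness, here is a minimal sketch of one Pegasos-style step for this loss, assuming the standard step size $\eta_t = 1/(\lambda t)$ and the optional projection onto the ball of radius $1/\sqrt{\lambda}$; treat it as an illustrative variant rather than a literal transcription of the update above. The fuller SGD class and mini-batch trainer below apply the same violated-pair update with a decaying learning rate.

```python
import numpy as np

def pegasos_multiclass_step(W: np.ndarray, x: np.ndarray, y: int,
                            lam: float, t: int) -> np.ndarray:
    """One Pegasos-style sub-gradient step on the Crammer-Singer loss."""
    eta = 1.0 / (lam * t)                 # standard Pegasos step size (assumption)

    scores = x @ W
    margins = 1 + scores - scores[y]
    margins[y] = -np.inf
    k_star = int(np.argmax(margins))

    # Shrink all weights (gradient of the regularizer), then correct the
    # violated pair of classes if the margin constraint is not met.
    W = (1 - eta * lam) * W
    if margins[k_star] > 0:
        W[:, y] += eta * x
        W[:, k_star] -= eta * x

    # Optional projection onto the ball of radius 1/sqrt(lam)
    norm = np.linalg.norm(W)
    radius = 1.0 / np.sqrt(lam)
    if norm > radius:
        W *= radius / norm
    return W

# Usage: a few steps on random data with 3 classes and 5 features
rng = np.random.default_rng(0)
W = np.zeros((5, 3))
for t in range(1, 101):
    x, y = rng.normal(size=5), int(rng.integers(0, 3))
    W = pegasos_multiclass_step(W, x, y, lam=0.1, t=t)
```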
```python
import numpy as np


class CrammerSingerSGD:
    """
    Crammer-Singer SVM via Stochastic Gradient Descent.
    Efficient for large-scale multi-class classification.
    """

    def __init__(
        self,
        n_classes: int,
        n_features: int,
        C: float = 1.0,
        learning_rate: float = 0.01,
        n_epochs: int = 100,
        decay: float = 0.01
    ):
        self.n_classes = n_classes
        self.n_features = n_features
        self.C = C
        self.learning_rate = learning_rate
        self.n_epochs = n_epochs
        self.decay = decay
        self.lambda_reg = 1 / (C * n_features)  # Regularization strength

        # Initialize weights: (n_features, n_classes)
        self.W = np.zeros((n_features, n_classes))

    def _compute_loss_and_update(
        self, x: np.ndarray, y: int, lr: float
    ) -> float:
        """
        Compute loss for single sample and update weights.
        Returns loss value for monitoring.
        """
        # Compute scores for all classes
        scores = x @ self.W  # (n_classes,)

        # Find the most violating class (not y)
        margins = 1 + scores - scores[y]
        margins[y] = -np.inf  # Exclude correct class
        k_star = np.argmax(margins)
        margin_violation = margins[k_star]

        # Compute hinge loss
        loss = max(0, margin_violation)

        if loss > 0:
            # Gradient update: push w_y higher, push w_k* lower
            self.W[:, y] += lr * x
            self.W[:, k_star] -= lr * x

        # Regularization (weight decay)
        self.W *= (1 - lr * self.lambda_reg)

        return loss

    def fit(self, X: np.ndarray, y: np.ndarray) -> 'CrammerSingerSGD':
        """
        Train the classifier using SGD.
        """
        n_samples = X.shape[0]

        for epoch in range(self.n_epochs):
            # Shuffle data
            indices = np.random.permutation(n_samples)
            epoch_loss = 0.0

            for idx in indices:
                # Learning rate schedule
                t = epoch * n_samples + idx + 1
                lr = self.learning_rate / (1 + self.decay * t)

                loss = self._compute_loss_and_update(X[idx], y[idx], lr)
                epoch_loss += loss

            avg_loss = epoch_loss / n_samples
            if (epoch + 1) % 10 == 0:
                print(f"Epoch {epoch + 1}/{self.n_epochs}, "
                      f"Avg Loss: {avg_loss:.4f}")

        return self

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict class labels."""
        scores = X @ self.W
        return np.argmax(scores, axis=1)

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        """
        Compute class scores (not probabilities, but proportional).
        Use softmax for calibrated probabilities.
        """
        scores = X @ self.W
        # Softmax normalization
        exp_scores = np.exp(scores - np.max(scores, axis=1, keepdims=True))
        return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)


def train_crammer_singer_mini_batch(
    X: np.ndarray,
    y: np.ndarray,
    n_classes: int,
    batch_size: int = 32,
    C: float = 1.0,
    n_epochs: int = 100
) -> np.ndarray:
    """
    Mini-batch SGD for Crammer-Singer.
    More stable than pure SGD due to gradient averaging.
    """
    n_samples, n_features = X.shape
    W = np.zeros((n_features, n_classes))
    lambda_reg = 1 / (C * n_samples)

    for epoch in range(n_epochs):
        indices = np.random.permutation(n_samples)

        for batch_start in range(0, n_samples, batch_size):
            batch_end = min(batch_start + batch_size, n_samples)
            batch_idx = indices[batch_start:batch_end]
            X_batch = X[batch_idx]
            y_batch = y[batch_idx]
            actual_batch = len(batch_idx)

            # Compute gradient for batch
            gradient = np.zeros_like(W)
            for i in range(actual_batch):
                scores = X_batch[i] @ W
                margins = 1 + scores - scores[y_batch[i]]
                margins[y_batch[i]] = -np.inf
                k_star = np.argmax(margins)

                if margins[k_star] > 0:
                    gradient[:, y_batch[i]] -= X_batch[i]
                    gradient[:, k_star] += X_batch[i]

            gradient /= actual_batch

            # Learning rate
            t = epoch * (n_samples // batch_size) + batch_start // batch_size + 1
            lr = 1.0 / (lambda_reg * t)

            # Update with regularization
            W = (1 - lr * lambda_reg) * W - lr * gradient

    return W
```

The Crammer-Singer formulation has strong theoretical foundations with provable generalization bounds.
Generalization Bound (Crammer & Singer, 2001):
With probability at least $1 - \delta$, the generalization error of the Crammer-Singer classifier is bounded by:
$$\text{err}(f) \leq \frac{1}{n} \sum_i \mathbb{1}[\gamma_i < 1] + O\left( \sqrt{\frac{R^2 \|\mathbf{W}\|_F^2 + \log(1/\delta)}{n}} \right)$$
where $R$ is the maximum norm of training examples and $\|\mathbf{W}\|_F$ is the Frobenius norm of the weight matrix.
Interpretation:
The generalization bound directly relates to the joint weight matrix norm, which is what Crammer-Singer explicitly minimizes. Decomposition methods minimize individual classifier weights but don't control their joint complexity.
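To make these quantities concrete, here is a small sketch (our own helper, for illustration only) that evaluates the empirical margin term, $R$, and $\|\mathbf{W}\|_F$ for a given weight matrix; it reports the pieces of the bound but says nothing about the constants hidden in the $O(\cdot)$ term.

```python
import numpy as np

def margin_bound_terms(W: np.ndarray, X: np.ndarray, y: np.ndarray):
    """Evaluate the data-dependent quantities appearing in the bound."""
    scores = X @ W
    n = X.shape[0]

    correct = scores[np.arange(n), y]
    scores_masked = scores.copy()
    scores_masked[np.arange(n), y] = -np.inf
    runner_up = scores_masked.max(axis=1)
    gamma = correct - runner_up                 # multi-class margins

    empirical_term = np.mean(gamma < 1)         # fraction with margin < 1
    R = np.linalg.norm(X, axis=1).max()         # max example norm
    W_frob = np.linalg.norm(W)                  # Frobenius norm of W
    return empirical_term, R, W_frob

# Example with random data (3 classes, 4 features)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = rng.integers(0, 3, size=50)
W = rng.normal(size=(4, 3))
print(margin_bound_terms(W, X, y))
```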
Comparison with Decomposition Methods:
| Aspect | Crammer-Singer | OvO/OvA |
|---|---|---|
| Optimizes | Global multi-class margin | Individual binary margins |
| Generalization bound | Direct (joint) | Indirect (per-classifier) |
| Parameter coupling | Yes (all K classes) | No (independent) |
| Theoretical elegance | High | Moderate |
Margin Definition Matters:
Crammer-Singer's margin is the gap between correct and best incorrect class. This is a more demanding criterion than OvA (correct class beats "rest") or OvO (each pair separately).
Fisher Consistency:
The Crammer-Singer loss is Fisher consistent: as $n \to \infty$, the minimizer of the empirical risk converges to the Bayes optimal classifier (assuming appropriate model capacity).
Comparison with Weston-Watkins:
Another multi-class SVM formulation (Weston & Watkins, 1999) uses a slack variable $\xi_{ik}$ for each (point, class) pair:
$$\min \frac{1}{2} \sum_k \|\mathbf{w}_k\|^2 + C \sum_{i,k} \xi_{ik}$$
subject to: $\mathbf{w}_{y_i}^\top \mathbf{x}_i - \mathbf{w}_k^\top \mathbf{x}_i \geq 1 - \xi_{ik}$ for all $k \neq y_i$, with $\xi_{ik} \geq 0$
Weston-Watkins penalizes all margin violations; Crammer-Singer penalizes only the maximum violation. Crammer-Singer is generally preferred as it focuses on the hardest competitor.
| Formulation | Slack Type | Penalty Focus | Complexity |
|---|---|---|---|
| Crammer-Singer | One per point | Max violation | Lower |
| Weston-Watkins | One per (point, class) | All violations | Higher |
| Lee et al. (2004) | One per point (sum) | Sum of violations | Medium |
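To see the difference on a single example, here is a small sketch (function names are ours) comparing the two penalties:

```python
import numpy as np

def cs_loss_single(scores: np.ndarray, y: int) -> float:
    """Crammer-Singer: penalize only the largest margin violation."""
    viol = 1 + scores - scores[y]
    viol[y] = 0.0
    return max(0.0, viol.max())

def ww_loss_single(scores: np.ndarray, y: int) -> float:
    """Weston-Watkins: penalize every violated margin constraint."""
    viol = 1 + scores - scores[y]
    viol[y] = 0.0
    return np.maximum(0.0, viol).sum()

scores = np.array([1.8, 2.0, 1.5, 0.1])       # made-up scores, true class index 1
print(round(cs_loss_single(scores, 1), 2))    # 0.8 (class 0 is the hardest competitor)
print(round(ww_loss_single(scores, 1), 2))    # 1.3 (0.8 from class 0 plus 0.5 from class 2)
```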
Given the theoretical elegance of Crammer-Singer, why isn't it always used? The answer lies in practical tradeoffs.
Empirical Observations:
Accuracy: In practice, well-tuned OvO/OvA often achieve comparable or even better accuracy than Crammer-Singer on many datasets. The theoretical advantages don't always translate to significant practical gains.
Training speed: OvO trains K(K-1)/2 small problems in parallel; Crammer-Singer solves one large coupled problem. For large n and K, decomposition can be faster.
Scalability: Decomposition methods scale linearly with number of processors (embarrassingly parallel). Crammer-Singer has limited parallelization opportunities.
Library support: OvO/OvA are the defaults in sklearn, LIBSVM, and most libraries. Crammer-Singer requires specialized solvers.
In industry, decomposition methods dominate due to practical considerations. In research on multi-class learning theory or when maximum accuracy is critical, Crammer-Singer provides a principled baseline.
Implementing Crammer-Singer efficiently requires attention to several practical considerations:
1. Initialization:
Weight initialization affects convergence. Common strategies include zero initialization and warm-starting from a one-vs-all solution (both are shown in the implementation below).
2. Feature Preprocessing: Standardizing features (zero mean, unit variance) keeps the K class scores on a comparable scale and generally speeds up SGD convergence.
3. Regularization Tuning:
The C parameter (or equivalently, λ = 1/C) requires careful tuning, typically via cross-validation over a logarithmic grid of values (see `tune_crammer_singer_c` below).
```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import cross_val_score


class PracticalCrammerSinger(BaseEstimator, ClassifierMixin):
    """
    Production-ready Crammer-Singer SVM with sklearn compatibility.

    Features:
    - Automatic label encoding
    - Feature scaling option
    - Cross-validation for C selection
    - Warm-start from OvA
    """

    def __init__(
        self,
        C: float = 1.0,
        learning_rate: float = 0.01,
        n_epochs: int = 100,
        scale_features: bool = True,
        warm_start: bool = False,
        random_state: int = 42
    ):
        self.C = C
        self.learning_rate = learning_rate
        self.n_epochs = n_epochs
        self.scale_features = scale_features
        self.warm_start = warm_start
        self.random_state = random_state

    def fit(self, X, y):
        """
        Fit the Crammer-Singer SVM.
        """
        np.random.seed(self.random_state)

        # Encode labels
        self.label_encoder_ = LabelEncoder()
        y_encoded = self.label_encoder_.fit_transform(y)
        self.classes_ = self.label_encoder_.classes_
        self.n_classes_ = len(self.classes_)

        # Scale features
        if self.scale_features:
            self.scaler_ = StandardScaler()
            X_scaled = self.scaler_.fit_transform(X)
        else:
            X_scaled = X

        n_samples, n_features = X_scaled.shape

        # Initialize weights
        if self.warm_start:
            # Initialize from OvA solution
            self.W_ = self._init_from_ova(X_scaled, y_encoded)
        else:
            self.W_ = np.zeros((n_features, self.n_classes_))

        # Train with SGD
        lambda_reg = 1 / (self.C * n_samples)

        for epoch in range(self.n_epochs):
            indices = np.random.permutation(n_samples)

            for idx in indices:
                x = X_scaled[idx]
                yi = y_encoded[idx]

                # Learning rate decay
                t = epoch * n_samples + idx + 1
                lr = self.learning_rate / np.sqrt(t)

                # Forward pass
                scores = x @ self.W_
                margins = 1 + scores - scores[yi]
                margins[yi] = -np.inf
                k_star = np.argmax(margins)

                # Update if margin violated
                if margins[k_star] > 0:
                    self.W_[:, yi] += lr * x
                    self.W_[:, k_star] -= lr * x

                # Regularization
                self.W_ *= (1 - lr * lambda_reg)

        return self

    def _init_from_ova(self, X, y):
        """
        Initialize weights using OvA classifiers.
        """
        from sklearn.svm import LinearSVC

        n_features = X.shape[1]
        W = np.zeros((n_features, self.n_classes_))

        for k in range(self.n_classes_):
            y_binary = (y == k).astype(int) * 2 - 1  # {-1, +1}
            clf = LinearSVC(C=self.C, max_iter=100)
            clf.fit(X, y_binary)
            W[:, k] = clf.coef_.flatten()

        return W

    def predict(self, X):
        """Predict class labels."""
        if self.scale_features:
            X_scaled = self.scaler_.transform(X)
        else:
            X_scaled = X

        scores = X_scaled @ self.W_
        y_pred_encoded = np.argmax(scores, axis=1)
        return self.label_encoder_.inverse_transform(y_pred_encoded)

    def score(self, X, y):
        """Accuracy score."""
        return np.mean(self.predict(X) == y)


def tune_crammer_singer_c(X, y, C_values=None, cv=5):
    """
    Tune C parameter via cross-validation.
    """
    if C_values is None:
        C_values = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]

    best_C = None
    best_score = 0

    for C in C_values:
        clf = PracticalCrammerSinger(C=C, n_epochs=50)
        scores = cross_val_score(clf, X, y, cv=cv, scoring='accuracy')
        mean_score = np.mean(scores)
        print(f"C={C:>6}, CV Accuracy: {mean_score:.4f} (+/- {np.std(scores):.4f})")

        if mean_score > best_score:
            best_score = mean_score
            best_C = C

    print(f"\nBest C: {best_C}, Best CV Accuracy: {best_score:.4f}")
    return best_C
```

We have thoroughly explored the Crammer-Singer multi-class SVM formulation—the most theoretically principled approach to extending SVMs beyond binary classification. Let's consolidate the key insights:
You now understand the Crammer-Singer formulation in depth—from margin definition through optimization algorithms to practical implementation. Next, we'll cover practical considerations for deploying multi-class SVMs in production, including hyperparameter tuning, computational tradeoffs, and library selection.
What's Next:
The final page of this module covers Practical Considerations for multi-class SVM deployment. We'll synthesize everything learned and provide concrete guidance on method selection, hyperparameter tuning, scalability strategies, and common pitfalls to avoid in production systems.