All decomposition methods (OvO, OvA, DAG) share a fundamental limitation: they solve surrogate problems. None directly optimizes what we truly care about—the multi-class classification margin.
What if we could formulate a single optimization problem that simultaneously considers all K classes, directly maximizing the margin between the correct class and all incorrect classes?
This is precisely what the Crammer-Singer formulation achieves. Introduced by Koby Crammer and Yoram Singer in 2001, it provides the most theoretically principled approach to multi-class SVMs. The formulation elegantly extends the binary maximum margin principle to K classes, yielding a single, coherent optimization problem with provable generalization bounds.
By the end of this page, you will understand the Crammer-Singer primal and dual formulations, the multi-class margin definition, the constrained optimization problem, solution algorithms including online and batch methods, generalization theory, and when this approach outperforms decomposition methods.
Before formulating the optimization, we must generalize the concept of "margin" from binary to multi-class classification.
Binary Margin Recap:
For binary SVM with labels $y \in \{-1, +1\}$, the margin of a point $(\mathbf{x}, y)$ is: $$\gamma = y \cdot (\mathbf{w}^\top \mathbf{x} + b)$$
Correct classification requires $\gamma > 0$; larger margins indicate more confident, robust predictions.
Multi-class Margin Definition:
For K classes, we maintain K weight vectors $\mathbf{w}_1, \ldots, \mathbf{w}_K$ (one per class). The prediction for input $\mathbf{x}$ is: $$\hat{y} = \arg\max_{k \in \{1, \ldots, K\}} \mathbf{w}_k^\top \mathbf{x}$$
(We absorb the bias into $\mathbf{w}$ by appending 1 to $\mathbf{x}$.)
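For instance, a minimal sketch of this bias-absorption trick in NumPy (the helper name `add_bias_column` is ours, purely for illustration):

```python
import numpy as np

def add_bias_column(X: np.ndarray) -> np.ndarray:
    """Append a constant 1 feature so that w_k^T x also carries the bias term."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

X = np.random.randn(5, 3)        # 5 samples, 3 features
print(add_bias_column(X).shape)  # (5, 4): the extra column plays the role of b_k
```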
The multi-class margin for a point $(\mathbf{x}, y)$ is the gap between the correct class score and the highest incorrect class score: $$\gamma = \mathbf{w}_y^\top \mathbf{x} - \max_{k \neq y} \mathbf{w}_k^\top \mathbf{x}$$
A positive margin γ > 0 means the correct class has the highest score—correct classification. A larger margin means the correct class dominates all others more decisively. A negative margin indicates misclassification.
Margin Properties:
Correct classification: $\gamma > 0 \Leftrightarrow$ prediction is correct
Confidence measure: $\gamma$ quantifies how much "room" there is before the prediction would change
Robustness: Larger $\gamma$ implies the prediction is robust to small perturbations in $\mathbf{x}$ or $\mathbf{w}$
Example (K=3 classes):
For input $\mathbf{x}$ with true label $y = 2$, suppose the class scores are $\mathbf{w}_1^\top \mathbf{x} = 0.5$, $\mathbf{w}_2^\top \mathbf{x} = 2.1$, and $\mathbf{w}_3^\top \mathbf{x} = 1.3$.
Margin: $\gamma = 2.1 - \max(0.5, 1.3) = 2.1 - 1.3 = 0.8$
The correct class (2) beats the runner-up (class 3) by 0.8. This is our margin to maximize.
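As a quick check, a minimal NumPy sketch of this margin computation using the example scores above (0-based class indices, so true class 2 has index 1):

```python
import numpy as np

scores = np.array([0.5, 2.1, 1.3])   # w_k^T x for classes 1, 2, 3
y = 1                                # true class 2 (0-based index)

competitors = np.delete(scores, y)   # scores of the incorrect classes
margin = scores[y] - competitors.max()
print(f"{margin:.1f}")               # 0.8, matching the calculation above
```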
The All-vs-All Perspective:
Crammer-Singer requires the correct class to beat every other class by at least margin 1: $$\mathbf{w}_y^\top \mathbf{x} - \mathbf{w}_k^\top \mathbf{x} \geq 1, \quad \forall k \neq y$$
This is equivalent to: $$\mathbf{w}_y^\top \mathbf{x} \geq \mathbf{w}_k^\top \mathbf{x} + 1, \quad \forall k \neq y$$
The correct class score must exceed every incorrect class score by at least 1. This is a single constraint per incorrect class, yielding K-1 constraints per training example.
Comparison to Binary SVM:
| Aspect | Binary SVM | Crammer-Singer |
|---|---|---|
| Weight vectors | 1 ($\mathbf{w}$) | K ($\mathbf{w}_1, \ldots, \mathbf{w}_K$) |
| Constraints per point | 1 | K-1 |
| Margin meaning | Signed distance to hyperplane | Gap to nearest competitor |
| Correct classification | $y \cdot f(\mathbf{x}) > 0$ | $\mathbf{w}_y^\top \mathbf{x} > \max_{k \neq y} \mathbf{w}_k^\top \mathbf{x}$ |
The Crammer-Singer primal formulation extends the binary soft-margin SVM to K classes.
The Crammer-Singer Primal Problem:
$$\min_{\mathbf{w}_1, \ldots, \mathbf{w}_K, \boldsymbol{\xi}} \frac{1}{2} \sum_{k=1}^{K} \|\mathbf{w}_k\|^2 + C \sum_{i=1}^{n} \xi_i$$
subject to: $$\mathbf{w}_{y_i}^\top \mathbf{x}_i - \mathbf{w}_k^\top \mathbf{x}_i \geq 1 - \delta_{y_i, k} - \xi_i, \quad \forall i \in [n], \forall k \in [K]$$ $$\xi_i \geq 0, \quad \forall i \in [n]$$
where $\delta_{y_i, k}$ is the Kronecker delta (1 if $y_i = k$, else 0).
The term (1 - δ) ensures that when k = y_i (correct class), the constraint becomes 0 ≥ -ξ_i, which is always satisfied for ξ_i ≥ 0. For k ≠ y_i, the constraint enforces the margin requirement with slack ξ_i.
Unpacking the Formulation:
Objective Function:
$$\frac{1}{2} \sum_{k=1}^{K} \|\mathbf{w}_k\|^2 + C \sum_{i=1}^{n} \xi_i$$
The first term jointly regularizes all K weight vectors (encouraging a large multi-class margin); the second term penalizes margin violations, with $C$ controlling the tradeoff.
Constraints:
For each training point $i$ and each class $k \neq y_i$: $$\mathbf{w}_{y_i}^\top \mathbf{x}_i - \mathbf{w}_k^\top \mathbf{x}_i \geq 1 - \xi_i$$
This says: the correct class score must beat class $k$'s score by at least $1 - \xi_i$.
Special Properties:
Single slack per point: Unlike formulations with $\xi_{ik}$ for each (point, class) pair, Crammer-Singer uses one $\xi_i$ per point. This $\xi_i$ absorbs the maximum violation across all incorrect classes.
Binding constraint determines slack: $\xi_i = \left[ \max_{k \neq y_i} \left( 1 - (\mathbf{w}_{y_i}^\top \mathbf{x}_i - \mathbf{w}_k^\top \mathbf{x}_i) \right) \right]_+$ where $[z]_+ = \max(0, z)$.
Total constraints: $n \cdot K$ constraints in all (the one for each point's correct class is trivially satisfied).
Alternative Formulation (Equivalent):
We can rewrite the constraints more symmetrically using indicator notation. Let $e_k \in \mathbb{R}^K$ be the one-hot vector for class $k$. Define the combined weight matrix $\mathbf{W} \in \mathbb{R}^{d \times K}$ with columns $\mathbf{w}_1, \ldots, \mathbf{w}_K$.
The constraint becomes: $$\mathbf{W}^\top \mathbf{x}_i \cdot (e_{y_i} - e_k) \geq 1 - \delta_{y_i, k} - \xi_i$$
In matrix form, this is elegant but the practical optimization uses the explicit constraints.
Relation to Multi-class Hinge Loss:
The Crammer-Singer formulation is equivalent to minimizing the multi-class hinge loss: $$L_{CS}(\mathbf{x}, y) = \max_{k \neq y} \left( 1 + \mathbf{w}_k^\top \mathbf{x} - \mathbf{w}_y^\top \mathbf{x} \right)_+$$
This is the largest violation among all incorrect classes, matching our single-slack-per-point formulation.
```python
import numpy as np


def crammer_singer_loss(W: np.ndarray, X: np.ndarray, y: np.ndarray) -> float:
    """
    Compute Crammer-Singer multi-class hinge loss.

    Parameters:
    -----------
    W : array of shape (n_features, n_classes)
        Weight matrix with columns as class weight vectors
    X : array of shape (n_samples, n_features)
        Input data
    y : array of shape (n_samples,)
        True labels (0-indexed)

    Returns:
    --------
    loss : float
        Average Crammer-Singer loss
    """
    n_samples, n_features = X.shape
    n_classes = W.shape[1]

    # Compute all scores: (n_samples, n_classes)
    scores = X @ W

    total_loss = 0.0
    for i in range(n_samples):
        correct_class = y[i]
        correct_score = scores[i, correct_class]

        # Find maximum margin violation among incorrect classes
        max_violation = 0.0
        for k in range(n_classes):
            if k != correct_class:
                # Margin violation: 1 + score_k - score_correct
                violation = 1 + scores[i, k] - correct_score
                max_violation = max(max_violation, violation)

        # Hinge: take max with 0
        total_loss += max(0, max_violation)

    return total_loss / n_samples


def crammer_singer_loss_vectorized(
    W: np.ndarray, X: np.ndarray, y: np.ndarray
) -> float:
    """
    Vectorized Crammer-Singer loss computation.
    """
    scores = X @ W  # (n_samples, n_classes)
    n_samples = X.shape[0]

    # Get correct class scores
    correct_scores = scores[np.arange(n_samples), y]  # (n_samples,)

    # Compute margins: 1 + score_k - score_correct for all k
    margins = 1 + scores - correct_scores[:, np.newaxis]  # (n_samples, n_classes)

    # Zero out the correct class margins (we want max over k != y)
    margins[np.arange(n_samples), y] = 0

    # Max margin violation per sample
    max_violations = np.max(margins, axis=1)

    # Hinge: max(0, max_violation)
    losses = np.maximum(0, max_violations)

    return np.mean(losses)


def crammer_singer_gradient(
    W: np.ndarray, X: np.ndarray, y: np.ndarray
) -> np.ndarray:
    """
    Compute gradient of Crammer-Singer loss w.r.t. W.

    Returns:
    --------
    gradient : array of shape (n_features, n_classes)
    """
    scores = X @ W
    n_samples, n_features = X.shape
    n_classes = W.shape[1]

    # Compute margins
    correct_scores = scores[np.arange(n_samples), y]
    margins = 1 + scores - correct_scores[:, np.newaxis]
    margins[np.arange(n_samples), y] = -np.inf  # Exclude correct class

    # Find the argmax class (the violator)
    violating_class = np.argmax(margins, axis=1)  # (n_samples,)
    max_margins = margins[np.arange(n_samples), violating_class]

    # Gradient contribution only where loss > 0
    active = max_margins > 0  # (n_samples,)

    gradient = np.zeros_like(W)
    for i in range(n_samples):
        if active[i]:
            k = violating_class[i]
            yi = y[i]
            # Gradient: increase w_k, decrease w_yi
            gradient[:, k] += X[i] / n_samples
            gradient[:, yi] -= X[i] / n_samples

    return gradient
```

Like binary SVM, Crammer-Singer has a dual formulation that enables kernelization and provides computational advantages for certain problem sizes.
Deriving the Dual:
Introduce Lagrange multipliers $\alpha_{ik} \geq 0$ for each constraint (training point $i$, class $k$). The Lagrangian is:
$$L = \frac{1}{2} \sum_{k=1}^{K} \|\mathbf{w}_k\|^2 + C \sum_{i=1}^{n} \xi_i - \sum_{i,k} \alpha_{ik} \left[ \mathbf{w}_{y_i}^\top \mathbf{x}_i - \mathbf{w}_k^\top \mathbf{x}_i - 1 + \delta_{y_i,k} + \xi_i \right]$$
KKT Conditions:
Taking derivatives and setting to zero:
$$\frac{\partial L}{\partial \mathbf{w}_k} = 0 \Rightarrow \mathbf{w}_k = \sum_{i: y_i = k} \left( \sum_{j \neq k} \alpha_{ij} \right) \mathbf{x}_i - \sum_{i: y_i \neq k} \alpha_{ik} \mathbf{x}_i$$
$$\frac{\partial L}{\partial \xi_i} = 0 \Rightarrow C = \sum_{k} \alpha_{ik}$$
Each α_ik measures how much the constraint for point i against class k is 'active'. Points with non-zero α are support vectors. The constraint Σ_k α_ik = C ensures the single-slack-per-point structure.
The Crammer-Singer Dual Problem:
After substitution and simplification, the dual becomes:
$$\max_{\boldsymbol{\alpha}} \sum_{i,k} \alpha_{ik} (1 - \delta_{y_i, k}) - \frac{1}{2} \sum_{k=1}^{K} \left\| \sum_i b_{ik} \mathbf{x}_i \right\|^2$$
where $b_{ik} = \alpha_{ik} - \delta_{y_i, k} \sum_j \alpha_{ij}$
subject to: $$\alpha_{ik} \geq 0, \quad \forall i, k$$ $$\sum_{k} \alpha_{ik} = C, \quad \forall i$$ $$\alpha_{i, y_i} = 0, \quad \forall i$$
Key Properties of the Dual:
Dimension: $(n \times K)$ variables, but each row sums to $C$ and one entry is 0, so effectively $n(K-1)$ free variables.
Sparsity: Most $\alpha_{ik}$ are zero; only support vectors have non-zero values.
Kernelization: The dual depends on $\mathbf{x}_i$ only through inner products, enabling kernel extension: $$\left\| \sum_i b_{ik} \mathbf{x}_i \right\|^2 = \sum_{i,j} b_{ik} b_{jk} \langle \mathbf{x}_i, \mathbf{x}_j \rangle = \sum_{i,j} b_{ik} b_{jk} K(\mathbf{x}_i, \mathbf{x}_j)$$
```python
import numpy as np
from scipy.optimize import minimize


class CrammerSingerDual:
    """
    Crammer-Singer SVM solved via dual formulation.
    Supports kernel functions for non-linear classification.
    """

    def __init__(self, C: float = 1.0, kernel: str = 'linear', gamma: float = 1.0):
        """
        Parameters:
        -----------
        C : float
            Regularization parameter
        kernel : str
            'linear' or 'rbf'
        gamma : float
            RBF kernel parameter (ignored for linear)
        """
        self.C = C
        self.kernel = kernel
        self.gamma = gamma
        self.alpha = None
        self.X_train = None
        self.y_train = None
        self.classes_ = None

    def _compute_kernel(self, X1: np.ndarray, X2: np.ndarray) -> np.ndarray:
        """Compute kernel matrix."""
        if self.kernel == 'linear':
            return X1 @ X2.T
        elif self.kernel == 'rbf':
            # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x·y
            sq_X1 = np.sum(X1 ** 2, axis=1, keepdims=True)
            sq_X2 = np.sum(X2 ** 2, axis=1, keepdims=True)
            dist_sq = sq_X1 + sq_X2.T - 2 * (X1 @ X2.T)
            return np.exp(-self.gamma * dist_sq)
        else:
            raise ValueError(f"Unknown kernel: {self.kernel}")

    def fit(self, X: np.ndarray, y: np.ndarray) -> 'CrammerSingerDual':
        """
        Fit the Crammer-Singer SVM.
        """
        self.X_train = X
        self.y_train = y
        self.classes_ = np.unique(y)
        self.n_classes = len(self.classes_)
        n_samples, n_features = X.shape
        K = self.n_classes

        # Precompute kernel matrix
        self.K_matrix = self._compute_kernel(X, X)

        # Initialize alpha: shape (n_samples, n_classes)
        # Each row sums to C, entry at y_i is 0
        alpha_init = np.zeros((n_samples, K))
        for i in range(n_samples):
            # Distribute C evenly among incorrect classes
            for k in range(K):
                if k != y[i]:
                    alpha_init[i, k] = self.C / (K - 1)

        # Flatten for optimizer
        alpha_flat = alpha_init.flatten()

        # Optimization with constraints
        def objective(alpha_flat):
            alpha = alpha_flat.reshape((n_samples, K))

            # First term: sum of alpha * (1 - delta)
            term1 = 0
            for i in range(n_samples):
                for k in range(K):
                    if k != y[i]:
                        term1 += alpha[i, k]

            # Second term: quadratic
            term2 = 0
            for k in range(K):
                # Compute b_ik for this class
                b_k = np.zeros(n_samples)
                for i in range(n_samples):
                    sum_alpha_i = np.sum(alpha[i])
                    b_k[i] = alpha[i, k] - (1 if y[i] == k else 0) * sum_alpha_i
                # ||sum_i b_ik x_i||^2 = b_k^T K b_k
                term2 += b_k @ self.K_matrix @ b_k

            # Dual maximization = minimize negative
            return -(term1 - 0.5 * term2)

        # This is a simplified version; production code uses specialized solvers
        print("Solving Crammer-Singer dual optimization...")
        result = minimize(
            objective,
            alpha_flat,
            method='L-BFGS-B',
            bounds=[(0, self.C) for _ in range(n_samples * K)],
            options={'maxiter': 1000}
        )

        self.alpha = result.x.reshape((n_samples, K))
        return self

    def predict(self, X: np.ndarray) -> np.ndarray:
        """
        Predict class labels.
        """
        # Compute kernel between test and training points
        K_test = self._compute_kernel(X, self.X_train)
        n_samples = X.shape[0]
        n_train = self.X_train.shape[0]
        K = self.n_classes

        # Compute w_k^T x for each class via kernel expansion
        scores = np.zeros((n_samples, K))
        for k in range(K):
            # b_ik for training points
            b_k = np.zeros(n_train)
            for i in range(n_train):
                sum_alpha_i = np.sum(self.alpha[i])
                b_k[i] = self.alpha[i, k] - (1 if self.y_train[i] == k else 0) * sum_alpha_i
            # Score for class k: sum_i b_ik K(x, x_i)
            scores[:, k] = K_test @ b_k

        return self.classes_[np.argmax(scores, axis=1)]
```

Solving the Crammer-Singer optimization is computationally challenging due to the coupled structure (single slack per point implies constraints interact). Several specialized algorithms have been developed.
1. Sequential Minimal Optimization (SMO) Extension
The SMO algorithm, which efficiently solves binary SVMs by optimizing two variables at a time, can be extended to Crammer-Singer. The key insight is that due to the $\sum_k \alpha_{ik} = C$ constraint, we must optimize at least two $\alpha_{ik}$ values simultaneously for each point $i$.
Modified SMO for Crammer-Singer: because each row of $\boldsymbol{\alpha}$ must sum to $C$, the working set always contains at least two dual variables from the same training point, and each step re-optimizes those variables jointly while the rest stay fixed. This approach, used in LIBLINEAR, achieves near-optimal efficiency for medium-sized problems.
Efficient working set selection is crucial for SMO-like algorithms. Heuristics like Maximum Violating Pair (selecting the two classes with largest KKT violation) significantly accelerate convergence.
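As a rough sketch of the idea (not LIBLINEAR's exact rule), the snippet below selects a violating pair among one point's dual variables: the class whose margin constraint is most violated is paired with a class whose $\alpha_{ik}$ can still be decreased. The violation measure `g` and the helper name are our own simplifications, chosen to match the primal constraints on this page.

```python
import numpy as np

def select_violating_pair(scores: np.ndarray, alpha_i: np.ndarray, y_i: int):
    """
    Rough maximum-violating-pair heuristic for one training point.

    scores  : w_k^T x_i for every class k, shape (K,)
    alpha_i : current dual variables alpha_{ik} for this point, shape (K,)
    y_i     : true class index
    """
    # Violation proxy per class: positive when the margin constraint is violated
    g = 1.0 - (scores[y_i] - scores)
    g[y_i] = -np.inf                 # never select the correct class to "increase"

    # Class whose constraint is most violated: candidate to increase alpha
    k_up = int(np.argmax(g))

    # Among classes whose alpha can still decrease, pick the least violated one
    g_down = np.where(alpha_i > 0, g, np.inf)
    g_down[y_i] = np.inf
    k_down = int(np.argmin(g_down))

    return k_up, k_down

# Tiny usage example with made-up numbers (4 classes, true class index 1)
scores = np.array([0.2, 1.5, 0.9, 0.4])
alpha_i = np.array([0.0, 0.0, 0.3, 0.7])
print(select_violating_pair(scores, alpha_i, y_i=1))  # (2, 3)
```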
2. Stochastic Gradient Descent (SGD)
For large-scale problems, SGD on the primal is often most practical:
Update Rule:
For each randomly sampled $(\mathbf{x}_i, y_i)$:
Find the violating class: $k^* = \arg\max_{k \neq y_i} (1 + \mathbf{w}_k^\top \mathbf{x}_i - \mathbf{w}_{y_i}^\top \mathbf{x}_i)$
If the margin is violated ($1 + \mathbf{w}_{k^*}^\top \mathbf{x}_i - \mathbf{w}_{y_i}^\top \mathbf{x}_i > 0$), update the violated pair: $\mathbf{w}_{y_i} \leftarrow \mathbf{w}_{y_i} + \eta \mathbf{x}_i$ and $\mathbf{w}_{k^*} \leftarrow \mathbf{w}_{k^*} - \eta \mathbf{x}_i$
Regularization update: $\mathbf{w}_k \leftarrow (1 - \eta \lambda) \mathbf{w}_k$ for all $k$
3. Coordinate Descent
Optimize one coordinate of $\mathbf{w}$ at a time, cycling through all $(d \times K)$ coordinates. Efficient for sparse data where coordinate updates affect few constraints.
4. PEGASOS-style Algorithm
Primal Estimated sub-GrAdient SOlver for SVM, extended to multi-class:
$$\mathbf{w}_k^{(t+1)} = \frac{1}{\lambda (t+1)} \left[ \mathbf{w}_k^{(t)} - \eta_t \nabla_k L_{CS}(\mathbf{x}_t, y_t) \right]$$
with projection to enforce bounded norm.
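For concreteness, here is a minimal sketch of one Pegasos-style step for this loss, assuming the standard step size $\eta_t = 1/(\lambda t)$ and the optional projection onto the ball of radius $1/\sqrt{\lambda}$; treat it as an illustrative variant rather than a literal transcription of the update above. The fuller SGD class and mini-batch trainer below apply the same violated-pair update with a decaying learning rate.

```python
import numpy as np

def pegasos_multiclass_step(W: np.ndarray, x: np.ndarray, y: int,
                            lam: float, t: int) -> np.ndarray:
    """One Pegasos-style sub-gradient step on the Crammer-Singer loss."""
    eta = 1.0 / (lam * t)                 # standard Pegasos step size (assumption)

    scores = x @ W
    margins = 1 + scores - scores[y]
    margins[y] = -np.inf
    k_star = int(np.argmax(margins))

    # Shrink all weights (gradient of the regularizer), then correct the
    # violated pair of classes if the margin constraint is not met.
    W = (1 - eta * lam) * W
    if margins[k_star] > 0:
        W[:, y] += eta * x
        W[:, k_star] -= eta * x

    # Optional projection onto the ball of radius 1/sqrt(lam)
    norm = np.linalg.norm(W)
    radius = 1.0 / np.sqrt(lam)
    if norm > radius:
        W *= radius / norm
    return W

# Usage: a few steps on random data with 3 classes and 5 features
rng = np.random.default_rng(0)
W = np.zeros((5, 3))
for t in range(1, 101):
    x, y = rng.normal(size=5), int(rng.integers(0, 3))
    W = pegasos_multiclass_step(W, x, y, lam=0.1, t=t)
```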
```python
import numpy as np


class CrammerSingerSGD:
    """
    Crammer-Singer SVM via Stochastic Gradient Descent.
    Efficient for large-scale multi-class classification.
    """

    def __init__(
        self,
        n_classes: int,
        n_features: int,
        C: float = 1.0,
        learning_rate: float = 0.01,
        n_epochs: int = 100,
        decay: float = 0.01
    ):
        self.n_classes = n_classes
        self.n_features = n_features
        self.C = C
        self.learning_rate = learning_rate
        self.n_epochs = n_epochs
        self.decay = decay
        self.lambda_reg = 1 / (C * n_features)  # Regularization strength

        # Initialize weights: (n_features, n_classes)
        self.W = np.zeros((n_features, n_classes))

    def _compute_loss_and_update(
        self, x: np.ndarray, y: int, lr: float
    ) -> float:
        """
        Compute loss for single sample and update weights.
        Returns loss value for monitoring.
        """
        # Compute scores for all classes
        scores = x @ self.W  # (n_classes,)

        # Find the most violating class (not y)
        margins = 1 + scores - scores[y]
        margins[y] = -np.inf  # Exclude correct class
        k_star = np.argmax(margins)
        margin_violation = margins[k_star]

        # Compute hinge loss
        loss = max(0, margin_violation)

        if loss > 0:
            # Gradient update: push w_y higher, push w_k* lower
            self.W[:, y] += lr * x
            self.W[:, k_star] -= lr * x

        # Regularization (weight decay)
        self.W *= (1 - lr * self.lambda_reg)

        return loss

    def fit(self, X: np.ndarray, y: np.ndarray) -> 'CrammerSingerSGD':
        """
        Train the classifier using SGD.
        """
        n_samples = X.shape[0]

        for epoch in range(self.n_epochs):
            # Shuffle data
            indices = np.random.permutation(n_samples)
            epoch_loss = 0.0

            for idx in indices:
                # Learning rate schedule
                t = epoch * n_samples + idx + 1
                lr = self.learning_rate / (1 + self.decay * t)

                loss = self._compute_loss_and_update(X[idx], y[idx], lr)
                epoch_loss += loss

            avg_loss = epoch_loss / n_samples
            if (epoch + 1) % 10 == 0:
                print(f"Epoch {epoch + 1}/{self.n_epochs}, "
                      f"Avg Loss: {avg_loss:.4f}")

        return self

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict class labels."""
        scores = X @ self.W
        return np.argmax(scores, axis=1)

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        """
        Compute class scores (not probabilities, but proportional).
        Use softmax for calibrated probabilities.
        """
        scores = X @ self.W
        # Softmax normalization
        exp_scores = np.exp(scores - np.max(scores, axis=1, keepdims=True))
        return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)


def train_crammer_singer_mini_batch(
    X: np.ndarray,
    y: np.ndarray,
    n_classes: int,
    batch_size: int = 32,
    C: float = 1.0,
    n_epochs: int = 100
) -> np.ndarray:
    """
    Mini-batch SGD for Crammer-Singer.
    More stable than pure SGD due to gradient averaging.
    """
    n_samples, n_features = X.shape
    W = np.zeros((n_features, n_classes))
    lambda_reg = 1 / (C * n_samples)

    for epoch in range(n_epochs):
        indices = np.random.permutation(n_samples)

        for batch_start in range(0, n_samples, batch_size):
            batch_end = min(batch_start + batch_size, n_samples)
            batch_idx = indices[batch_start:batch_end]
            X_batch = X[batch_idx]
            y_batch = y[batch_idx]
            actual_batch = len(batch_idx)

            # Compute gradient for batch
            gradient = np.zeros_like(W)
            for i in range(actual_batch):
                scores = X_batch[i] @ W
                margins = 1 + scores - scores[y_batch[i]]
                margins[y_batch[i]] = -np.inf
                k_star = np.argmax(margins)

                if margins[k_star] > 0:
                    gradient[:, y_batch[i]] -= X_batch[i]
                    gradient[:, k_star] += X_batch[i]

            gradient /= actual_batch

            # Learning rate
            t = epoch * (n_samples // batch_size) + batch_start // batch_size + 1
            lr = 1.0 / (lambda_reg * t)

            # Update with regularization
            W = (1 - lr * lambda_reg) * W - lr * gradient

    return W
```

The Crammer-Singer formulation has strong theoretical foundations with provable generalization bounds.
Generalization Bound (Crammer & Singer, 2001):
With probability at least $1 - \delta$, the generalization error of the Crammer-Singer classifier is bounded by:
$$\text{err}(f) \leq \frac{1}{n} \sum_i \mathbb{1}[\gamma_i < 1] + O\left( \sqrt{\frac{R^2 \|\mathbf{W}\|_F^2 + \log(1/\delta)}{n}} \right)$$
where $R$ is the maximum norm of training examples and $\|\mathbf{W}\|_F$ is the Frobenius norm of the weight matrix.
Interpretation:
The generalization bound directly relates to the joint weight matrix norm, which is what Crammer-Singer explicitly minimizes. Decomposition methods minimize individual classifier weights but don't control their joint complexity.
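To make these quantities concrete, here is a small sketch (our own helper, for illustration only) that evaluates the empirical margin term, $R$, and $\|\mathbf{W}\|_F$ for a given weight matrix; it reports the pieces of the bound but says nothing about the constants hidden in the $O(\cdot)$ term.

```python
import numpy as np

def margin_bound_terms(W: np.ndarray, X: np.ndarray, y: np.ndarray):
    """Evaluate the data-dependent quantities appearing in the bound."""
    scores = X @ W
    n = X.shape[0]

    correct = scores[np.arange(n), y]
    scores_masked = scores.copy()
    scores_masked[np.arange(n), y] = -np.inf
    runner_up = scores_masked.max(axis=1)
    gamma = correct - runner_up                 # multi-class margins

    empirical_term = np.mean(gamma < 1)         # fraction with margin < 1
    R = np.linalg.norm(X, axis=1).max()         # max example norm
    W_frob = np.linalg.norm(W)                  # Frobenius norm of W
    return empirical_term, R, W_frob

# Example with random data (3 classes, 4 features)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = rng.integers(0, 3, size=50)
W = rng.normal(size=(4, 3))
print(margin_bound_terms(W, X, y))
```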
Comparison with Decomposition Methods:
| Aspect | Crammer-Singer | OvO/OvA |
|---|---|---|
| Optimizes | Global multi-class margin | Individual binary margins |
| Generalization bound | Direct (joint) | Indirect (per-classifier) |
| Parameter coupling | Yes (all K classes) | No (independent) |
| Theoretical elegance | High | Moderate |
Margin Definition Matters:
Crammer-Singer's margin is the gap between correct and best incorrect class. This is a more demanding criterion than OvA (correct class beats "rest") or OvO (each pair separately).
Fisher Consistency:
The Crammer-Singer loss is Fisher consistent: as $n \to \infty$, the minimizer of the empirical risk converges to the Bayes optimal classifier (assuming appropriate model capacity).
Comparison with Weston-Watkins:
Another multi-class SVM formulation (Weston & Watkins, 1999) uses a slack variable $\xi_{ik}$ for each (point, class) pair:
$$\min \frac{1}{2} \sum_k \|\mathbf{w}_k\|^2 + C \sum_{i,k} \xi_{ik}$$
subject to: $\mathbf{w}_{y_i}^\top \mathbf{x}_i - \mathbf{w}_k^\top \mathbf{x}_i \geq 1 - \xi_{ik}$ for all $k \neq y_i$, with $\xi_{ik} \geq 0$
Weston-Watkins penalizes all margin violations; Crammer-Singer penalizes only the maximum violation. Crammer-Singer is generally preferred as it focuses on the hardest competitor.
| Formulation | Slack Type | Penalty Focus | Complexity |
|---|---|---|---|
| Crammer-Singer | One per point | Max violation | Lower |
| Weston-Watkins | One per (point, class) | All violations | Higher |
| Lee et al. (2004) | One per point (sum) | Sum of violations | Medium |
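To see the difference on a single example, here is a small sketch (function names are ours) comparing the two penalties:

```python
import numpy as np

def cs_loss_single(scores: np.ndarray, y: int) -> float:
    """Crammer-Singer: penalize only the largest margin violation."""
    viol = 1 + scores - scores[y]
    viol[y] = 0.0
    return max(0.0, viol.max())

def ww_loss_single(scores: np.ndarray, y: int) -> float:
    """Weston-Watkins: penalize every violated margin constraint."""
    viol = 1 + scores - scores[y]
    viol[y] = 0.0
    return np.maximum(0.0, viol).sum()

scores = np.array([1.8, 2.0, 1.5, 0.1])       # made-up scores, true class index 1
print(round(cs_loss_single(scores, 1), 2))    # 0.8 (class 0 is the hardest competitor)
print(round(ww_loss_single(scores, 1), 2))    # 1.3 (0.8 from class 0 plus 0.5 from class 2)
```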
Given the theoretical elegance of Crammer-Singer, why isn't it always used? The answer lies in practical tradeoffs.
Empirical Observations:
Accuracy: In practice, well-tuned OvO/OvA often achieve comparable or even better accuracy than Crammer-Singer on many datasets. The theoretical advantages don't always translate to significant practical gains.
Training speed: OvO trains K(K-1)/2 small problems in parallel; Crammer-Singer solves one large coupled problem. For large n and K, decomposition can be faster.
Scalability: Decomposition methods scale linearly with number of processors (embarrassingly parallel). Crammer-Singer has limited parallelization opportunities.
Library support: OvO/OvA are the defaults in sklearn, LIBSVM, and most libraries. Crammer-Singer requires specialized solvers.
In industry, decomposition methods dominate due to practical considerations. In research on multi-class learning theory or when maximum accuracy is critical, Crammer-Singer provides a principled baseline.
Implementing Crammer-Singer efficiently requires attention to several practical considerations:
1. Initialization:
Weight initialization affects convergence. Common strategies include zero initialization and warm-starting from a one-vs-all solution (both are shown in the implementation below).
2. Feature Preprocessing: Standardizing features (zero mean, unit variance) keeps the K class scores on a comparable scale and generally speeds up SGD convergence.
3. Regularization Tuning:
The C parameter (or equivalently, λ = 1/C) requires careful tuning, typically via cross-validation over a logarithmic grid of values (see `tune_crammer_singer_c` below).
```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import cross_val_score


class PracticalCrammerSinger(BaseEstimator, ClassifierMixin):
    """
    Production-ready Crammer-Singer SVM with sklearn compatibility.

    Features:
    - Automatic label encoding
    - Feature scaling option
    - Cross-validation for C selection
    - Warm-start from OvA
    """

    def __init__(
        self,
        C: float = 1.0,
        learning_rate: float = 0.01,
        n_epochs: int = 100,
        scale_features: bool = True,
        warm_start: bool = False,
        random_state: int = 42
    ):
        self.C = C
        self.learning_rate = learning_rate
        self.n_epochs = n_epochs
        self.scale_features = scale_features
        self.warm_start = warm_start
        self.random_state = random_state

    def fit(self, X, y):
        """
        Fit the Crammer-Singer SVM.
        """
        np.random.seed(self.random_state)

        # Encode labels
        self.label_encoder_ = LabelEncoder()
        y_encoded = self.label_encoder_.fit_transform(y)
        self.classes_ = self.label_encoder_.classes_
        self.n_classes_ = len(self.classes_)

        # Scale features
        if self.scale_features:
            self.scaler_ = StandardScaler()
            X_scaled = self.scaler_.fit_transform(X)
        else:
            X_scaled = X

        n_samples, n_features = X_scaled.shape

        # Initialize weights
        if self.warm_start:
            # Initialize from OvA solution
            self.W_ = self._init_from_ova(X_scaled, y_encoded)
        else:
            self.W_ = np.zeros((n_features, self.n_classes_))

        # Train with SGD
        lambda_reg = 1 / (self.C * n_samples)

        for epoch in range(self.n_epochs):
            indices = np.random.permutation(n_samples)

            for idx in indices:
                x = X_scaled[idx]
                yi = y_encoded[idx]

                # Learning rate decay
                t = epoch * n_samples + idx + 1
                lr = self.learning_rate / np.sqrt(t)

                # Forward pass
                scores = x @ self.W_
                margins = 1 + scores - scores[yi]
                margins[yi] = -np.inf
                k_star = np.argmax(margins)

                # Update if margin violated
                if margins[k_star] > 0:
                    self.W_[:, yi] += lr * x
                    self.W_[:, k_star] -= lr * x

                # Regularization
                self.W_ *= (1 - lr * lambda_reg)

        return self

    def _init_from_ova(self, X, y):
        """
        Initialize weights using OvA classifiers.
        """
        from sklearn.svm import LinearSVC

        n_features = X.shape[1]
        W = np.zeros((n_features, self.n_classes_))

        for k in range(self.n_classes_):
            y_binary = (y == k).astype(int) * 2 - 1  # {-1, +1}
            clf = LinearSVC(C=self.C, max_iter=100)
            clf.fit(X, y_binary)
            W[:, k] = clf.coef_.flatten()

        return W

    def predict(self, X):
        """Predict class labels."""
        if self.scale_features:
            X_scaled = self.scaler_.transform(X)
        else:
            X_scaled = X

        scores = X_scaled @ self.W_
        y_pred_encoded = np.argmax(scores, axis=1)
        return self.label_encoder_.inverse_transform(y_pred_encoded)

    def score(self, X, y):
        """Accuracy score."""
        return np.mean(self.predict(X) == y)


def tune_crammer_singer_c(X, y, C_values=None, cv=5):
    """
    Tune C parameter via cross-validation.
    """
    if C_values is None:
        C_values = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]

    best_C = None
    best_score = 0

    for C in C_values:
        clf = PracticalCrammerSinger(C=C, n_epochs=50)
        scores = cross_val_score(clf, X, y, cv=cv, scoring='accuracy')
        mean_score = np.mean(scores)
        print(f"C={C:>6}, CV Accuracy: {mean_score:.4f} (+/- {np.std(scores):.4f})")

        if mean_score > best_score:
            best_score = mean_score
            best_C = C

    print(f"\nBest C: {best_C}, Best CV Accuracy: {best_score:.4f}")
    return best_C
```

We have thoroughly explored the Crammer-Singer multi-class SVM formulation—the most theoretically principled approach to extending SVMs beyond binary classification. Let's consolidate the key insights:
You now understand the Crammer-Singer formulation in depth—from margin definition through optimization algorithms to practical implementation. Next, we'll cover practical considerations for deploying multi-class SVMs in production, including hyperparameter tuning, computational tradeoffs, and library selection.
What's Next:
The final page of this module covers Practical Considerations for multi-class SVM deployment. We'll synthesize everything learned and provide concrete guidance on method selection, hyperparameter tuning, scalability strategies, and common pitfalls to avoid in production systems.