Multinomial logistic regression elegantly handles all $K$ classes simultaneously through the softmax function. But there's an alternative philosophy: decompose the multi-class problem into multiple binary classification problems.
The one-vs-all (OvA) strategy—also known as one-vs-rest (OvR)—is the simplest and most widely used decomposition method. It trains $K$ independent binary classifiers, each distinguishing one class from all others.
This approach has historical significance (predating efficient multinomial methods), practical utility (embarrassingly parallel training), and theoretical interest (different inductive biases). Understanding OvA deepens appreciation for the design choices underlying multi-class classification.
By the end of this page, you will understand: how to construct and train OvA classifiers; prediction strategies for combining binary outputs; theoretical comparison with multinomial methods; the decision boundary geometry; handling of imbalanced decomposed problems; and when to prefer OvA versus softmax approaches.
Core Idea
Given $K$ classes $\{1, 2, \ldots, K\}$, train $K$ binary classifiers:
Each classifier $f_k$ answers: 'Does this example belong to class $k$ or not?'
Training Data Transformation
For training classifier $f_k$: $$y_i^{(k)} = \begin{cases} +1 & \text{if } y_i = k \\ -1 & \text{if } y_i \neq k \end{cases}$$
All original training examples are used, with relabeled targets. Classifier $f_k$ sees the $n_k$ examples of class $k$ as positives and the remaining $n - n_k$ examples as negatives.
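As a quick sketch with hypothetical labels, this relabeling is a single vectorized comparison in NumPy:

```python
import numpy as np

# Hypothetical labels for a 4-class problem
y = np.array([0, 2, 1, 3, 2, 0, 1])

# Relabel for classifier f_k with k = 2: +1 for class 2, -1 for the rest
k = 2
y_k = np.where(y == k, 1, -1)
print(y_k)  # [-1  1 -1 -1  1 -1 -1]
```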
Base Classifier Choice
OvA works with any binary classifier: logistic regression, support vector machines, decision trees, or neural networks.
The choice affects both training efficiency and prediction strategy.
OvA with Logistic Regression
Each classifier $f_k$ is a binary logistic regression: $$f_k(\mathbf{x}) = \sigma(\mathbf{w}_k^T \mathbf{x} + b_k) = \frac{1}{1 + e^{-(\mathbf{w}_k^T \mathbf{x} + b_k)}}$$
This outputs $P(y = k | \mathbf{x})$ under the model that assumes a binary world of class $k$ vs. 'everything else.'
Parameter Count:
With $d$ features, multinomial logistic regression (using one class as the reference) has $(K-1)(d+1)$ parameters, while OvA has $K(d+1)$: one weight vector and one bias per classifier. OvA has slightly more parameters, but they're estimated independently.
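A tiny arithmetic check, under the assumption that the multinomial model fixes one class as a reference (the $d$ and $K$ values are hypothetical):

```python
# Hypothetical sizes: d features, K classes, bias terms included
d, K = 10, 4

multinomial = (K - 1) * (d + 1)  # reference-class parameterization
ova = K * (d + 1)                # one weight vector + bias per binary classifier

print(multinomial, ova)  # 33 44
```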
OvA trains $K$ classifiers independently—no shared information flow. This enables parallel training but means each classifier is oblivious to the task of distinguishing between non-target classes. The multinomial approach, by contrast, jointly learns all class relationships.
Given $K$ trained classifiers $\{f_1, \ldots, f_K\}$, how do we predict the class for a new input $\mathbf{x}$? Several strategies exist.
Strategy 1: Maximum Score (Winner-Take-All)
Predict the class whose classifier outputs the highest score:
$$\hat{y} = \arg\max_{k \in \{1, \ldots, K\}} f_k(\mathbf{x})$$
For logistic regression classifiers, $f_k(\mathbf{x}) = \sigma(z_k)$ where $z_k = \mathbf{w}_k^T \mathbf{x} + b_k$.
Since $\sigma$ is monotonic, this is equivalent to:
$$\hat{y} = \arg\max_{k} z_k = \arg\max_{k} (\mathbf{w}_k^T \mathbf{x} + b_k)$$
Observation: This is exactly the same prediction rule as multinomial logistic regression! However, the parameters $\mathbf{w}_k$ are estimated differently.
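A minimal sketch with made-up weights, showing that winner-take-all over sigmoid outputs reduces to an argmax over the raw logits:

```python
import numpy as np

# Hypothetical learned parameters for K=3 OvA classifiers, d=2 features
W = np.array([[ 1.0, -0.5],
              [-0.3,  0.8],
              [ 0.2,  0.1]])     # shape (K, d)
b = np.array([0.1, -0.2, 0.0])

x = np.array([0.5, 1.0])

z = W @ x + b                    # one logit z_k per classifier
y_hat = int(np.argmax(z))        # sigma is monotonic, so argmax over
                                 # sigma(z) equals argmax over z
print(y_hat)  # 1
```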
Strategy 2: Probability Normalization
The raw outputs $f_k(\mathbf{x})$ don't sum to 1 (each is an independent binary probability). To get a proper distribution, normalize:
$$P_{\text{OvA}}(y=k|\mathbf{x}) = \frac{f_k(\mathbf{x})}{\sum_{j=1}^{K} f_j(\mathbf{x})}$$
Warning: This normalization is ad-hoc. The binary probabilities aren't designed to be normalized—they answer different questions. Use with caution.
Strategy 3: Softmax on Logits
Apply softmax to the raw logits from each classifier:
$$P_{\text{OvA-softmax}}(y=k|\mathbf{x}) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$
where $z_k = \mathbf{w}_k^T \mathbf{x} + b_k$.
This produces a proper probability distribution and is commonly used when probability outputs are needed from OvA classifiers.
Note: The resulting probabilities are generally not the same as those from true multinomial logistic regression, even though the formula looks identical. The difference is in how $\mathbf{w}_k$ were trained.
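To see Strategies 2 and 3 side by side, here is a sketch with hypothetical logits from three OvA classifiers; both "probability" vectors sum to 1, yet they generally disagree:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical logits z_k = w_k^T x + b_k from K=3 OvA classifiers
z = np.array([2.0, -1.0, 0.5])

raw = sigmoid(z)                         # independent binary probabilities
print(round(raw.sum(), 3))               # generally != 1

normalized = raw / raw.sum()             # Strategy 2: ad-hoc normalization
softmaxed = np.exp(z) / np.exp(z).sum()  # Strategy 3: softmax on logits

print(normalized.round(3))
print(softmaxed.round(3))
```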
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report


class OneVsAllClassifier:
    """
    One-vs-All multi-class classifier using logistic regression.
    """
    def __init__(self, C=1.0, max_iter=1000):
        self.C = C
        self.max_iter = max_iter
        self.classifiers = []
        self.classes_ = None

    def fit(self, X, y):
        """
        Train K binary classifiers.

        Args:
            X: Features (n_samples, n_features)
            y: Labels (n_samples,)
        """
        self.classes_ = np.unique(y)
        K = len(self.classes_)
        self.classifiers = []

        for k, class_label in enumerate(self.classes_):
            # Create binary labels: class k = 1, all others = 0
            y_binary = (y == class_label).astype(int)

            # Train binary logistic regression
            clf = LogisticRegression(
                C=self.C, max_iter=self.max_iter,
                solver='lbfgs', random_state=42
            )
            clf.fit(X, y_binary)
            self.classifiers.append(clf)

            print(f"Trained classifier {k+1}/{K}: "
                  f"Class '{class_label}' vs rest")

        return self

    def predict(self, X):
        """
        Predict class labels using maximum score strategy.
        """
        scores = self.decision_function(X)
        return self.classes_[np.argmax(scores, axis=1)]

    def decision_function(self, X):
        """
        Get raw scores (logits) from each classifier.

        Returns:
            scores: (n_samples, K) matrix
        """
        scores = np.zeros((X.shape[0], len(self.classes_)))
        for k, clf in enumerate(self.classifiers):
            # Use decision function (raw logit) not probability
            scores[:, k] = clf.decision_function(X)
        return scores

    def predict_proba_raw(self, X):
        """
        Get raw probabilities from each binary classifier.
        Note: These don't sum to 1!
        """
        probs = np.zeros((X.shape[0], len(self.classes_)))
        for k, clf in enumerate(self.classifiers):
            probs[:, k] = clf.predict_proba(X)[:, 1]  # P(class k)
        return probs

    def predict_proba_normalized(self, X):
        """
        Get probabilities normalized to sum to 1.
        """
        probs = self.predict_proba_raw(X)
        return probs / probs.sum(axis=1, keepdims=True)

    def predict_proba_softmax(self, X):
        """
        Apply softmax to logits for probability output.
        """
        logits = self.decision_function(X)
        exp_logits = np.exp(logits - logits.max(axis=1, keepdims=True))
        return exp_logits / exp_logits.sum(axis=1, keepdims=True)


# Comparison: OvA vs Multinomial
def compare_ova_multinomial(X, y, test_size=0.2):
    """
    Compare OvA and multinomial logistic regression.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42, stratify=y
    )

    # Standardize
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # OvA classifier
    print("=== Training One-vs-All ===")
    ova = OneVsAllClassifier(C=1.0)
    ova.fit(X_train_scaled, y_train)
    ova_pred = ova.predict(X_test_scaled)
    ova_acc = accuracy_score(y_test, ova_pred)

    # Multinomial classifier
    print("=== Training Multinomial ===")
    multi = LogisticRegression(
        C=1.0, multi_class='multinomial', solver='lbfgs',
        max_iter=1000, random_state=42
    )
    multi.fit(X_train_scaled, y_train)
    multi_pred = multi.predict(X_test_scaled)
    multi_acc = accuracy_score(y_test, multi_pred)

    print("=== Results ===")
    print(f"OvA Accuracy:         {ova_acc:.4f}")
    print(f"Multinomial Accuracy: {multi_acc:.4f}")

    # Compare probability outputs
    print("=== Probability Comparison (first 3 samples) ===")
    ova_probs = ova.predict_proba_softmax(X_test_scaled[:3])
    multi_probs = multi.predict_proba(X_test_scaled[:3])
    print("OvA (softmax on logits):")
    print(ova_probs.round(3))
    print("Multinomial:")
    print(multi_probs.round(3))

    return ova, multi


# Example
if __name__ == "__main__":
    from sklearn.datasets import make_classification

    # Generate multi-class data
    X, y = make_classification(
        n_samples=1000, n_features=10, n_informative=5,
        n_redundant=2, n_classes=4, n_clusters_per_class=1,
        random_state=42
    )
    ova, multi = compare_ova_multinomial(X, y)
```

Both OvA and multinomial logistic regression produce linear decision boundaries, but with subtle differences in their structure.
Multinomial Decision Boundaries
Recall that the boundary between classes $k$ and $l$ in multinomial logistic regression is: $$(\mathbf{w}_k - \mathbf{w}_l)^T \mathbf{x} + (b_k - b_l) = 0$$
All $\binom{K}{2}$ pairwise boundaries are mutually consistent: for $K=3$ classes in 2D, they meet at a common point in input space. The decision regions form a Voronoi-like partition with boundaries radiating from that common center.
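This common-intersection property can be checked numerically. The sketch below uses made-up multinomial weights for three classes in 2D, solves two pairwise boundaries for their intersection, and verifies that the third boundary passes through the same point:

```python
import numpy as np

# Hypothetical multinomial weights for K=3 classes in 2D
W = np.array([[ 1.0,  0.0],
              [-0.5,  1.0],
              [-0.5, -1.0]])
b = np.array([0.0, 0.2, -0.2])

# Boundary between classes k and l: (w_k - w_l)^T x + (b_k - b_l) = 0
# Solve the 0-1 and 0-2 boundaries as a 2x2 linear system:
A = np.array([W[0] - W[1], W[0] - W[2]])
c = -np.array([b[0] - b[1], b[0] - b[2]])
x_star = np.linalg.solve(A, c)

# The 1-2 boundary passes through the same point:
residual = (W[1] - W[2]) @ x_star + (b[1] - b[2])
print(x_star, residual)  # residual is ~0
```

The third boundary is automatically satisfied because the three pairwise difference equations sum to zero, so any point on two of the boundaries lies on the third.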
OvA Decision Boundaries
In OvA, the boundary between classes $k$ and $l$ occurs where their scores are equal: $$f_k(\mathbf{x}) = f_l(\mathbf{x})$$
For logistic regression: $z_k = z_l$, giving: $$(\mathbf{w}_k - \mathbf{w}_l)^T \mathbf{x} + (b_k - b_l) = 0$$
This looks identical—but the parameter values differ because of how they were trained.
Key Geometric Differences
Boundary Orientation: OvA boundaries may not pass through a common point, creating more 'irregular' decision regions.
Undefined Regions: OvA can create regions where multiple classifiers claim the point (all say 'yes') or no classifier claims it (all say 'no'). The max-score rule resolves these ambiguities.
Margin Properties: Each OvA classifier has its own margin (defined relative to class $k$ vs. all others). These margins aren't coordinated—one class might have wide margins while another is tightly bounded.
The 'Ties' Issue
In principle, ties can occur where $f_k(\mathbf{x}) = f_l(\mathbf{x})$. In practice, exact ties have probability zero for continuous-valued features, and implementations break any that do occur arbitrarily (e.g., by lowest class index).
In 2D with 3 classes, multinomial boundaries typically form a 'Y' shape meeting at one point. OvA boundaries might form an 'irregular Y' that doesn't meet perfectly, or might create a small triangular region where the max-score decision differs from what a multinomial model would produce.
OvA inherently creates imbalanced training problems—a critical issue that requires careful handling.
The Inherent Imbalance
For classifier $f_k$ distinguishing class $k$ from all others, the positives are the $n_k$ examples of class $k$ and the negatives are the remaining $n - n_k$ examples. With $K$ balanced classes (each with $n/K$ samples), every binary problem has $n/K$ positives against $n(K-1)/K$ negatives, an imbalance ratio of $(K-1):1$. For $K=10$ classes: 9:1 imbalance. For $K=100$: 99:1 imbalance!
Effects of Imbalance:
| Number of Classes (K) | Imbalance Ratio | Effect |
|---|---|---|
| 2 | 1:1 | No imbalance (standard binary) |
| 3 | 2:1 | Mild imbalance |
| 10 | 9:1 | Significant imbalance |
| 100 | 99:1 | Severe imbalance |
| 1000 | 999:1 | Extreme imbalance |
Mitigation Strategies
1. Class Weighting
Weight positive examples higher to balance effective sample sizes: $$w_{+} = \frac{n - n_k}{n_k}, \quad w_{-} = 1$$
This makes the effective positive count equal to negative count.
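A quick arithmetic check of the weighting formula, using hypothetical counts ($n = 1000$ samples, $K = 10$ balanced classes):

```python
# Hypothetical counts: n total samples, n_k positives for class k
n, n_k = 1000, 100

w_pos = (n - n_k) / n_k   # weight for positive examples of class k
w_neg = 1.0               # weight for negatives

# Effective sample sizes are now equal:
print(n_k * w_pos, (n - n_k) * w_neg)  # 900.0 900.0
```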
2. Resampling
Undersample the negative ('rest') class or oversample the positive class so that each binary problem is roughly balanced.
3. Threshold Adjustment
Instead of predicting positive when $f_k(\mathbf{x}) > 0.5$, lower the threshold: $$\text{Predict class } k \text{ if } f_k(\mathbf{x}) > \tau_k$$
Set $\tau_k$ to achieve desired precision/recall balance, e.g., using ROC analysis on validation set.
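A small sketch of picking $\tau_k$ on validation data. The scores and labels are hypothetical, and the selection criterion here is F1, though any precision/recall target works:

```python
import numpy as np

# Hypothetical validation scores and binary labels for classifier f_k
scores = np.array([0.9, 0.7, 0.6, 0.4, 0.35, 0.2, 0.1])
y_val  = np.array([1,   1,   0,   1,   0,    0,   0  ])

# Sweep candidate thresholds; keep the one maximizing F1 on validation data
best_tau, best_f1 = 0.5, -1.0
for tau in np.unique(scores):
    pred = (scores >= tau).astype(int)
    tp = ((pred == 1) & (y_val == 1)).sum()
    fp = ((pred == 1) & (y_val == 0)).sum()
    fn = ((pred == 0) & (y_val == 1)).sum()
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    if f1 > best_f1:
        best_f1, best_tau = f1, tau

print(best_tau, round(best_f1, 3))
```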
4. Platt Scaling
Post-hoc recalibration of probabilities using logistic regression on held-out validation set outputs.
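A minimal NumPy sketch of the idea, fitting $P(y=1 \mid s) = \sigma(as + b)$ to hypothetical held-out scores; production code would typically use a library calibrator on a proper validation split:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def platt_scale(scores, labels, lr=0.1, steps=2000):
    """Fit P(y=1|s) = sigmoid(a*s + b) by gradient descent on log-loss."""
    a, b = 1.0, 0.0
    for _ in range(steps):
        p = sigmoid(a * scores + b)
        grad_a = ((p - labels) * scores).mean()
        grad_b = (p - labels).mean()
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

# Hypothetical held-out decision scores and true binary labels
scores = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
labels = np.array([0, 0, 0, 1, 1, 1])

a, b = platt_scale(scores, labels)
calibrated = sigmoid(a * scores + b)
print(calibrated.round(3))
```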
Multinomial logistic regression doesn't suffer from artificial imbalance—it models all classes jointly using their natural frequencies. This is one reason modern libraries often default to 'multinomial' rather than 'ovr' for multi-class logistic regression.
When is OvA preferable to multinomial, and vice versa? Let's examine the theoretical tradeoffs.
Consistency Analysis
A classifier is Bayes consistent if it converges to the Bayes optimal classifier as training data grows to infinity.
Result (Rifkin & Klautau, 2004): Under mild conditions, OvA with consistent binary classifiers is also consistent for multi-class classification.
Intuitively: If each $f_k$ correctly identifies class $k$ vs. others, the max-score rule correctly identifies the true class.
However: Consistency is an asymptotic property. In finite samples, the approaches can differ significantly.
Error Decomposition
For OvA, two types of errors occur: the correct classifier $f_y$ can fail to fire (a false negative for the true class), or an incorrect classifier $f_k$ with $k \neq y$ can fire more strongly (a false positive that wins the argmax). Each binary classifier minimizes its own error, with no mechanism to trade off these failure modes across classifiers.
Multinomial logistic regression, by modeling all classes jointly, optimizes a single coherent objective that balances these concerns.
Empirical Findings
Research (Rifkin & Klautau, 2004; Hsu & Lin, 2002) shows:

- For well-calibrated base classifiers (logistic regression), multinomial is often slightly better due to joint optimization.
- For margin-based classifiers (SVM), OvA often performs comparably or better because SVM focuses on margin, not probability calibration.
- As $K$ increases, the imbalance problem in OvA becomes more severe, favoring multinomial.
- For hierarchical class structures, OvA can be more natural (e.g., one classifier for 'animal' vs 'vehicle', then refine).
sklearn Defaults:
- `LogisticRegression`: defaults to `multi_class='auto'`, which chooses `'multinomial'` for the lbfgs/sag solvers and `'ovr'` for liblinear.
- `SVC`: `decision_function_shape='ovr'` by default.

For completeness, let's briefly examine one-vs-one (OvO), another decomposition strategy.
OvO Strategy
Train a binary classifier $f_{kl}$ for each unordered pair of classes, $\binom{K}{2} = K(K-1)/2$ classifiers in total.
Training: For classifier $f_{kl}$, use only examples where $y \in \{k, l\}$.
Prediction (Voting):
For a new input $\mathbf{x}$, each classifier $f_{kl}$ votes for either $k$ or $l$: $$\hat{y} = \arg\max_k \sum_{l \neq k} \mathbf{1}[f_{kl}(\mathbf{x}) \text{ votes for } k]$$
The class with the most votes wins.
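A toy sketch of the voting rule for $K=3$, with hypothetical pairwise outcomes:

```python
import numpy as np

K = 3
# Hypothetical outcomes: f_kl(x) returns the winner of each pairwise contest
pair_votes = {(0, 1): 0, (0, 2): 2, (1, 2): 2}

votes = np.zeros(K, dtype=int)
for (k, l), winner in pair_votes.items():
    votes[winner] += 1

y_hat = int(np.argmax(votes))  # class with the most pairwise wins
print(votes, y_hat)            # [1 0 2] 2
```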
| Aspect | One-vs-All (OvA) | One-vs-One (OvO) |
|---|---|---|
| Number of classifiers | $K$ | $K(K-1)/2$ |
| Training set per classifier | $n$ (all samples) | $n_k + n_l$ (2 classes) |
| Inherent imbalance | Yes (severe for large K) | No (each is 2-class) |
| Training cost | $O(K \cdot n \cdot d)$ | $O(K^2 \cdot \bar{n} \cdot d)$ where $\bar{n} \ll n$ |
| Prediction cost | $O(K)$ | $O(K^2)$ |
| Memory | $O(K \cdot d)$ | $O(K^2 \cdot d)$ |
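The table's classifier counts are easy to tabulate:

```python
# Number of binary classifiers trained by each decomposition
counts = {K: (K, K * (K - 1) // 2) for K in (3, 10, 100)}
for K, (ova, ovo) in counts.items():
    print(f"K={K:>3}: OvA={ova:>3}, OvO={ovo}")
```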
When to Use OvO:
Base classifier scales poorly with $n$: Each OvO classifier trains on much less data. For SVMs with $O(n^2)$ to $O(n^3)$ complexity, OvO can be faster overall.
Balanced problems preferred: OvO naturally creates balanced binary problems.
K is small: The $O(K^2)$ classifier count is acceptable for small $K$.
Challenges: prediction requires evaluating all $K(K-1)/2$ classifiers; votes can tie (implementations often break ties using summed decision values); and well-calibrated probability estimates are harder to obtain.
sklearn SVM:
SVC uses OvO internally by default (the 'ovo' implementation from libsvm).
Error-Correcting Output Codes (ECOC)
A generalization of OvA and OvO: classes are encoded as binary vectors (codewords), and each column of the code matrix defines a binary classification task. Prediction decodes by finding the closest codeword. ECOC can provide error-correction properties, but is more complex to design and implement.
Let's consolidate practical guidance for implementing multi-class classification.
Decision Flowchart:
Is base learner logistic regression or neural network? → Use multinomial (softmax) by default. It's simpler, avoids imbalance, better calibrated.
Is base learner SVM? → Consider OvO (sklearn's default) for small-to-medium $K$; consider OvA for large $K$, where OvO's $O(K^2)$ classifier count becomes expensive.
Need to incrementally add classes? → OvA is more modular (train one new classifier)
Need parallel training across machines? → OvA is embarrassingly parallel
Very large K (1000+ classes)? → Multinomial with hierarchical softmax or approximate methods
```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
import numpy as np


def compare_multiclass_strategies(X, y, cv=5):
    """
    Compare different multi-class strategies on a dataset.

    Args:
        X: Features
        y: Multi-class labels
        cv: Cross-validation folds
    """
    # Standardize features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    results = {}

    # 1. Multinomial Logistic Regression
    lr_multi = LogisticRegression(
        multi_class='multinomial', solver='lbfgs',
        max_iter=1000, random_state=42
    )
    scores = cross_val_score(lr_multi, X_scaled, y, cv=cv)
    results['LR Multinomial'] = scores
    print(f"LR Multinomial:    {scores.mean():.4f} (+/- {scores.std()*2:.4f})")

    # 2. OvR Logistic Regression
    lr_ovr = LogisticRegression(
        multi_class='ovr', solver='lbfgs',
        max_iter=1000, random_state=42
    )
    scores = cross_val_score(lr_ovr, X_scaled, y, cv=cv)
    results['LR OvR'] = scores
    print(f"LR OvR:            {scores.mean():.4f} (+/- {scores.std()*2:.4f})")

    # 3. OvR with class weights for balance
    lr_ovr_balanced = LogisticRegression(
        multi_class='ovr', solver='lbfgs', max_iter=1000,
        class_weight='balanced', random_state=42
    )
    scores = cross_val_score(lr_ovr_balanced, X_scaled, y, cv=cv)
    results['LR OvR Balanced'] = scores
    print(f"LR OvR (balanced): {scores.mean():.4f} (+/- {scores.std()*2:.4f})")

    # 4. SVM with OvO (default)
    svm_ovo = SVC(kernel='rbf', random_state=42)
    scores = cross_val_score(svm_ovo, X_scaled, y, cv=cv)
    results['SVM OvO'] = scores
    print(f"SVM OvO:           {scores.mean():.4f} (+/- {scores.std()*2:.4f})")

    # 5. SVM with OvR (wrapped)
    svm_ovr = OneVsRestClassifier(SVC(kernel='rbf', random_state=42))
    scores = cross_val_score(svm_ovr, X_scaled, y, cv=cv)
    results['SVM OvR'] = scores
    print(f"SVM OvR:           {scores.mean():.4f} (+/- {scores.std()*2:.4f})")

    return results


# Example
if __name__ == "__main__":
    from sklearn.datasets import load_digits

    # Load multi-class dataset (10 classes: digits 0-9)
    digits = load_digits()
    X, y = digits.data, digits.target

    print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features, "
          f"{len(np.unique(y))} classes")
    print()

    results = compare_multiclass_strategies(X, y)

    print("=== Key Observations ===")
    print("- Multinomial typically performs best for logistic regression")
    print("- OvR with balanced weights helps with imbalanced decomposition")
    print("- SVM OvO vs OvR depends on dataset; OvO often slightly better")
```

We have thoroughly explored the one-vs-all strategy for multi-class classification: train $K$ independent binary classifiers, one per class; predict by maximum score; mind the inherent imbalance of the decomposed problems; and default to multinomial softmax for logistic regression unless OvA's modularity or parallelism is needed.
What's Next:
With the model, loss, and multi-class strategies established, we turn to computational considerations—the practical aspects of training multinomial logistic regression at scale, including optimization algorithms, mini-batch training, and efficiency considerations that enable handling millions of samples and thousands of classes.
You now understand the one-vs-all decomposition strategy—its construction, prediction methods, geometric properties, and tradeoffs compared to multinomial approaches. This knowledge enables informed choice of classification strategy based on problem characteristics.