Multinomial logistic regression elegantly handles all $K$ classes simultaneously through the softmax function. But there's an alternative philosophy: decompose the multi-class problem into multiple binary classification problems.
The one-vs-all (OvA) strategy—also known as one-vs-rest (OvR)—is the simplest and most widely used decomposition method. It trains $K$ independent binary classifiers, each distinguishing one class from all others.
This approach has historical significance (predating efficient multinomial methods), practical utility (embarrassingly parallel training), and theoretical interest (different inductive biases). Understanding OvA deepens appreciation for the design choices underlying multi-class classification.
By the end of this page, you will understand: how to construct and train OvA classifiers; prediction strategies for combining binary outputs; theoretical comparison with multinomial methods; the decision boundary geometry; handling of imbalanced decomposed problems; and when to prefer OvA versus softmax approaches.
Core Idea
Given $K$ classes $\{1, 2, \ldots, K\}$, train $K$ binary classifiers:
Each classifier $f_k$ answers: 'Does this example belong to class $k$ or not?'
Training Data Transformation
For training classifier $f_k$: $$y_i^{(k)} = \begin{cases} +1 & \text{if } y_i = k \\ -1 & \text{if } y_i \neq k \end{cases}$$
All original training examples are used, with relabeled targets. Classifier $f_k$ sees the $n_k$ examples of class $k$ as positives and the remaining $n - n_k$ examples as negatives.
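As a quick sketch with hypothetical labels, this relabeling is a single vectorized comparison in NumPy:

```python
import numpy as np

# Hypothetical labels for a 4-class problem
y = np.array([0, 2, 1, 3, 2, 0, 1])

# Relabel for classifier f_k with k = 2: +1 for class 2, -1 for the rest
k = 2
y_k = np.where(y == k, 1, -1)
print(y_k)  # [-1  1 -1 -1  1 -1 -1]
```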
Base Classifier Choice
OvA works with any binary classifier: logistic regression, support vector machines, decision trees, or neural networks.
The choice affects both training efficiency and prediction strategy.
OvA with Logistic Regression
Each classifier $f_k$ is a binary logistic regression: $$f_k(\mathbf{x}) = \sigma(\mathbf{w}_k^T \mathbf{x} + b_k) = \frac{1}{1 + e^{-(\mathbf{w}_k^T \mathbf{x} + b_k)}}$$
This outputs $P(y = k | \mathbf{x})$ under the model that assumes a binary world of class $k$ vs. 'everything else.'
Parameter Count:
With $d$ features, multinomial logistic regression (using one class as the reference) has $(K-1)(d+1)$ parameters, while OvA has $K(d+1)$: one weight vector and one bias per classifier. OvA has slightly more parameters, but they're estimated independently.
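A tiny arithmetic check, under the assumption that the multinomial model fixes one class as a reference (the $d$ and $K$ values are hypothetical):

```python
# Hypothetical sizes: d features, K classes, bias terms included
d, K = 10, 4

multinomial = (K - 1) * (d + 1)  # reference-class parameterization
ova = K * (d + 1)                # one weight vector + bias per binary classifier

print(multinomial, ova)  # 33 44
```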
OvA trains $K$ classifiers independently—no shared information flow. This enables parallel training but means each classifier is oblivious to the task of distinguishing between non-target classes. The multinomial approach, by contrast, jointly learns all class relationships.
Given $K$ trained classifiers $\{f_1, \ldots, f_K\}$, how do we predict the class for a new input $\mathbf{x}$? Several strategies exist.
Strategy 1: Maximum Score (Winner-Take-All)
Predict the class whose classifier outputs the highest score:
$$\hat{y} = \arg\max_{k \in \{1, \ldots, K\}} f_k(\mathbf{x})$$
For logistic regression classifiers, $f_k(\mathbf{x}) = \sigma(z_k)$ where $z_k = \mathbf{w}_k^T \mathbf{x} + b_k$.
Since $\sigma$ is monotonic, this is equivalent to:
$$\hat{y} = \arg\max_{k} z_k = \arg\max_{k} (\mathbf{w}_k^T \mathbf{x} + b_k)$$
Observation: This is exactly the same prediction rule as multinomial logistic regression! However, the parameters $\mathbf{w}_k$ are estimated differently.
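A minimal sketch with made-up weights, showing that winner-take-all over sigmoid outputs reduces to an argmax over the raw logits:

```python
import numpy as np

# Hypothetical learned parameters for K=3 OvA classifiers, d=2 features
W = np.array([[ 1.0, -0.5],
              [-0.3,  0.8],
              [ 0.2,  0.1]])     # shape (K, d)
b = np.array([0.1, -0.2, 0.0])

x = np.array([0.5, 1.0])

z = W @ x + b                    # one logit z_k per classifier
y_hat = int(np.argmax(z))        # sigma is monotonic, so argmax over
                                 # sigma(z) equals argmax over z
print(y_hat)  # 1
```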
Strategy 2: Probability Normalization
The raw outputs $f_k(\mathbf{x})$ don't sum to 1 (each is an independent binary probability). To get a proper distribution, normalize:
$$P_{\text{OvA}}(y=k|\mathbf{x}) = \frac{f_k(\mathbf{x})}{\sum_{j=1}^{K} f_j(\mathbf{x})}$$
Warning: This normalization is ad-hoc. The binary probabilities aren't designed to be normalized—they answer different questions. Use with caution.
Strategy 3: Softmax on Logits
Apply softmax to the raw logits from each classifier:
$$P_{\text{OvA-softmax}}(y=k|\mathbf{x}) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$
where $z_k = \mathbf{w}_k^T \mathbf{x} + b_k$.
This produces a proper probability distribution and is commonly used when probability outputs are needed from OvA classifiers.
Note: The resulting probabilities are generally not the same as those from true multinomial logistic regression, even though the formula looks identical. The difference is in how $\mathbf{w}_k$ were trained.
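To see Strategies 2 and 3 side by side, here is a sketch with hypothetical logits from three OvA classifiers; both "probability" vectors sum to 1, yet they generally disagree:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical logits z_k = w_k^T x + b_k from K=3 OvA classifiers
z = np.array([2.0, -1.0, 0.5])

raw = sigmoid(z)                         # independent binary probabilities
print(round(raw.sum(), 3))               # generally != 1

normalized = raw / raw.sum()             # Strategy 2: ad-hoc normalization
softmaxed = np.exp(z) / np.exp(z).sum()  # Strategy 3: softmax on logits

print(normalized.round(3))
print(softmaxed.round(3))
```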
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report


class OneVsAllClassifier:
    """
    One-vs-All multi-class classifier using logistic regression.
    """
    def __init__(self, C=1.0, max_iter=1000):
        self.C = C
        self.max_iter = max_iter
        self.classifiers = []
        self.classes_ = None

    def fit(self, X, y):
        """
        Train K binary classifiers.

        Args:
            X: Features (n_samples, n_features)
            y: Labels (n_samples,)
        """
        self.classes_ = np.unique(y)
        K = len(self.classes_)
        self.classifiers = []

        for k, class_label in enumerate(self.classes_):
            # Create binary labels: class k = 1, all others = 0
            y_binary = (y == class_label).astype(int)

            # Train binary logistic regression
            clf = LogisticRegression(
                C=self.C, max_iter=self.max_iter,
                solver='lbfgs', random_state=42
            )
            clf.fit(X, y_binary)
            self.classifiers.append(clf)

            print(f"Trained classifier {k+1}/{K}: "
                  f"Class '{class_label}' vs rest")

        return self

    def predict(self, X):
        """
        Predict class labels using maximum score strategy.
        """
        scores = self.decision_function(X)
        return self.classes_[np.argmax(scores, axis=1)]

    def decision_function(self, X):
        """
        Get raw scores (logits) from each classifier.

        Returns:
            scores: (n_samples, K) matrix
        """
        scores = np.zeros((X.shape[0], len(self.classes_)))
        for k, clf in enumerate(self.classifiers):
            # Use decision function (raw logit) not probability
            scores[:, k] = clf.decision_function(X)
        return scores

    def predict_proba_raw(self, X):
        """
        Get raw probabilities from each binary classifier.
        Note: These don't sum to 1!
        """
        probs = np.zeros((X.shape[0], len(self.classes_)))
        for k, clf in enumerate(self.classifiers):
            probs[:, k] = clf.predict_proba(X)[:, 1]  # P(class k)
        return probs

    def predict_proba_normalized(self, X):
        """
        Get probabilities normalized to sum to 1.
        """
        probs = self.predict_proba_raw(X)
        return probs / probs.sum(axis=1, keepdims=True)

    def predict_proba_softmax(self, X):
        """
        Apply softmax to logits for probability output.
        """
        logits = self.decision_function(X)
        exp_logits = np.exp(logits - logits.max(axis=1, keepdims=True))
        return exp_logits / exp_logits.sum(axis=1, keepdims=True)


# Comparison: OvA vs Multinomial
def compare_ova_multinomial(X, y, test_size=0.2):
    """
    Compare OvA and multinomial logistic regression.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42, stratify=y
    )

    # Standardize
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # OvA classifier
    print("=== Training One-vs-All ===")
    ova = OneVsAllClassifier(C=1.0)
    ova.fit(X_train_scaled, y_train)
    ova_pred = ova.predict(X_test_scaled)
    ova_acc = accuracy_score(y_test, ova_pred)

    # Multinomial classifier
    print("=== Training Multinomial ===")
    multi = LogisticRegression(
        C=1.0, multi_class='multinomial', solver='lbfgs',
        max_iter=1000, random_state=42
    )
    multi.fit(X_train_scaled, y_train)
    multi_pred = multi.predict(X_test_scaled)
    multi_acc = accuracy_score(y_test, multi_pred)

    print("=== Results ===")
    print(f"OvA Accuracy:         {ova_acc:.4f}")
    print(f"Multinomial Accuracy: {multi_acc:.4f}")

    # Compare probability outputs
    print("=== Probability Comparison (first 3 samples) ===")
    ova_probs = ova.predict_proba_softmax(X_test_scaled[:3])
    multi_probs = multi.predict_proba(X_test_scaled[:3])
    print("OvA (softmax on logits):")
    print(ova_probs.round(3))
    print("Multinomial:")
    print(multi_probs.round(3))

    return ova, multi


# Example
if __name__ == "__main__":
    from sklearn.datasets import make_classification

    # Generate multi-class data
    X, y = make_classification(
        n_samples=1000, n_features=10, n_informative=5,
        n_redundant=2, n_classes=4, n_clusters_per_class=1,
        random_state=42
    )
    ova, multi = compare_ova_multinomial(X, y)
```

Both OvA and multinomial logistic regression produce linear decision boundaries, but with subtle differences in their structure.
Multinomial Decision Boundaries
Recall that the boundary between classes $k$ and $l$ in multinomial logistic regression is: $$(\mathbf{w}_k - \mathbf{w}_l)^T \mathbf{x} + (b_k - b_l) = 0$$
All $\binom{K}{2}$ pairwise boundaries are mutually consistent: for $K=3$ classes in 2D, they meet at a common point in input space. The decision regions form a Voronoi-like partition with boundaries radiating from that common center.
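This common-intersection property can be checked numerically. The sketch below uses made-up multinomial weights for three classes in 2D, solves two pairwise boundaries for their intersection, and verifies that the third boundary passes through the same point:

```python
import numpy as np

# Hypothetical multinomial weights for K=3 classes in 2D
W = np.array([[ 1.0,  0.0],
              [-0.5,  1.0],
              [-0.5, -1.0]])
b = np.array([0.0, 0.2, -0.2])

# Boundary between classes k and l: (w_k - w_l)^T x + (b_k - b_l) = 0
# Solve the 0-1 and 0-2 boundaries as a 2x2 linear system:
A = np.array([W[0] - W[1], W[0] - W[2]])
c = -np.array([b[0] - b[1], b[0] - b[2]])
x_star = np.linalg.solve(A, c)

# The 1-2 boundary passes through the same point:
residual = (W[1] - W[2]) @ x_star + (b[1] - b[2])
print(x_star, residual)  # residual is ~0
```

The third boundary is automatically satisfied because the three pairwise difference equations sum to zero, so any point on two of the boundaries lies on the third.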
OvA Decision Boundaries
In OvA, the boundary between classes $k$ and $l$ occurs where their scores are equal: $$f_k(\mathbf{x}) = f_l(\mathbf{x})$$
For logistic regression: $z_k = z_l$, giving: $$(\mathbf{w}_k - \mathbf{w}_l)^T \mathbf{x} + (b_k - b_l) = 0$$
This looks identical—but the parameter values differ because of how they were trained.
Key Geometric Differences
Boundary Orientation: OvA boundaries may not pass through a common point, creating more 'irregular' decision regions.
Undefined Regions: OvA can create regions where multiple classifiers claim the point (all say 'yes') or no classifier claims it (all say 'no'). The max-score rule resolves these ambiguities.
Margin Properties: Each OvA classifier has its own margin (defined relative to class $k$ vs. all others). These margins aren't coordinated—one class might have wide margins while another is tightly bounded.
The 'Ties' Issue
In principle, ties can occur where $f_k(\mathbf{x}) = f_l(\mathbf{x})$. In practice, exact ties have probability zero for continuous-valued features, and implementations break any that do occur arbitrarily (e.g., by lowest class index).
In 2D with 3 classes, multinomial boundaries typically form a 'Y' shape meeting at one point. OvA boundaries might form an 'irregular Y' that doesn't meet perfectly, or might create a small triangular region where the max-score decision differs from what a multinomial model would produce.
OvA inherently creates imbalanced training problems—a critical issue that requires careful handling.
The Inherent Imbalance
For classifier $f_k$ distinguishing class $k$ from all others, the positives are the $n_k$ examples of class $k$ and the negatives are the remaining $n - n_k$ examples. With $K$ balanced classes (each with $n/K$ samples), every binary problem has $n/K$ positives against $n(K-1)/K$ negatives, an imbalance ratio of $(K-1):1$. For $K=10$ classes: 9:1 imbalance. For $K=100$: 99:1 imbalance!
Effects of Imbalance:
| Number of Classes (K) | Imbalance Ratio | Effect |
|---|---|---|
| 2 | 1:1 | No imbalance (standard binary) |
| 3 | 2:1 | Mild imbalance |
| 10 | 9:1 | Significant imbalance |
| 100 | 99:1 | Severe imbalance |
| 1000 | 999:1 | Extreme imbalance |
Mitigation Strategies
1. Class Weighting
Weight positive examples higher to balance effective sample sizes: $$w_{+} = \frac{n - n_k}{n_k}, \quad w_{-} = 1$$
This makes the effective positive count equal to negative count.
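A quick arithmetic check of the weighting formula, using hypothetical counts ($n = 1000$ samples, $K = 10$ balanced classes):

```python
# Hypothetical counts: n total samples, n_k positives for class k
n, n_k = 1000, 100

w_pos = (n - n_k) / n_k   # weight for positive examples of class k
w_neg = 1.0               # weight for negatives

# Effective sample sizes are now equal:
print(n_k * w_pos, (n - n_k) * w_neg)  # 900.0 900.0
```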
2. Resampling
Undersample the negative ('rest') class or oversample the positive class so that each binary problem is roughly balanced.
3. Threshold Adjustment
Instead of predicting positive when $f_k(\mathbf{x}) > 0.5$, lower the threshold: $$\text{Predict class } k \text{ if } f_k(\mathbf{x}) > \tau_k$$
Set $\tau_k$ to achieve desired precision/recall balance, e.g., using ROC analysis on validation set.
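A small sketch of picking $\tau_k$ on validation data. The scores and labels are hypothetical, and the selection criterion here is F1, though any precision/recall target works:

```python
import numpy as np

# Hypothetical validation scores and binary labels for classifier f_k
scores = np.array([0.9, 0.7, 0.6, 0.4, 0.35, 0.2, 0.1])
y_val  = np.array([1,   1,   0,   1,   0,    0,   0  ])

# Sweep candidate thresholds; keep the one maximizing F1 on validation data
best_tau, best_f1 = 0.5, -1.0
for tau in np.unique(scores):
    pred = (scores >= tau).astype(int)
    tp = ((pred == 1) & (y_val == 1)).sum()
    fp = ((pred == 1) & (y_val == 0)).sum()
    fn = ((pred == 0) & (y_val == 1)).sum()
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    if f1 > best_f1:
        best_f1, best_tau = f1, tau

print(best_tau, round(best_f1, 3))
```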
4. Platt Scaling
Post-hoc recalibration of probabilities using logistic regression on held-out validation set outputs.
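A minimal NumPy sketch of the idea, fitting $P(y=1 \mid s) = \sigma(as + b)$ to hypothetical held-out scores; production code would typically use a library calibrator on a proper validation split:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def platt_scale(scores, labels, lr=0.1, steps=2000):
    """Fit P(y=1|s) = sigmoid(a*s + b) by gradient descent on log-loss."""
    a, b = 1.0, 0.0
    for _ in range(steps):
        p = sigmoid(a * scores + b)
        grad_a = ((p - labels) * scores).mean()
        grad_b = (p - labels).mean()
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

# Hypothetical held-out decision scores and true binary labels
scores = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
labels = np.array([0, 0, 0, 1, 1, 1])

a, b = platt_scale(scores, labels)
calibrated = sigmoid(a * scores + b)
print(calibrated.round(3))
```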
Multinomial logistic regression doesn't suffer from artificial imbalance—it models all classes jointly using their natural frequencies. This is one reason modern libraries often default to 'multinomial' rather than 'ovr' for multi-class logistic regression.
When is OvA preferable to multinomial, and vice versa? Let's examine the theoretical tradeoffs.
Consistency Analysis
A classifier is Bayes consistent if it converges to the Bayes optimal classifier as training data grows to infinity.
Result (Rifkin & Klautau, 2004): Under mild conditions, OvA with consistent binary classifiers is also consistent for multi-class classification.
Intuitively: If each $f_k$ correctly identifies class $k$ vs. others, the max-score rule correctly identifies the true class.
However: Consistency is an asymptotic property. In finite samples, the approaches can differ significantly.
Error Decomposition
For OvA, two types of errors occur: the correct classifier $f_y$ can fail to fire (a false negative for the true class), or an incorrect classifier $f_k$ with $k \neq y$ can fire more strongly (a false positive that wins the argmax). Each binary classifier minimizes its own error, with no mechanism to trade off these failure modes across classifiers.
Multinomial logistic regression, by modeling all classes jointly, optimizes a single coherent objective that balances these concerns.
Empirical Findings
Research (Rifkin & Klautau, 2004; Hsu & Lin, 2002) shows:

- For well-calibrated base classifiers (logistic regression), multinomial is often slightly better due to joint optimization.
- For margin-based classifiers (SVM), OvA often performs comparably or better because SVM focuses on margin, not probability calibration.
- As $K$ increases, the imbalance problem in OvA becomes more severe, favoring multinomial.
- For hierarchical class structures, OvA can be more natural (e.g., one classifier for 'animal' vs 'vehicle', then refine).
sklearn Defaults:
- `LogisticRegression`: defaults to `multi_class='auto'`, which chooses `'multinomial'` for the lbfgs/sag solvers and `'ovr'` for liblinear.
- `SVC`: `decision_function_shape='ovr'` by default.

For completeness, let's briefly examine one-vs-one (OvO), another decomposition strategy.
OvO Strategy
Train a binary classifier $f_{kl}$ for each unordered pair of classes, $\binom{K}{2} = K(K-1)/2$ classifiers in total.
Training: For classifier $f_{kl}$, use only examples where $y \in \{k, l\}$.
Prediction (Voting):
For a new input $\mathbf{x}$, each classifier $f_{kl}$ votes for either $k$ or $l$: $$\hat{y} = \arg\max_k \sum_{l \neq k} \mathbf{1}[f_{kl}(\mathbf{x}) \text{ votes for } k]$$
The class with the most votes wins.
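A toy sketch of the voting rule for $K=3$, with hypothetical pairwise outcomes:

```python
import numpy as np

K = 3
# Hypothetical outcomes: f_kl(x) returns the winner of each pairwise contest
pair_votes = {(0, 1): 0, (0, 2): 2, (1, 2): 2}

votes = np.zeros(K, dtype=int)
for (k, l), winner in pair_votes.items():
    votes[winner] += 1

y_hat = int(np.argmax(votes))  # class with the most pairwise wins
print(votes, y_hat)            # [1 0 2] 2
```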
| Aspect | One-vs-All (OvA) | One-vs-One (OvO) |
|---|---|---|
| Number of classifiers | $K$ | $K(K-1)/2$ |
| Training set per classifier | $n$ (all samples) | $n_k + n_l$ (2 classes) |
| Inherent imbalance | Yes (severe for large K) | No (each is 2-class) |
| Training cost | $O(K \cdot n \cdot d)$ | $O(K^2 \cdot \bar{n} \cdot d)$ where $\bar{n} \ll n$ |
| Prediction cost | $O(K)$ | $O(K^2)$ |
| Memory | $O(K \cdot d)$ | $O(K^2 \cdot d)$ |
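The table's classifier counts are easy to tabulate:

```python
# Number of binary classifiers trained by each decomposition
counts = {K: (K, K * (K - 1) // 2) for K in (3, 10, 100)}
for K, (ova, ovo) in counts.items():
    print(f"K={K:>3}: OvA={ova:>3}, OvO={ovo}")
```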
When to Use OvO:
Base classifier scales poorly with $n$: Each OvO classifier trains on much less data. For SVMs with $O(n^2)$ to $O(n^3)$ complexity, OvO can be faster overall.
Balanced problems preferred: OvO naturally creates balanced binary problems.
K is small: The $O(K^2)$ classifier count is acceptable for small $K$.
Challenges: prediction requires evaluating all $K(K-1)/2$ classifiers; votes can tie (implementations often break ties using summed decision values); and well-calibrated probability estimates are harder to obtain.
sklearn SVM:
SVC uses OvO internally by default (the 'ovo' implementation from libsvm).
Error-Correcting Output Codes (ECOC)
A generalization of OvA and OvO: classes are encoded as binary vectors (codewords), and each column of the code matrix defines a binary classification task. Prediction decodes by finding the closest codeword. ECOC can provide error-correction properties, but is more complex to design and implement.
Let's consolidate practical guidance for implementing multi-class classification.
Decision Flowchart:
Is base learner logistic regression or neural network? → Use multinomial (softmax) by default. It's simpler, avoids imbalance, better calibrated.
Is base learner SVM? → Consider OvO (sklearn's default) for small-to-medium $K$; consider OvA for large $K$, where OvO's $O(K^2)$ classifier count becomes expensive.
Need to incrementally add classes? → OvA is more modular (train one new classifier)
Need parallel training across machines? → OvA is embarrassingly parallel
Very large K (1000+ classes)? → Multinomial with hierarchical softmax or approximate methods
```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
import numpy as np


def compare_multiclass_strategies(X, y, cv=5):
    """
    Compare different multi-class strategies on a dataset.

    Args:
        X: Features
        y: Multi-class labels
        cv: Cross-validation folds
    """
    # Standardize features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    results = {}

    # 1. Multinomial Logistic Regression
    lr_multi = LogisticRegression(
        multi_class='multinomial', solver='lbfgs',
        max_iter=1000, random_state=42
    )
    scores = cross_val_score(lr_multi, X_scaled, y, cv=cv)
    results['LR Multinomial'] = scores
    print(f"LR Multinomial:    {scores.mean():.4f} (+/- {scores.std()*2:.4f})")

    # 2. OvR Logistic Regression
    lr_ovr = LogisticRegression(
        multi_class='ovr', solver='lbfgs',
        max_iter=1000, random_state=42
    )
    scores = cross_val_score(lr_ovr, X_scaled, y, cv=cv)
    results['LR OvR'] = scores
    print(f"LR OvR:            {scores.mean():.4f} (+/- {scores.std()*2:.4f})")

    # 3. OvR with class weights for balance
    lr_ovr_balanced = LogisticRegression(
        multi_class='ovr', solver='lbfgs', max_iter=1000,
        class_weight='balanced', random_state=42
    )
    scores = cross_val_score(lr_ovr_balanced, X_scaled, y, cv=cv)
    results['LR OvR Balanced'] = scores
    print(f"LR OvR (balanced): {scores.mean():.4f} (+/- {scores.std()*2:.4f})")

    # 4. SVM with OvO (default)
    svm_ovo = SVC(kernel='rbf', random_state=42)
    scores = cross_val_score(svm_ovo, X_scaled, y, cv=cv)
    results['SVM OvO'] = scores
    print(f"SVM OvO:           {scores.mean():.4f} (+/- {scores.std()*2:.4f})")

    # 5. SVM with OvR (wrapped)
    svm_ovr = OneVsRestClassifier(SVC(kernel='rbf', random_state=42))
    scores = cross_val_score(svm_ovr, X_scaled, y, cv=cv)
    results['SVM OvR'] = scores
    print(f"SVM OvR:           {scores.mean():.4f} (+/- {scores.std()*2:.4f})")

    return results


# Example
if __name__ == "__main__":
    from sklearn.datasets import load_digits

    # Load multi-class dataset (10 classes: digits 0-9)
    digits = load_digits()
    X, y = digits.data, digits.target

    print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features, "
          f"{len(np.unique(y))} classes")
    print()

    results = compare_multiclass_strategies(X, y)

    print("=== Key Observations ===")
    print("- Multinomial typically performs best for logistic regression")
    print("- OvR with balanced weights helps with imbalanced decomposition")
    print("- SVM OvO vs OvR depends on dataset; OvO often slightly better")
```

We have thoroughly explored the one-vs-all strategy for multi-class classification: train $K$ independent binary classifiers, one per class; predict by maximum score; mind the inherent imbalance of the decomposed problems; and default to multinomial softmax for logistic regression unless OvA's modularity or parallelism is needed.
What's Next:
With the model, loss, and multi-class strategies established, we turn to computational considerations—the practical aspects of training multinomial logistic regression at scale, including optimization algorithms, mini-batch training, and efficiency considerations that enable handling millions of samples and thousands of classes.
You now understand the one-vs-all decomposition strategy—its construction, prediction methods, geometric properties, and tradeoffs compared to multinomial approaches. This knowledge enables informed choice of classification strategy based on problem characteristics.