A classifier achieves 90% accuracy on a binary classification task. Impressive? Perhaps not—if the dataset is 90% negative class, a trivial classifier that always predicts 'negative' achieves the same 90% accuracy with zero discriminative ability.
This fundamental limitation of accuracy motivates Cohen's kappa (κ), a statistic that measures agreement between predictions and ground truth while accounting for agreement expected by chance. Originally developed for inter-rater reliability in psychology, kappa has become essential for evaluating classifiers on imbalanced datasets where raw accuracy is misleading.
By the end of this page, you will derive and compute Cohen's kappa, understand its probabilistic interpretation, recognize when kappa is superior to accuracy, interpret kappa values using established guidelines, and apply kappa in multi-class scenarios.
Cohen's kappa quantifies how much better the classifier agrees with the ground truth than would be expected by chance alone. Here "chance" does not mean uniform random guessing: it is the agreement that would occur if predictions were statistically independent of the truth while keeping both sets of class proportions, and kappa normalizes the observed agreement by this baseline.
κ = (pₒ - pₑ) / (1 - pₑ)
Where:

• pₒ = observed agreement (proportion of cases where prediction = truth)
• pₑ = expected agreement by chance (if predictions were independent of truth)
The numerator measures how much better than chance you are. The denominator normalizes by the maximum possible improvement over chance.
Deriving expected agreement (pₑ):
If predictions and ground truth were statistically independent, the probability of both agreeing on class k would be:
P(pred=k, truth=k | independent) = P(pred=k) × P(truth=k)
For all classes:
pₑ = Σₖ P(pred=k) × P(truth=k)
These marginal probabilities are estimated from the confusion matrix C (rows = truth, columns = prediction):

P(truth=k) ≈ row_sumₖ / N        P(pred=k) ≈ col_sumₖ / N

where N is the total number of samples.
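Before the full implementation, a minimal worked sketch helps: it recomputes the opening example (90% negative class, a classifier that always predicts negative) and shows why its 90% accuracy collapses to κ = 0. The confusion matrix values are illustrative.

```python
import numpy as np

# Trivial "always negative" classifier on a 90% negative dataset
# (rows = truth, columns = prediction):
#               pred_neg  pred_pos
#   truth_neg      90        0
#   truth_pos      10        0
C = np.array([[90, 0],
              [10, 0]])
N = C.sum()                      # 100 samples

p_o = np.trace(C) / N            # observed agreement = 0.90 (the "impressive" accuracy)
p_truth = C.sum(axis=1) / N      # P(truth=k) = [0.9, 0.1]
p_pred = C.sum(axis=0) / N       # P(pred=k)  = [1.0, 0.0]
p_e = np.sum(p_truth * p_pred)   # expected agreement = 0.9*1.0 + 0.1*0.0 = 0.90

kappa = (p_o - p_e) / (1 - p_e)  # (0.90 - 0.90) / 0.10 = 0.0
print(p_o, p_e, kappa)           # 0.9 0.9 0.0
```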
Step-by-step computation:
```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix
from typing import Tuple


def compute_kappa_manual(y_true: np.ndarray, y_pred: np.ndarray) -> Tuple[float, dict]:
    """
    Compute Cohen's kappa with full derivation details.
    Returns kappa and intermediate values for educational purposes.
    """
    # Build confusion matrix
    classes = np.unique(np.concatenate([y_true, y_pred]))
    K = len(classes)
    N = len(y_true)
    C = confusion_matrix(y_true, y_pred, labels=classes)

    # Observed agreement: diagonal sum / total
    p_o = np.trace(C) / N

    # Expected agreement under independence
    # P(pred=k) = column sum / N
    # P(truth=k) = row sum / N
    row_sums = C.sum(axis=1)  # P(truth=k) * N
    col_sums = C.sum(axis=0)  # P(pred=k) * N
    p_e = np.sum(row_sums * col_sums) / (N * N)

    # Cohen's kappa
    if p_e == 1.0:
        kappa = 1.0 if p_o == 1.0 else 0.0
    else:
        kappa = (p_o - p_e) / (1 - p_e)

    details = {
        'confusion_matrix': C,
        'n_samples': N,
        'n_classes': K,
        'observed_agreement': p_o,
        'expected_agreement': p_e,
        'improvement_over_chance': p_o - p_e,
        'max_possible_improvement': 1 - p_e,
        'kappa': kappa
    }
    return kappa, details


def compare_kappa_vs_accuracy():
    """
    Demonstrate when kappa differs significantly from accuracy.
    """
    # Scenario 1: Balanced dataset, good classifier
    y_true_balanced = np.array([0]*100 + [1]*100)
    y_pred_balanced = np.array([0]*90 + [1]*10 + [1]*85 + [0]*15)

    # Scenario 2: Imbalanced dataset, trivial classifier
    y_true_imbalanced = np.array([0]*180 + [1]*20)
    y_pred_trivial = np.array([0]*200)  # Always predict majority

    # Scenario 3: Imbalanced dataset, good classifier
    y_true_imbalanced2 = np.array([0]*180 + [1]*20)
    y_pred_good = np.array([0]*170 + [1]*10 + [1]*18 + [0]*2)

    scenarios = [
        ("Balanced + Good", y_true_balanced, y_pred_balanced),
        ("Imbalanced + Trivial", y_true_imbalanced, y_pred_trivial),
        ("Imbalanced + Discriminative", y_true_imbalanced2, y_pred_good),
    ]

    print("Kappa vs Accuracy Comparison:")
    print("=" * 65)
    print(f"{'Scenario':<28} {'Accuracy':<12} {'Kappa':<12} {'p_e':<12}")
    print("-" * 65)

    for name, y_t, y_p in scenarios:
        acc = np.mean(y_t == y_p)
        kappa, details = compute_kappa_manual(y_t, y_p)
        print(f"{name:<28} {acc:<12.3f} {kappa:<12.3f} {details['expected_agreement']:<12.3f}")

    print("-" * 65)
    print("Note: Trivial classifier has high accuracy but kappa ≈ 0")


if __name__ == "__main__":
    compare_kappa_vs_accuracy()
```

Cohen's kappa ranges from -1 to +1, with distinct interpretive regions. Several widely cited interpretation scales exist, though the original Landis & Koch (1977) scale remains the most common.
| Kappa Range | Interpretation | Practical Meaning |
|---|---|---|
| κ < 0 | Less than chance | Classifier is systematically wrong (possibly inverted labels) |
| 0.00 – 0.20 | Slight agreement | Minimal practical utility; little better than guessing |
| 0.21 – 0.40 | Fair agreement | Some discriminative power, but unreliable |
| 0.41 – 0.60 | Moderate agreement | Reasonable for screening, insufficient for critical decisions |
| 0.61 – 0.80 | Substantial agreement | Good for most applications |
| 0.81 – 1.00 | Almost perfect agreement | Excellent; suitable for high-stakes decisions |
These guidelines are heuristics, not absolute standards. A κ = 0.6 may be unacceptable for medical diagnosis but excellent for sentiment analysis. Always interpret kappa relative to:
• Task difficulty (harder tasks have lower expected kappa)
• Baseline methods (compare to previous approaches)
• Cost of errors (high-stakes applications need higher kappa)
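If you report kappa routinely, the table above can be encoded as a small lookup helper. A minimal sketch follows; the function name interpret_kappa is illustrative, and the thresholds simply restate the Landis & Koch bands.

```python
def interpret_kappa(kappa: float) -> str:
    """Map a kappa value to its Landis & Koch (1977) descriptive band."""
    if kappa < 0:
        return "less than chance agreement"
    # Upper bound of each band, paired with its label
    bands = [
        (0.20, "slight agreement"),
        (0.40, "fair agreement"),
        (0.60, "moderate agreement"),
        (0.80, "substantial agreement"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect agreement"


print(interpret_kappa(0.55))  # moderate agreement
print(interpret_kappa(0.85))  # almost perfect agreement
```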
Special kappa values:

• κ = 1: perfect agreement (pₒ = 1)
• κ = 0: agreement exactly at the chance level (pₒ = pₑ), which is what a trivial majority-class classifier achieves
• κ < 0: agreement below chance, usually a sign of systematically inverted or shifted predictions
Relationship to accuracy:

Observed agreement pₒ is exactly the classification accuracy, so kappa is accuracy rescaled against the chance baseline: κ = (accuracy - pₑ) / (1 - pₑ). Kappa can therefore never exceed accuracy (the two coincide only when pₑ = 0 or accuracy is perfect), and the gap between them typically widens as the class distribution becomes more imbalanced, because pₑ grows.
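A quick sketch verifying this identity numerically: the manually chance-corrected accuracy matches sklearn's cohen_kappa_score. The toy labels are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score, confusion_matrix

# Illustrative toy labels
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 0, 0, 0, 1, 2, 1, 1, 0, 2])

acc = accuracy_score(y_true, y_pred)             # this is p_o
C = confusion_matrix(y_true, y_pred)
N = C.sum()
p_e = np.sum(C.sum(axis=1) * C.sum(axis=0)) / (N * N)

kappa_manual = (acc - p_e) / (1 - p_e)
kappa_sklearn = cohen_kappa_score(y_true, y_pred)

print(round(kappa_manual, 6) == round(kappa_sklearn, 6))  # True
```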
The key advantage of kappa is its chance correction. This makes it particularly valuable in several scenarios where raw accuracy fails.
```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, accuracy_score


def demonstrate_kappa_advantage():
    """Show situations where kappa provides critical insights."""
    # Scenario: Fraud detection (1% fraud rate)
    np.random.seed(42)
    n = 10000

    # Ground truth: 1% fraud
    y_true = np.array([0] * 9900 + [1] * 100)

    # Model A: Always predicts non-fraud (trivial)
    y_pred_trivial = np.zeros(n)

    # Model B: Catches 50% of fraud with 2% false positive rate
    y_pred_good = np.zeros(n)
    y_pred_good[9700:9900] = 1  # 200 false positives (2%)
    y_pred_good[9900:9950] = 1  # 50 true positives (50% recall)

    print("Fraud Detection Comparison (1% fraud rate):")
    print("=" * 55)
    print(f"{'Model':<25} {'Accuracy':<12} {'Kappa':<12}")
    print("-" * 55)

    acc_t = accuracy_score(y_true, y_pred_trivial)
    kap_t = cohen_kappa_score(y_true, y_pred_trivial)
    print(f"{'Trivial (always 0)':<25} {acc_t:<12.3f} {kap_t:<12.3f}")

    acc_g = accuracy_score(y_true, y_pred_good)
    kap_g = cohen_kappa_score(y_true, y_pred_good)
    print(f"{'Discriminative':<25} {acc_g:<12.3f} {kap_g:<12.3f}")

    print("-" * 55)
    print("\nInterpretation:")
    print(f"  Trivial model: {acc_t*100:.1f}% accuracy but κ={kap_t:.3f} (useless)")
    print(f"  Good model: {acc_g*100:.1f}% accuracy with κ={kap_g:.3f} (discriminates)")
    print("  Accuracy difference: {:.1f}% (misleading)".format((acc_t - acc_g) * 100))
    print("  Kappa difference: {:.3f} (reveals true value)".format(kap_g - kap_t))


if __name__ == "__main__":
    demonstrate_kappa_advantage()
```

Cohen's kappa extends naturally to multi-class problems. The same formula applies, but expected agreement now accounts for all K classes.
For K classes:
pₑ = Σₖ P(pred=k) × P(truth=k) = Σₖ (col_sumₖ/N) × (row_sumₖ/N)
The computation remains identical—just sum over more classes.
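As a brief illustration, the following sketch computes multi-class kappa by hand on a made-up 3-class problem and cross-checks it against sklearn; the label counts are illustrative.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

# Illustrative 3-class example (e.g., negative / neutral / positive)
y_true = np.array([0]*50 + [1]*30 + [2]*20)
y_pred = np.concatenate([
    [0]*42 + [1]*5 + [2]*3,    # class 0: 42 correct
    [1]*21 + [0]*6 + [2]*3,    # class 1: 21 correct
    [2]*14 + [0]*3 + [1]*3,    # class 2: 14 correct
])

C = confusion_matrix(y_true, y_pred)
N = C.sum()
p_o = np.trace(C) / N                                  # observed agreement
p_e = np.sum(C.sum(axis=1) * C.sum(axis=0)) / (N * N)  # sum over all 3 classes

kappa_manual = (p_o - p_e) / (1 - p_e)
print(f"p_o={p_o:.3f}  p_e={p_e:.3f}  kappa={kappa_manual:.3f}")
print(f"sklearn: {cohen_kappa_score(y_true, y_pred):.3f}")   # matches
```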
Weighted kappa for ordinal classes:
When classes have natural ordering (e.g., ratings 1-5), standard kappa treats all disagreements equally. Weighted kappa assigns different penalties based on the distance between the true and predicted class:

κ_w = 1 - (Σᵢⱼ wᵢⱼ Oᵢⱼ) / (Σᵢⱼ wᵢⱼ Eᵢⱼ)

where Oᵢⱼ is the observed count of samples with truth i and prediction j, Eᵢⱼ = row_sumᵢ × col_sumⱼ / N is the count expected under independence, and the weights are wᵢⱼ = |i - j| for linear weighting or wᵢⱼ = (i - j)² for quadratic weighting (any constant normalization of the weights cancels in the ratio).
Quadratic weighting penalizes large disagreements more heavily, making it preferred for ordinal scales.
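To make the weighting concrete, here is a minimal sketch, assuming the weight definitions above, that builds the distance-based weight matrix explicitly and checks the result against sklearn's cohen_kappa_score; the helper name weighted_kappa_manual and the rating arrays are illustrative. The fuller demo below then compares unweighted, linear, and quadratic kappa on synthetic rating data.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix


def weighted_kappa_manual(y_true, y_pred, power: int = 2) -> float:
    """Weighted kappa with w_ij = |i - j|**power (1 = linear, 2 = quadratic)."""
    labels = np.unique(np.concatenate([y_true, y_pred]))
    C = confusion_matrix(y_true, y_pred, labels=labels)     # observed counts O_ij
    N = C.sum()
    expected = np.outer(C.sum(axis=1), C.sum(axis=0)) / N   # counts expected under independence E_ij

    idx = np.arange(len(labels))
    W = np.abs(idx[:, None] - idx[None, :]) ** power        # distance-based penalty matrix
    return 1 - np.sum(W * C) / np.sum(W * expected)


# Illustrative ordinal labels (1-5 star ratings)
y_true = np.array([1, 2, 3, 4, 5, 5, 4, 3, 2, 1, 3, 3, 4, 2, 5])
y_pred = np.array([1, 2, 4, 4, 4, 5, 3, 3, 1, 1, 3, 2, 5, 2, 5])

print(round(weighted_kappa_manual(y_true, y_pred, power=2), 6))
print(round(cohen_kappa_score(y_true, y_pred, weights='quadratic'), 6))  # should match
```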
```python
import numpy as np
from sklearn.metrics import cohen_kappa_score


def compute_weighted_kappa(
        y_true: np.ndarray,
        y_pred: np.ndarray,
        weights: str = 'quadratic') -> float:
    """
    Compute weighted Cohen's kappa for ordinal classification.

    Parameters
    ----------
    y_true, y_pred : array-like
        True and predicted labels
    weights : {'linear', 'quadratic'}
        Weighting scheme for disagreements

    Returns
    -------
    weighted_kappa : float
    """
    return cohen_kappa_score(y_true, y_pred, weights=weights)


def demonstrate_weighted_kappa():
    """Show effect of weighting on ordinal classification."""
    np.random.seed(0)  # fix the seed so the demonstration is reproducible

    # Product rating prediction (1-5 stars)
    y_true = np.array([1, 2, 3, 4, 5, 1, 2, 3, 4, 5] * 20)

    # Prediction A: Off by 1 on most predictions
    y_pred_close = np.clip(y_true + np.random.choice([-1, 0, 1], len(y_true)), 1, 5)

    # Prediction B: Occasionally very wrong (1 predicted as 5)
    y_pred_extreme = y_true.copy()
    extreme_errors = np.random.choice(len(y_true), 20)
    y_pred_extreme[extreme_errors] = 6 - y_true[extreme_errors]  # Flip extremes

    print("Weighted Kappa for Ordinal Classification:")
    print("=" * 60)
    print(f"{'Prediction Type':<20} {'Unweighted':<15} {'Linear':<15} {'Quadratic':<15}")
    print("-" * 60)

    for name, y_p in [("Close errors", y_pred_close), ("Extreme errors", y_pred_extreme)]:
        k_none = cohen_kappa_score(y_true, y_p, weights=None)
        k_lin = cohen_kappa_score(y_true, y_p, weights='linear')
        k_quad = cohen_kappa_score(y_true, y_p, weights='quadratic')
        print(f"{name:<20} {k_none:<15.3f} {k_lin:<15.3f} {k_quad:<15.3f}")

    print("-" * 60)
    print("Note: Quadratic weights penalize extreme errors more severely")


if __name__ == "__main__":
    demonstrate_weighted_kappa()
```

Despite its advantages, Cohen's kappa has known limitations that practitioners should understand.
Known limitations:

• The kappa paradox: with highly imbalanced marginal distributions, kappa can be surprisingly low even when observed agreement is high, because pₑ is itself very high.
• Sensitivity to prevalence and marginal distributions: classifiers with similar error patterns can receive different kappa values on datasets with different class balance, which complicates cross-dataset comparisons.
• A single number: kappa summarizes overall agreement but does not reveal which classes are confused or whether errors are false positives or false negatives.

Alternatives to consider:

• Matthews correlation coefficient (MCC), another chance-aware single-number summary defined for binary and multi-class problems
• Balanced accuracy, which averages per-class recall and is easy to interpret on imbalanced data
• Krippendorff's alpha and Scott's pi, related chance-corrected agreement statistics from the inter-rater reliability literature
• Per-class precision, recall, and F1, when you need to know where the errors occur rather than a single summary score
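As a small sketch of how these alternatives behave, the snippet below re-scores the earlier fraud-detection example with kappa, MCC, and balanced accuracy from sklearn; the data setup is the same illustrative one used above.

```python
import numpy as np
from sklearn.metrics import (balanced_accuracy_score, cohen_kappa_score,
                             matthews_corrcoef)

# Reusing the 1%-fraud setup from the earlier demo (illustrative data)
y_true = np.array([0] * 9900 + [1] * 100)
y_pred = np.zeros(10000, dtype=int)
y_pred[9700:9900] = 1   # 200 false positives
y_pred[9900:9950] = 1   # 50 true positives

print(f"kappa:             {cohen_kappa_score(y_true, y_pred):.3f}")
print(f"MCC:               {matthews_corrcoef(y_true, y_pred):.3f}")
print(f"balanced accuracy: {balanced_accuracy_score(y_true, y_pred):.3f}")
```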
You now understand Cohen's kappa as a chance-corrected measure of agreement. Next, we'll explore the multi-class confusion matrix in depth—the foundation from which all multi-class metrics are derived.