A classifier achieves 90% accuracy on a binary classification task. Impressive? Perhaps not—if the dataset is 90% negative class, a trivial classifier that always predicts 'negative' achieves the same 90% accuracy with zero discriminative ability.
This fundamental limitation of accuracy motivates Cohen's kappa (κ), a statistic that measures agreement between predictions and ground truth while accounting for agreement expected by chance. Originally developed for inter-rater reliability in psychology, kappa has become essential for evaluating classifiers on imbalanced datasets where raw accuracy is misleading.
By the end of this page, you will derive and compute Cohen's kappa, understand its probabilistic interpretation, recognize when kappa is superior to accuracy, interpret kappa values using established guidelines, and apply kappa in multi-class scenarios.
Cohen's kappa quantifies how much better the classifier agrees with the ground truth than would be expected by chance alone. Here "chance" does not mean uniform random guessing: it is the agreement that would occur if predictions were statistically independent of the truth while keeping both sets of class proportions, and kappa normalizes the observed agreement by this baseline.
κ = (pₒ - pₑ) / (1 - pₑ)
Where:

• pₒ = observed agreement (proportion of cases where prediction = truth)
• pₑ = expected agreement by chance (if predictions were independent of truth)
The numerator measures how much better than chance you are. The denominator normalizes by the maximum possible improvement over chance.
Deriving expected agreement (pₑ):
If predictions and ground truth were statistically independent, the probability of both agreeing on class k would be:
P(pred=k, truth=k | independent) = P(pred=k) × P(truth=k)
For all classes:
pₑ = Σₖ P(pred=k) × P(truth=k)
These marginal probabilities are estimated from the confusion matrix C (rows = truth, columns = prediction):

P(truth=k) ≈ row_sumₖ / N        P(pred=k) ≈ col_sumₖ / N

where N is the total number of samples.
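Before the full implementation, a minimal worked sketch helps: it recomputes the opening example (90% negative class, a classifier that always predicts negative) and shows why its 90% accuracy collapses to κ = 0. The confusion matrix values are illustrative.

```python
import numpy as np

# Trivial "always negative" classifier on a 90% negative dataset
# (rows = truth, columns = prediction):
#               pred_neg  pred_pos
#   truth_neg      90        0
#   truth_pos      10        0
C = np.array([[90, 0],
              [10, 0]])
N = C.sum()                      # 100 samples

p_o = np.trace(C) / N            # observed agreement = 0.90 (the "impressive" accuracy)
p_truth = C.sum(axis=1) / N      # P(truth=k) = [0.9, 0.1]
p_pred = C.sum(axis=0) / N       # P(pred=k)  = [1.0, 0.0]
p_e = np.sum(p_truth * p_pred)   # expected agreement = 0.9*1.0 + 0.1*0.0 = 0.90

kappa = (p_o - p_e) / (1 - p_e)  # (0.90 - 0.90) / 0.10 = 0.0
print(p_o, p_e, kappa)           # 0.9 0.9 0.0
```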
Step-by-step computation:
```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix
from typing import Tuple


def compute_kappa_manual(y_true: np.ndarray, y_pred: np.ndarray) -> Tuple[float, dict]:
    """
    Compute Cohen's kappa with full derivation details.
    Returns kappa and intermediate values for educational purposes.
    """
    # Build confusion matrix
    classes = np.unique(np.concatenate([y_true, y_pred]))
    K = len(classes)
    N = len(y_true)
    C = confusion_matrix(y_true, y_pred, labels=classes)

    # Observed agreement: diagonal sum / total
    p_o = np.trace(C) / N

    # Expected agreement under independence
    # P(pred=k) = column sum / N
    # P(truth=k) = row sum / N
    row_sums = C.sum(axis=1)  # P(truth=k) * N
    col_sums = C.sum(axis=0)  # P(pred=k) * N
    p_e = np.sum(row_sums * col_sums) / (N * N)

    # Cohen's kappa
    if p_e == 1.0:
        kappa = 1.0 if p_o == 1.0 else 0.0
    else:
        kappa = (p_o - p_e) / (1 - p_e)

    details = {
        'confusion_matrix': C,
        'n_samples': N,
        'n_classes': K,
        'observed_agreement': p_o,
        'expected_agreement': p_e,
        'improvement_over_chance': p_o - p_e,
        'max_possible_improvement': 1 - p_e,
        'kappa': kappa
    }
    return kappa, details


def compare_kappa_vs_accuracy():
    """
    Demonstrate when kappa differs significantly from accuracy.
    """
    # Scenario 1: Balanced dataset, good classifier
    y_true_balanced = np.array([0]*100 + [1]*100)
    y_pred_balanced = np.array([0]*90 + [1]*10 + [1]*85 + [0]*15)

    # Scenario 2: Imbalanced dataset, trivial classifier
    y_true_imbalanced = np.array([0]*180 + [1]*20)
    y_pred_trivial = np.array([0]*200)  # Always predict majority

    # Scenario 3: Imbalanced dataset, good classifier
    y_true_imbalanced2 = np.array([0]*180 + [1]*20)
    y_pred_good = np.array([0]*170 + [1]*10 + [1]*18 + [0]*2)

    scenarios = [
        ("Balanced + Good", y_true_balanced, y_pred_balanced),
        ("Imbalanced + Trivial", y_true_imbalanced, y_pred_trivial),
        ("Imbalanced + Discriminative", y_true_imbalanced2, y_pred_good),
    ]

    print("Kappa vs Accuracy Comparison:")
    print("=" * 65)
    print(f"{'Scenario':<28} {'Accuracy':<12} {'Kappa':<12} {'p_e':<12}")
    print("-" * 65)

    for name, y_t, y_p in scenarios:
        acc = np.mean(y_t == y_p)
        kappa, details = compute_kappa_manual(y_t, y_p)
        print(f"{name:<28} {acc:<12.3f} {kappa:<12.3f} {details['expected_agreement']:<12.3f}")

    print("-" * 65)
    print("Note: Trivial classifier has high accuracy but kappa ≈ 0")


if __name__ == "__main__":
    compare_kappa_vs_accuracy()
```

Cohen's kappa ranges from -1 to +1, with distinct interpretive regions. Several widely cited interpretation scales exist, though the original Landis & Koch (1977) scale remains the most common.
| Kappa Range | Interpretation | Practical Meaning |
|---|---|---|
| κ < 0 | Less than chance | Classifier is systematically wrong (possibly inverted labels) |
| 0.00 – 0.20 | Slight agreement | Minimal practical utility; little better than guessing |
| 0.21 – 0.40 | Fair agreement | Some discriminative power, but unreliable |
| 0.41 – 0.60 | Moderate agreement | Reasonable for screening, insufficient for critical decisions |
| 0.61 – 0.80 | Substantial agreement | Good for most applications |
| 0.81 – 1.00 | Almost perfect agreement | Excellent; suitable for high-stakes decisions |
These guidelines are heuristics, not absolute standards. A κ = 0.6 may be unacceptable for medical diagnosis but excellent for sentiment analysis. Always interpret kappa relative to:
• Task difficulty (harder tasks have lower expected kappa)
• Baseline methods (compare to previous approaches)
• Cost of errors (high-stakes applications need higher kappa)
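If you report kappa routinely, the table above can be encoded as a small lookup helper. A minimal sketch follows; the function name interpret_kappa is illustrative, and the thresholds simply restate the Landis & Koch bands.

```python
def interpret_kappa(kappa: float) -> str:
    """Map a kappa value to its Landis & Koch (1977) descriptive band."""
    if kappa < 0:
        return "less than chance agreement"
    # Upper bound of each band, paired with its label
    bands = [
        (0.20, "slight agreement"),
        (0.40, "fair agreement"),
        (0.60, "moderate agreement"),
        (0.80, "substantial agreement"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect agreement"


print(interpret_kappa(0.55))  # moderate agreement
print(interpret_kappa(0.85))  # almost perfect agreement
```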
Special kappa values:

• κ = 1: perfect agreement (pₒ = 1)
• κ = 0: agreement exactly at the chance level (pₒ = pₑ), which is what a trivial majority-class classifier achieves
• κ < 0: agreement below chance, usually a sign of systematically inverted or shifted predictions
Relationship to accuracy:

Observed agreement pₒ is exactly the classification accuracy, so kappa is accuracy rescaled against the chance baseline: κ = (accuracy - pₑ) / (1 - pₑ). Kappa can therefore never exceed accuracy (the two coincide only when pₑ = 0 or accuracy is perfect), and the gap between them typically widens as the class distribution becomes more imbalanced, because pₑ grows.
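A quick sketch verifying this identity numerically: the manually chance-corrected accuracy matches sklearn's cohen_kappa_score. The toy labels are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score, confusion_matrix

# Illustrative toy labels
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 0, 0, 0, 1, 2, 1, 1, 0, 2])

acc = accuracy_score(y_true, y_pred)             # this is p_o
C = confusion_matrix(y_true, y_pred)
N = C.sum()
p_e = np.sum(C.sum(axis=1) * C.sum(axis=0)) / (N * N)

kappa_manual = (acc - p_e) / (1 - p_e)
kappa_sklearn = cohen_kappa_score(y_true, y_pred)

print(round(kappa_manual, 6) == round(kappa_sklearn, 6))  # True
```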
The key advantage of kappa is its chance correction. This makes it particularly valuable in several scenarios where raw accuracy fails.
```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, accuracy_score


def demonstrate_kappa_advantage():
    """Show situations where kappa provides critical insights."""
    # Scenario: Fraud detection (1% fraud rate)
    np.random.seed(42)
    n = 10000

    # Ground truth: 1% fraud
    y_true = np.array([0] * 9900 + [1] * 100)

    # Model A: Always predicts non-fraud (trivial)
    y_pred_trivial = np.zeros(n)

    # Model B: Catches 50% of fraud with 2% false positive rate
    y_pred_good = np.zeros(n)
    y_pred_good[9700:9900] = 1  # 200 false positives (2%)
    y_pred_good[9900:9950] = 1  # 50 true positives (50% recall)

    print("Fraud Detection Comparison (1% fraud rate):")
    print("=" * 55)
    print(f"{'Model':<25} {'Accuracy':<12} {'Kappa':<12}")
    print("-" * 55)

    acc_t = accuracy_score(y_true, y_pred_trivial)
    kap_t = cohen_kappa_score(y_true, y_pred_trivial)
    print(f"{'Trivial (always 0)':<25} {acc_t:<12.3f} {kap_t:<12.3f}")

    acc_g = accuracy_score(y_true, y_pred_good)
    kap_g = cohen_kappa_score(y_true, y_pred_good)
    print(f"{'Discriminative':<25} {acc_g:<12.3f} {kap_g:<12.3f}")

    print("-" * 55)
    print("\nInterpretation:")
    print(f"  Trivial model: {acc_t*100:.1f}% accuracy but κ={kap_t:.3f} (useless)")
    print(f"  Good model: {acc_g*100:.1f}% accuracy with κ={kap_g:.3f} (discriminates)")
    print("  Accuracy difference: {:.1f}% (misleading)".format((acc_t - acc_g) * 100))
    print("  Kappa difference: {:.3f} (reveals true value)".format(kap_g - kap_t))


if __name__ == "__main__":
    demonstrate_kappa_advantage()
```

Cohen's kappa extends naturally to multi-class problems. The same formula applies, but expected agreement now accounts for all K classes.
For K classes:
pₑ = Σₖ P(pred=k) × P(truth=k) = Σₖ (col_sumₖ/N) × (row_sumₖ/N)
The computation remains identical—just sum over more classes.
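As a brief illustration, the following sketch computes multi-class kappa by hand on a made-up 3-class problem and cross-checks it against sklearn; the label counts are illustrative.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

# Illustrative 3-class example (e.g., negative / neutral / positive)
y_true = np.array([0]*50 + [1]*30 + [2]*20)
y_pred = np.concatenate([
    [0]*42 + [1]*5 + [2]*3,    # class 0: 42 correct
    [1]*21 + [0]*6 + [2]*3,    # class 1: 21 correct
    [2]*14 + [0]*3 + [1]*3,    # class 2: 14 correct
])

C = confusion_matrix(y_true, y_pred)
N = C.sum()
p_o = np.trace(C) / N                                  # observed agreement
p_e = np.sum(C.sum(axis=1) * C.sum(axis=0)) / (N * N)  # sum over all 3 classes

kappa_manual = (p_o - p_e) / (1 - p_e)
print(f"p_o={p_o:.3f}  p_e={p_e:.3f}  kappa={kappa_manual:.3f}")
print(f"sklearn: {cohen_kappa_score(y_true, y_pred):.3f}")   # matches
```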
Weighted kappa for ordinal classes:
When classes have natural ordering (e.g., ratings 1-5), standard kappa treats all disagreements equally. Weighted kappa assigns different penalties based on the distance between the true and predicted class:

κ_w = 1 - (Σᵢⱼ wᵢⱼ Oᵢⱼ) / (Σᵢⱼ wᵢⱼ Eᵢⱼ)

where Oᵢⱼ is the observed count of samples with truth i and prediction j, Eᵢⱼ = row_sumᵢ × col_sumⱼ / N is the count expected under independence, and the weights are wᵢⱼ = |i - j| for linear weighting or wᵢⱼ = (i - j)² for quadratic weighting (any constant normalization of the weights cancels in the ratio).
Quadratic weighting penalizes large disagreements more heavily, making it preferred for ordinal scales.
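To make the weighting concrete, here is a minimal sketch, assuming the weight definitions above, that builds the distance-based weight matrix explicitly and checks the result against sklearn's cohen_kappa_score; the helper name weighted_kappa_manual and the rating arrays are illustrative. The fuller demo below then compares unweighted, linear, and quadratic kappa on synthetic rating data.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix


def weighted_kappa_manual(y_true, y_pred, power: int = 2) -> float:
    """Weighted kappa with w_ij = |i - j|**power (1 = linear, 2 = quadratic)."""
    labels = np.unique(np.concatenate([y_true, y_pred]))
    C = confusion_matrix(y_true, y_pred, labels=labels)     # observed counts O_ij
    N = C.sum()
    expected = np.outer(C.sum(axis=1), C.sum(axis=0)) / N   # counts expected under independence E_ij

    idx = np.arange(len(labels))
    W = np.abs(idx[:, None] - idx[None, :]) ** power        # distance-based penalty matrix
    return 1 - np.sum(W * C) / np.sum(W * expected)


# Illustrative ordinal labels (1-5 star ratings)
y_true = np.array([1, 2, 3, 4, 5, 5, 4, 3, 2, 1, 3, 3, 4, 2, 5])
y_pred = np.array([1, 2, 4, 4, 4, 5, 3, 3, 1, 1, 3, 2, 5, 2, 5])

print(round(weighted_kappa_manual(y_true, y_pred, power=2), 6))
print(round(cohen_kappa_score(y_true, y_pred, weights='quadratic'), 6))  # should match
```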
```python
import numpy as np
from sklearn.metrics import cohen_kappa_score


def compute_weighted_kappa(
        y_true: np.ndarray,
        y_pred: np.ndarray,
        weights: str = 'quadratic') -> float:
    """
    Compute weighted Cohen's kappa for ordinal classification.

    Parameters
    ----------
    y_true, y_pred : array-like
        True and predicted labels
    weights : {'linear', 'quadratic'}
        Weighting scheme for disagreements

    Returns
    -------
    weighted_kappa : float
    """
    return cohen_kappa_score(y_true, y_pred, weights=weights)


def demonstrate_weighted_kappa():
    """Show effect of weighting on ordinal classification."""
    np.random.seed(0)  # fix the seed so the demonstration is reproducible

    # Product rating prediction (1-5 stars)
    y_true = np.array([1, 2, 3, 4, 5, 1, 2, 3, 4, 5] * 20)

    # Prediction A: Off by 1 on most predictions
    y_pred_close = np.clip(y_true + np.random.choice([-1, 0, 1], len(y_true)), 1, 5)

    # Prediction B: Occasionally very wrong (1 predicted as 5)
    y_pred_extreme = y_true.copy()
    extreme_errors = np.random.choice(len(y_true), 20)
    y_pred_extreme[extreme_errors] = 6 - y_true[extreme_errors]  # Flip extremes

    print("Weighted Kappa for Ordinal Classification:")
    print("=" * 60)
    print(f"{'Prediction Type':<20} {'Unweighted':<15} {'Linear':<15} {'Quadratic':<15}")
    print("-" * 60)

    for name, y_p in [("Close errors", y_pred_close), ("Extreme errors", y_pred_extreme)]:
        k_none = cohen_kappa_score(y_true, y_p, weights=None)
        k_lin = cohen_kappa_score(y_true, y_p, weights='linear')
        k_quad = cohen_kappa_score(y_true, y_p, weights='quadratic')
        print(f"{name:<20} {k_none:<15.3f} {k_lin:<15.3f} {k_quad:<15.3f}")

    print("-" * 60)
    print("Note: Quadratic weights penalize extreme errors more severely")


if __name__ == "__main__":
    demonstrate_weighted_kappa()
```

Despite its advantages, Cohen's kappa has known limitations that practitioners should understand.
Known limitations:

• The kappa paradox: with highly imbalanced marginal distributions, kappa can be surprisingly low even when observed agreement is high, because pₑ is itself very high.
• Sensitivity to prevalence and marginal distributions: classifiers with similar error patterns can receive different kappa values on datasets with different class balance, which complicates cross-dataset comparisons.
• A single number: kappa summarizes overall agreement but does not reveal which classes are confused or whether errors are false positives or false negatives.

Alternatives to consider:

• Matthews correlation coefficient (MCC), another chance-aware single-number summary defined for binary and multi-class problems
• Balanced accuracy, which averages per-class recall and is easy to interpret on imbalanced data
• Krippendorff's alpha and Scott's pi, related chance-corrected agreement statistics from the inter-rater reliability literature
• Per-class precision, recall, and F1, when you need to know where the errors occur rather than a single summary score
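As a small sketch of how these alternatives behave, the snippet below re-scores the earlier fraud-detection example with kappa, MCC, and balanced accuracy from sklearn; the data setup is the same illustrative one used above.

```python
import numpy as np
from sklearn.metrics import (balanced_accuracy_score, cohen_kappa_score,
                             matthews_corrcoef)

# Reusing the 1%-fraud setup from the earlier demo (illustrative data)
y_true = np.array([0] * 9900 + [1] * 100)
y_pred = np.zeros(10000, dtype=int)
y_pred[9700:9900] = 1   # 200 false positives
y_pred[9900:9950] = 1   # 50 true positives

print(f"kappa:             {cohen_kappa_score(y_true, y_pred):.3f}")
print(f"MCC:               {matthews_corrcoef(y_true, y_pred):.3f}")
print(f"balanced accuracy: {balanced_accuracy_score(y_true, y_pred):.3f}")
```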
You now understand Cohen's kappa as a chance-corrected measure of agreement. Next, we'll explore the multi-class confusion matrix in depth—the foundation from which all multi-class metrics are derived.