When evaluating binary classifiers, metrics like precision, recall, and F1-score have unambiguous definitions. But what happens when your classifier must distinguish between three, ten, or even thousands of classes? How do you summarize model performance across diverse class categories into a single, interpretable number?
This is the fundamental challenge of multi-class evaluation, and the answer lies in averaging strategies. The choice between micro and macro averaging isn't merely technical—it fundamentally changes what your metric measures, whose errors matter most, and what optimization objectives your model ultimately pursues.
By the end of this page, you will understand the mathematical foundations of micro and macro averaging, recognize when each approach is appropriate, diagnose situations where the two methods yield dramatically different results, and make principled decisions about evaluation strategy for real-world multi-class problems.
Consider a sentiment classification system for product reviews that distinguishes between Positive, Neutral, and Negative sentiments. After evaluation on a test set, you obtain the following per-class metrics:
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Positive | 0.92 | 0.88 | 0.90 | 4,500 |
| Neutral | 0.67 | 0.72 | 0.69 | 500 |
| Negative | 0.85 | 0.79 | 0.82 | 1,000 |
Now, your stakeholder asks: "What's the model's overall precision?"
This seemingly simple question has no single correct answer. The three per-class precision values (0.92, 0.67, 0.85) must somehow be combined, but the method of combination encodes assumptions about what matters.
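To make the ambiguity concrete, here is a small sketch that derives several candidate "overall" numbers directly from the table above. The accuracy line assumes the single-label setting, where total correct predictions equal recall × support summed over classes; everything else is simple arithmetic on the reported values.

```python
# Per-class numbers copied from the table above
precision = {"Positive": 0.92, "Neutral": 0.67, "Negative": 0.85}
recall    = {"Positive": 0.88, "Neutral": 0.72, "Negative": 0.79}
support   = {"Positive": 4500, "Neutral": 500,  "Negative": 1000}

# Unweighted mean over classes (macro view: every class counts equally)
macro_precision = sum(precision.values()) / len(precision)

# Support-weighted mean (classes weighted by how often they occur)
total = sum(support.values())
weighted_precision = sum(precision[c] * support[c] for c in precision) / total

# Overall accuracy, recovered from recall * support = true positives per class
accuracy = sum(recall[c] * support[c] for c in recall) / total

print(f"Macro precision:    {macro_precision:.3f}")    # ~0.813
print(f"Weighted precision: {weighted_precision:.3f}")  # ~0.887
print(f"Accuracy:           {accuracy:.3f}")            # ~0.852
```

Three defensible answers to the same question, differing only in the implicit weighting.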
Every averaging strategy embeds an implicit weighting scheme. There is no 'neutral' or 'default' choice—each method prioritizes different aspects of performance. Understanding these implicit weights is crucial for making informed evaluation decisions.
The fundamental tension:
Instance-level view: Each prediction matters equally, regardless of class. A correct prediction on a rare class is worth exactly as much as a correct prediction on a common class.
Class-level view: Each class matters equally, regardless of frequency. Performance on minority classes should influence the overall metric as much as performance on majority classes.
Micro and macro averaging represent these two philosophies respectively, and choosing between them is a modeling decision with real consequences.
Macro averaging treats all classes as equal citizens. It computes the metric independently for each class, then takes a simple (unweighted) arithmetic mean across classes. This approach gives every class exactly the same influence on the final score, regardless of how many samples belong to that class.
For a metric M (such as precision, recall, or F1) across K classes:
Macro-M = (1/K) × Σᵢ Mᵢ
where Mᵢ is the metric computed for class i using only predictions and labels involving class i.
Computing macro-averaged metrics step by step:
Step 1: Compute per-class confusion matrix entries
For each class i, count TPᵢ (samples of class i predicted as i), FPᵢ (samples of other classes predicted as i), and FNᵢ (samples of class i predicted as some other class).
Step 2: Compute per-class metrics
For each class i: Precisionᵢ = TPᵢ / (TPᵢ + FPᵢ), Recallᵢ = TPᵢ / (TPᵢ + FNᵢ), and F1ᵢ = 2 × Precisionᵢ × Recallᵢ / (Precisionᵢ + Recallᵢ).
Step 3: Average across classes
Take the unweighted arithmetic mean of each per-class metric, as in the Macro-M formula above.
```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support
from typing import Tuple, List


def compute_macro_metrics(
    y_true: np.ndarray,
    y_pred: np.ndarray
) -> Tuple[float, float, float]:
    """
    Compute macro-averaged precision, recall, and F1 score.

    Macro averaging gives equal weight to each class, regardless of
    the number of samples in that class. This is particularly useful
    when class balance in the test set doesn't reflect production
    importance, or when minority class performance matters equally.

    Parameters
    ----------
    y_true : array-like of shape (n_samples,)
        Ground truth class labels
    y_pred : array-like of shape (n_samples,)
        Predicted class labels

    Returns
    -------
    macro_precision : float
        Macro-averaged precision
    macro_recall : float
        Macro-averaged recall
    macro_f1 : float
        Macro-averaged F1 score

    Mathematical Foundation
    -----------------------
    For K classes:
        Macro-M = (1/K) * sum(M_i for i in range(K))
    where M_i is the metric computed for class i.
    """
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average='macro', zero_division=0
    )
    return precision, recall, f1


def compute_macro_metrics_manual(
    y_true: np.ndarray,
    y_pred: np.ndarray
) -> Tuple[float, float, float, dict]:
    """
    Manual implementation of macro averaging for educational purposes.

    This function explicitly shows the per-class computation and
    averaging process, making the mathematical structure transparent.
    """
    classes = np.unique(np.concatenate([y_true, y_pred]))
    n_classes = len(classes)

    per_class_metrics = {}

    for cls in classes:
        # Binary mask for current class
        true_positive = np.sum((y_pred == cls) & (y_true == cls))
        false_positive = np.sum((y_pred == cls) & (y_true != cls))
        false_negative = np.sum((y_pred != cls) & (y_true == cls))

        # Compute per-class metrics with zero division handling
        precision = true_positive / (true_positive + false_positive) \
            if (true_positive + false_positive) > 0 else 0.0
        recall = true_positive / (true_positive + false_negative) \
            if (true_positive + false_negative) > 0 else 0.0
        f1 = 2 * precision * recall / (precision + recall) \
            if (precision + recall) > 0 else 0.0

        per_class_metrics[cls] = {
            'precision': precision,
            'recall': recall,
            'f1': f1,
            'support': np.sum(y_true == cls)
        }

    # Simple arithmetic mean across classes
    macro_precision = np.mean([m['precision'] for m in per_class_metrics.values()])
    macro_recall = np.mean([m['recall'] for m in per_class_metrics.values()])
    macro_f1 = np.mean([m['f1'] for m in per_class_metrics.values()])

    return macro_precision, macro_recall, macro_f1, per_class_metrics


# Example: Demonstrating macro averaging
def demonstrate_macro_averaging():
    """
    Demonstration showing how macro averaging handles imbalanced data.
    """
    # Simulated predictions: highly imbalanced classes
    # Class 0: 900 samples, Class 1: 80 samples, Class 2: 20 samples
    np.random.seed(42)

    # Ground truth
    y_true = np.array(
        [0] * 900 + [1] * 80 + [2] * 20
    )

    # Predictions: model performs well on majority, poorly on minority
    y_pred = np.array(
        [0] * 850 + [1] * 30 + [2] * 20 +   # Class 0: 850/900 correct
        [1] * 50 + [0] * 20 + [2] * 10 +    # Class 1: 50/80 correct
        [2] * 5 + [0] * 10 + [1] * 5        # Class 2: 5/20 correct
    )

    # Compute metrics
    macro_p, macro_r, macro_f1, per_class = compute_macro_metrics_manual(
        y_true, y_pred
    )

    print("Per-class metrics:")
    print("-" * 60)
    for cls, metrics in per_class.items():
        print(f"Class {cls}: P={metrics['precision']:.3f}, "
              f"R={metrics['recall']:.3f}, F1={metrics['f1']:.3f}, "
              f"Support={metrics['support']}")
    print("-" * 60)
    print(f"Macro-averaged: P={macro_p:.3f}, R={macro_r:.3f}, F1={macro_f1:.3f}")

    return per_class


if __name__ == "__main__":
    demonstrate_macro_averaging()
```

Use macro averaging when:
• Every class is equally important regardless of frequency (e.g., disease classification where rare diseases matter as much as common ones)
• Test set imbalance doesn't reflect production importance (e.g., your test set over-represents certain classes)
• You want to penalize poor minority class performance explicitly
• Fairness across groups is a priority (e.g., demographic categories in credit scoring)
Micro averaging aggregates contributions from all classes to compute a global metric. Rather than computing per-class metrics and then averaging, it pools all true positives, false positives, and false negatives across classes, then computes the metric on these aggregated counts.
This approach treats each instance equally—a correct prediction is worth exactly the same regardless of which class it belongs to.
For precision, recall, and F1:
Micro-Precision = Σᵢ TPᵢ / Σᵢ (TPᵢ + FPᵢ)
Micro-Recall = Σᵢ TPᵢ / Σᵢ (TPᵢ + FNᵢ)
Micro-F1 = 2 × (Micro-P × Micro-R) / (Micro-P + Micro-R)
where the sums are over all K classes.
Critical insight: Micro-averaged metrics are dominated by majority classes
Because micro averaging pools counts across all classes, classes with more samples contribute proportionally more to the aggregated counts. If 90% of your test set belongs to class A, then class A's performance will account for approximately 90% of the micro-averaged metric.
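A quick back-of-the-envelope example (with assumed counts and recalls, not taken from the tables above) shows the effect:

```python
# Class A: 900 samples at 95% recall; class B: 100 samples at 50% recall
tp_a, n_a = int(0.95 * 900), 900   # 855 correct
tp_b, n_b = int(0.50 * 100), 100   # 50 correct

micro_recall = (tp_a + tp_b) / (n_a + n_b)     # (855 + 50) / 1000 = 0.905
macro_recall = (tp_a / n_a + tp_b / n_b) / 2   # (0.95 + 0.50) / 2 = 0.725

print(micro_recall, macro_recall)
```

The micro figure sits close to class A's performance simply because class A supplies 90% of the pooled counts.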
The equivalence with accuracy:
In multi-class classification (where each sample has exactly one true label and one predicted label), an important result emerges:
Micro-Precision = Micro-Recall = Micro-F1 = Accuracy
This is because, in the single-label setting, every misclassified sample generates exactly one false positive (for the predicted class) and one false negative (for the true class), so Σᵢ FPᵢ = Σᵢ FNᵢ equals the number of errors, while Σᵢ TPᵢ equals the number of correct predictions.
And since each sample contributes exactly once—to the pooled TP count if correct, or to both the pooled FP and FN counts if incorrect—we get Micro-Precision = Micro-Recall = Σᵢ TPᵢ / N = Accuracy, where N is the total number of samples, and therefore Micro-F1 = Accuracy as well.
```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
from typing import Tuple, Dict


def compute_micro_metrics(
    y_true: np.ndarray,
    y_pred: np.ndarray
) -> Tuple[float, float, float]:
    """
    Compute micro-averaged precision, recall, and F1 score.

    Micro averaging pools all samples together, treating each instance
    equally regardless of class. This gives a global measure of model
    performance weighted by class frequency.

    In single-label multi-class classification:
        Micro-Precision = Micro-Recall = Micro-F1 = Accuracy

    Parameters
    ----------
    y_true : array-like of shape (n_samples,)
        Ground truth class labels
    y_pred : array-like of shape (n_samples,)
        Predicted class labels

    Returns
    -------
    micro_precision : float
        Micro-averaged precision
    micro_recall : float
        Micro-averaged recall
    micro_f1 : float
        Micro-averaged F1 score
    """
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average='micro', zero_division=0
    )
    return precision, recall, f1


def compute_micro_metrics_manual(
    y_true: np.ndarray,
    y_pred: np.ndarray
) -> Tuple[float, float, float, Dict]:
    """
    Manual implementation of micro averaging for educational purposes.

    This function explicitly shows the pooling of counts across classes,
    demonstrating why micro metrics are dominated by majority classes.
    """
    classes = np.unique(np.concatenate([y_true, y_pred]))

    # Aggregate counts across all classes
    total_tp = 0
    total_fp = 0
    total_fn = 0

    class_details = {}

    for cls in classes:
        tp = np.sum((y_pred == cls) & (y_true == cls))
        fp = np.sum((y_pred == cls) & (y_true != cls))
        fn = np.sum((y_pred != cls) & (y_true == cls))

        total_tp += tp
        total_fp += fp
        total_fn += fn

        class_details[cls] = {'tp': tp, 'fp': fp, 'fn': fn}

    # Compute metrics from aggregated counts
    micro_precision = total_tp / (total_tp + total_fp) \
        if (total_tp + total_fp) > 0 else 0.0
    micro_recall = total_tp / (total_tp + total_fn) \
        if (total_tp + total_fn) > 0 else 0.0
    micro_f1 = 2 * micro_precision * micro_recall / \
        (micro_precision + micro_recall) \
        if (micro_precision + micro_recall) > 0 else 0.0

    aggregated = {
        'total_tp': total_tp,
        'total_fp': total_fp,
        'total_fn': total_fn,
        'per_class': class_details
    }

    return micro_precision, micro_recall, micro_f1, aggregated


def demonstrate_micro_accuracy_equivalence():
    """
    Demonstrate that micro-F1 equals accuracy in single-label classification.
    """
    np.random.seed(42)

    # Generate random multi-class predictions
    n_samples = 1000
    n_classes = 5

    y_true = np.random.randint(0, n_classes, n_samples)
    y_pred = y_true.copy()

    # Introduce some errors
    error_idx = np.random.choice(n_samples, 200, replace=False)
    y_pred[error_idx] = np.random.randint(0, n_classes, len(error_idx))

    # Compute metrics
    accuracy = accuracy_score(y_true, y_pred)
    micro_p, micro_r, micro_f1, _ = compute_micro_metrics_manual(y_true, y_pred)

    print("Demonstrating Micro-F1 = Accuracy equivalence:")
    print("-" * 50)
    print(f"Accuracy:        {accuracy:.6f}")
    print(f"Micro-Precision: {micro_p:.6f}")
    print(f"Micro-Recall:    {micro_r:.6f}")
    print(f"Micro-F1:        {micro_f1:.6f}")
    print("-" * 50)
    print(f"Difference (Micro-F1 - Accuracy): {abs(micro_f1 - accuracy):.10f}")

    return accuracy, micro_f1


def show_majority_class_dominance():
    """
    Illustrate how micro averaging is dominated by majority classes.
    """
    # Extreme imbalance: 950 samples from class 0, 25 each from classes 1 and 2
    y_true = np.array([0] * 950 + [1] * 25 + [2] * 25)

    # Model: perfect on class 0, terrible on minority classes
    y_pred = np.array(
        [0] * 950 +   # Class 0: 100% correct
        [0] * 25 +    # Class 1: 0% correct (all predicted as 0)
        [0] * 25      # Class 2: 0% correct (all predicted as 0)
    )

    # Compute both averaging methods
    micro_p, micro_r, micro_f1, _ = compute_micro_metrics_manual(y_true, y_pred)
    # Macro metrics via scikit-learn here; the manual macro implementation
    # lives in the earlier macro-averaging example
    macro_p, macro_r, macro_f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average='macro', zero_division=0
    )

    print("Majority Class Dominance in Micro Averaging:")
    print("=" * 60)
    print("Scenario: 95% class 0, 2.5% class 1, 2.5% class 2")
    print("Model: Perfect on class 0, completely fails on classes 1 and 2")
    print("-" * 60)
    print("Micro-averaged metrics (dominated by class 0):")
    print(f"  Precision: {micro_p:.3f}")
    print(f"  Recall:    {micro_r:.3f}")
    print(f"  F1:        {micro_f1:.3f}")
    print("-" * 60)
    print("Macro-averaged metrics (equal weight to all classes):")
    print(f"  Precision: {macro_p:.3f}")
    print(f"  Recall:    {macro_r:.3f}")
    print(f"  F1:        {macro_f1:.3f}")
    print("-" * 60)
    print("Observation: Micro shows 0.95 F1 (looks great!)")
    print("             Macro shows 0.32 F1 (reveals minority class failure)")


if __name__ == "__main__":
    demonstrate_micro_accuracy_equivalence()
    show_majority_class_dominance()
```

Use micro averaging when:
• Class frequency in the test set reflects production importance (e.g., high-volume classifications where common cases matter most)
• Overall throughput accuracy matters (e.g., spam filtering where total correct classifications drive user experience)
• You want an interpretable 'overall accuracy' equivalent for stakeholder communication
• Multi-label classification where samples can have multiple labels (micro averaging is more meaningful here)
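For the multi-label case mentioned in the last bullet, a minimal sketch with toy indicator matrices (all values assumed for illustration) shows that micro and macro F1 remain well defined, while micro-F1 no longer collapses to accuracy because each label decision is pooled individually:

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy multi-label data in indicator format: 4 samples, 3 labels (assumed values)
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0],
                   [0, 0, 1]])

print("Micro F1:", f1_score(y_true, y_pred, average='micro', zero_division=0))  # 0.80
print("Macro F1:", f1_score(y_true, y_pred, average='macro', zero_division=0))  # ~0.78
# Exact-match (subset) accuracy here is only 0.50, so micro-F1 != accuracy
```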
The difference between micro and macro averaging becomes most pronounced when class distributions are imbalanced and model performance varies across classes. Understanding when and why these methods diverge is essential for proper model evaluation.
| Property | Micro Averaging | Macro Averaging |
|---|---|---|
| Weighting scheme | Implicit weight by class frequency | Equal weight for all classes |
| Sensitivity to class imbalance | Dominated by majority classes | Equally sensitive to all classes |
| Relationship to accuracy | Equal to accuracy (single-label) | Not directly related to accuracy |
| Minority class influence | Minimal if minority is small | Equal influence regardless of size |
| Interpretability | 'Overall' performance | 'Average class' performance |
| Use in imbalanced data | May hide minority class failures | Reveals minority class issues |
| Computational complexity | O(n) single pass possible | O(n × K) or O(n) with per-class tracking |
Case study: Divergence in practice
Consider a medical diagnosis system classifying X-ray images into three categories: Normal (the large majority of cases), Condition A (uncommon), and Condition B (rare).
A model achieves excellent per-class F1 on Normal but much weaker F1 on the two rarer conditions, with Condition B detection the weakest—and the aggregated scores reflect this asymmetry very differently:
Micro-averaged F1: Approximately 0.92 (dominated by Normal's excellent performance)
Macro-averaged F1: Approximately 0.65 (reveals poor minority class performance)
Which metric tells the true story? It depends on the application context:
If the goal is to triage patients efficiently (most are healthy), the micro metric suggests the system works well overall.
If missing any disease is critical (false negatives are dangerous), the macro metric reveals that Condition B detection is severely inadequate.
A deployment decision based solely on micro-F1 might approve a dangerous system; macro-F1 would flag the issue.
A model optimizing for micro-averaged metrics can achieve high scores by focusing exclusively on majority classes while ignoring minorities. Always examine per-class metrics alongside aggregated metrics to detect this failure mode.
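In practice, scikit-learn's classification_report shows both views in one call—per-class rows plus micro ("accuracy"), macro, and weighted averages. The sketch below uses made-up predictions for an imbalanced three-class problem:

```python
import numpy as np
from sklearn.metrics import classification_report

# Hypothetical imbalanced ground truth: 900 / 80 / 20 samples per class
y_true = np.array([0] * 900 + [1] * 80 + [2] * 20)

# Model that is strong on class 0, weak on classes 1 and 2 (assumed errors)
y_pred = y_true.copy()
y_pred[880:900] = 1   # some class-0 samples misclassified as 1
y_pred[900:940] = 0   # half of class 1 misclassified as 0
y_pred[980:995] = 0   # most of class 2 misclassified as 0

# Per-class rows make the minority-class failure impossible to miss
print(classification_report(y_true, y_pred, digits=3, zero_division=0))
```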
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_fscore_support


def analyze_metric_divergence(
    max_imbalance_ratio: int = 100,
    majority_performance: float = 0.95,
    minority_performance: float = 0.50,
    n_minority_samples: int = 100
) -> dict:
    """
    Analyze how micro and macro metrics diverge as class imbalance increases.

    This function systematically varies the imbalance ratio while keeping
    per-class performance constant, showing how micro averaging becomes
    increasingly insensitive to minority class failures.

    Parameters
    ----------
    max_imbalance_ratio : int
        Maximum ratio of majority to minority class samples
    majority_performance : float
        Accuracy on majority class (held constant)
    minority_performance : float
        Accuracy on minority class (held constant)
    n_minority_samples : int
        Fixed number of minority class samples

    Returns
    -------
    results : dict
        Contains imbalance ratios and corresponding metric values
    """
    imbalance_ratios = list(range(1, max_imbalance_ratio + 1, 5))
    micro_f1_scores = []
    macro_f1_scores = []

    for ratio in imbalance_ratios:
        n_majority = n_minority_samples * ratio

        # Generate ground truth
        y_true = np.array([0] * n_majority + [1] * n_minority_samples)

        # Generate predictions with fixed per-class performance
        n_correct_majority = int(n_majority * majority_performance)
        n_correct_minority = int(n_minority_samples * minority_performance)

        y_pred_majority = np.array(
            [0] * n_correct_majority + [1] * (n_majority - n_correct_majority)
        )
        y_pred_minority = np.array(
            [1] * n_correct_minority + [0] * (n_minority_samples - n_correct_minority)
        )
        y_pred = np.concatenate([y_pred_majority, y_pred_minority])

        # Compute metrics
        _, _, micro_f1, _ = precision_recall_fscore_support(
            y_true, y_pred, average='micro', zero_division=0
        )
        _, _, macro_f1, _ = precision_recall_fscore_support(
            y_true, y_pred, average='macro', zero_division=0
        )

        micro_f1_scores.append(micro_f1)
        macro_f1_scores.append(macro_f1)

    results = {
        'imbalance_ratios': imbalance_ratios,
        'micro_f1': micro_f1_scores,
        'macro_f1': macro_f1_scores,
        'divergence': [m - M for m, M in zip(micro_f1_scores, macro_f1_scores)]
    }

    return results


def visualize_divergence(results: dict):
    """
    Visualize the divergence between micro and macro averaging.
    """
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Plot 1: Metric values
    ax1 = axes[0]
    ax1.plot(results['imbalance_ratios'], results['micro_f1'],
             'b-', linewidth=2, label='Micro F1')
    ax1.plot(results['imbalance_ratios'], results['macro_f1'],
             'r-', linewidth=2, label='Macro F1')
    ax1.set_xlabel('Imbalance Ratio (Majority/Minority)', fontsize=12)
    ax1.set_ylabel('F1 Score', fontsize=12)
    ax1.set_title('Metric Values vs Class Imbalance', fontsize=14)
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    ax1.set_ylim(0, 1)

    # Plot 2: Divergence
    ax2 = axes[1]
    ax2.fill_between(results['imbalance_ratios'], 0, results['divergence'],
                     alpha=0.3, color='purple')
    ax2.plot(results['imbalance_ratios'], results['divergence'],
             'purple', linewidth=2)
    ax2.set_xlabel('Imbalance Ratio (Majority/Minority)', fontsize=12)
    ax2.set_ylabel('Micro F1 - Macro F1', fontsize=12)
    ax2.set_title('Divergence Between Averaging Methods', fontsize=14)
    ax2.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.savefig('metric_divergence.png', dpi=150, bbox_inches='tight')
    plt.show()

    return fig


if __name__ == "__main__":
    results = analyze_metric_divergence(
        max_imbalance_ratio=100,
        majority_performance=0.95,
        minority_performance=0.50
    )

    print("Divergence Analysis Results:")
    print("-" * 50)
    for ratio, micro, macro in zip(
        results['imbalance_ratios'][:10],
        results['micro_f1'][:10],
        results['macro_f1'][:10]
    ):
        print(f"Ratio {ratio:3d}:1 | Micro: {micro:.3f} | Macro: {macro:.3f} | "
              f"Gap: {micro - macro:.3f}")

    visualize_divergence(results)
```

Understanding the mathematical relationships between micro and macro averaging provides deeper insight into their behavior and helps predict when they will diverge significantly.
Bound relationship: Macro-averaged metrics are bounded by the range of per-class metrics:
min(Mᵢ) ≤ Macro-M ≤ max(Mᵢ)
Weighted representation: Micro averaging can be expressed as a weighted macro average:
Micro-M = Σᵢ wᵢ × Mᵢ
where wᵢ = (TPᵢ + FPᵢ) / Σⱼ(TPⱼ + FPⱼ) for precision, and wᵢ = (TPᵢ + FNᵢ) / Σⱼ(TPⱼ + FNⱼ) for recall.
Equality condition: Micro = Macro when every class has the same per-class metric value, or when the implicit micro weights are uniform (wᵢ = 1/K for all i)—which for recall corresponds to equal class supports, and for precision to equal numbers of predictions per class.
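The weighted representation is easy to verify numerically. The following sketch, using assumed toy counts, recovers micro precision as a prediction-count-weighted sum of per-class precisions:

```python
import numpy as np
from sklearn.metrics import precision_score

# Toy 3-class predictions (assumed data) to check the identity numerically
y_true = np.array([0] * 50 + [1] * 30 + [2] * 20)
y_pred = np.concatenate([
    np.array([0] * 40 + [1] * 6 + [2] * 4),    # samples whose true class is 0
    np.array([1] * 21 + [0] * 6 + [2] * 3),    # samples whose true class is 1
    np.array([2] * 12 + [0] * 5 + [1] * 3),    # samples whose true class is 2
])

classes = np.unique(np.concatenate([y_true, y_pred]))
per_class_p = precision_score(y_true, y_pred, average=None, zero_division=0)

# Implicit micro weights for precision: each class's share of the *predictions*
pred_counts = np.array([(y_pred == c).sum() for c in classes])
w = pred_counts / pred_counts.sum()

micro_p = precision_score(y_true, y_pred, average='micro', zero_division=0)
print("Weighted sum of per-class precisions:", np.sum(w * per_class_p))
print("Micro precision:                     ", micro_p)  # identical values
```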
Analyzing the deviation:
The gap between micro and macro averaging can be decomposed:
Micro - Macro = Σᵢ (wᵢ - 1/K) × Mᵢ
where:
• wᵢ is the implicit micro weight of class i (its share of the pooled counts)
• 1/K is the uniform weight that macro averaging assigns to every class
• Mᵢ is the metric value for class i
This reveals that the gap depends on two factors: how far the class weights deviate from uniform (the degree of imbalance), and whether the heavily weighted classes perform better or worse than the lightly weighted ones.
Maximum divergence occurs when: classes are highly imbalanced and performance is strongly correlated with class size—for example, a dominant class with excellent metrics alongside small classes with poor metrics.
Minimum divergence occurs when: classes are roughly balanced (all wᵢ ≈ 1/K) or per-class performance is nearly uniform, so the choice of weights barely matters.
```python
import numpy as np
from typing import List, Tuple


def decompose_micro_macro_gap(
    per_class_metrics: List[float],
    per_class_weights: List[float]
) -> Tuple[float, float, float, np.ndarray]:
    """
    Decompose the gap between micro and macro averaging.

    The gap can be written as:
        Micro - Macro = sum_i (w_i - 1/K) * M_i

    This function analyzes the contribution of each class to the gap.

    Parameters
    ----------
    per_class_metrics : list of float
        Metric value for each class (e.g., precision or F1)
    per_class_weights : list of float
        Implicit weight of each class in micro averaging

    Returns
    -------
    micro : float
        Micro-averaged metric
    macro : float
        Macro-averaged metric
    gap : float
        Difference (micro - macro)
    contributions : ndarray
        Per-class contribution to the gap
    """
    K = len(per_class_metrics)
    M = np.array(per_class_metrics)
    W = np.array(per_class_weights)

    # Normalize weights to sum to 1
    W = W / W.sum()

    # Compute averages
    micro = np.sum(W * M)
    macro = np.mean(M)

    # Decompose gap
    weight_deviations = W - 1 / K
    contributions = weight_deviations * M
    gap = np.sum(contributions)

    return micro, macro, gap, contributions


def analyze_gap_contributors():
    """
    Illustrate which classes contribute to micro-macro divergence.
    """
    # Example: 5 classes with varying sizes and performances
    per_class_f1 = [0.95, 0.88, 0.72, 0.55, 0.40]     # Performance
    per_class_size = [10000, 5000, 2000, 500, 100]    # Class sizes

    # Weights are proportional to size (simplified for F1)
    weights = np.array(per_class_size, dtype=float)

    micro, macro, gap, contributions = decompose_micro_macro_gap(
        per_class_f1, weights
    )

    print("Gap Decomposition Analysis:")
    print("=" * 65)
    print(f"{'Class':<8} {'Size':<10} {'Weight':<10} {'F1':<8} {'Contribution':<12}")
    print("-" * 65)

    norm_weights = weights / weights.sum()
    for i, (f1, size, w, c) in enumerate(zip(
        per_class_f1, per_class_size, norm_weights, contributions
    )):
        print(f"{i:<8} {size:<10} {w:<10.4f} {f1:<8.3f} {c:+.4f}")

    print("-" * 65)
    print(f"Micro F1: {micro:.4f}")
    print(f"Macro F1: {macro:.4f}")
    print(f"Gap (Micro - Macro): {gap:+.4f}")
    print("-" * 65)
    print("\nInterpretation:")
    print(f"  - Largest positive contribution comes from class {np.argmax(contributions)}")
    print(f"  - Largest negative contribution comes from class {np.argmin(contributions)}")
    print("  - Micro > Macro because large classes have high performance")


def prove_micro_bounds():
    """
    Demonstrate that micro averaging is bounded by weighted extremes.
    """
    # For any set of per-class metrics and weights
    np.random.seed(42)

    n_trials = 1000
    bounds_satisfied = 0

    for _ in range(n_trials):
        K = np.random.randint(3, 10)
        weights = np.random.dirichlet(np.ones(K))
        metrics = np.random.uniform(0, 1, K)

        micro = np.sum(weights * metrics)

        # Check bounds
        lower_bound = np.min(metrics)
        upper_bound = np.max(metrics)

        if lower_bound <= micro <= upper_bound:
            bounds_satisfied += 1

    print(f"\nBounds verification ({n_trials} trials):")
    print(f"  Micro within [min(M_i), max(M_i)]: {bounds_satisfied}/{n_trials} "
          f"({100 * bounds_satisfied / n_trials:.1f}%)")

    return bounds_satisfied == n_trials


if __name__ == "__main__":
    analyze_gap_contributors()
    prove_micro_bounds()
```

Choosing between micro and macro averaging requires understanding your domain context, stakeholder requirements, and the specific failure modes you need to detect. Here's a structured framework for making this decision.
In academic papers and internal reports, always report both micro and macro metrics. The gap between them is itself informative—a large gap indicates either class imbalance, performance variation across classes, or both. Report per-class metrics for complete transparency.
| Use Case | Recommended Method | Rationale |
|---|---|---|
| Spam detection | Micro | Total throughput matters; rare spam is less critical |
| Medical diagnosis | Macro | Each disease equally important regardless of prevalence |
| Document classification | Depends | Topic distribution may or may not match importance |
| Sentiment analysis | Macro | Usually want balanced detection of all sentiments |
| Multi-label tagging | Micro | Per-label accuracy is the natural unit |
| Fraud detection | Macro or weighted | Rare fraud class must not be ignored |
| Named entity recognition | Macro | All entity types should be detected well |
| Image classification (balanced) | Either | When balanced, they approximate each other |
Correct implementation of averaging strategies requires attention to edge cases, proper handling of undefined metrics, and integration with existing evaluation pipelines.
```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support
from typing import Dict, Any, Optional


class MultiClassEvaluator:
    """
    Production-grade multi-class evaluation with comprehensive averaging.

    This class provides a complete evaluation framework that computes all
    averaging methods, handles edge cases, and produces stakeholder-
    friendly reports.
    """

    def __init__(
        self,
        class_names: Optional[Dict[int, str]] = None,
        zero_division: float = 0.0
    ):
        """
        Initialize the evaluator.

        Parameters
        ----------
        class_names : dict, optional
            Mapping from class index to human-readable name
        zero_division : float
            Value to return when a metric is undefined (division by zero)
        """
        self.class_names = class_names or {}
        self.zero_division = zero_division

    def evaluate(
        self,
        y_true: np.ndarray,
        y_pred: np.ndarray,
        y_prob: Optional[np.ndarray] = None
    ) -> Dict[str, Any]:
        """
        Perform comprehensive multi-class evaluation.

        Returns metrics at all levels: per-class, micro, macro, and weighted.

        Parameters
        ----------
        y_true : ndarray
            Ground truth labels
        y_pred : ndarray
            Predicted labels
        y_prob : ndarray, optional
            Predicted probabilities for each class

        Returns
        -------
        report : dict
            Complete evaluation report
        """
        classes = np.unique(np.concatenate([y_true, y_pred]))
        n_classes = len(classes)
        n_samples = len(y_true)

        # Per-class metrics
        per_class = {}
        for cls in classes:
            name = self.class_names.get(cls, f"Class_{cls}")

            tp = np.sum((y_pred == cls) & (y_true == cls))
            fp = np.sum((y_pred == cls) & (y_true != cls))
            fn = np.sum((y_pred != cls) & (y_true == cls))
            support = np.sum(y_true == cls)

            precision = self._safe_divide(tp, tp + fp)
            recall = self._safe_divide(tp, tp + fn)
            f1 = self._safe_divide(2 * precision * recall, precision + recall)

            per_class[name] = {
                'precision': precision,
                'recall': recall,
                'f1': f1,
                'support': int(support),
                'true_positives': int(tp),
                'false_positives': int(fp),
                'false_negatives': int(fn)
            }

        # Aggregated metrics
        micro_p, micro_r, micro_f1, _ = precision_recall_fscore_support(
            y_true, y_pred, average='micro', zero_division=self.zero_division
        )
        macro_p, macro_r, macro_f1, _ = precision_recall_fscore_support(
            y_true, y_pred, average='macro', zero_division=self.zero_division
        )
        weighted_p, weighted_r, weighted_f1, _ = precision_recall_fscore_support(
            y_true, y_pred, average='weighted', zero_division=self.zero_division
        )

        accuracy = np.mean(y_true == y_pred)

        # Compute gap as diagnostic
        micro_macro_gap = micro_f1 - macro_f1

        report = {
            'summary': {
                'n_samples': n_samples,
                'n_classes': n_classes,
                'accuracy': float(accuracy),
                'micro_macro_gap': float(micro_macro_gap),
                'class_balance': self._compute_balance_metrics(y_true, classes)
            },
            'micro': {
                'precision': float(micro_p),
                'recall': float(micro_r),
                'f1': float(micro_f1)
            },
            'macro': {
                'precision': float(macro_p),
                'recall': float(macro_r),
                'f1': float(macro_f1)
            },
            'weighted': {
                'precision': float(weighted_p),
                'recall': float(weighted_r),
                'f1': float(weighted_f1)
            },
            'per_class': per_class
        }

        return report

    def _safe_divide(self, numerator: float, denominator: float) -> float:
        """Safe division with zero handling."""
        return numerator / denominator if denominator > 0 else self.zero_division

    def _compute_balance_metrics(
        self,
        y_true: np.ndarray,
        classes: np.ndarray
    ) -> Dict[str, float]:
        """Compute class balance statistics."""
        counts = [np.sum(y_true == c) for c in classes]
        total = len(y_true)

        imbalance_ratio = max(counts) / min(counts) if min(counts) > 0 else float('inf')

        # Gini impurity of the class distribution (1 - sum of squared proportions)
        proportions = np.array(counts) / total
        gini = 1 - np.sum(proportions ** 2)

        return {
            'imbalance_ratio': float(imbalance_ratio),
            'normalized_entropy': float(self._normalized_entropy(proportions)),
            'gini_coefficient': float(gini)
        }

    def _normalized_entropy(self, proportions: np.ndarray) -> float:
        """Compute normalized entropy of class distribution."""
        proportions = proportions[proportions > 0]  # Avoid log(0)
        entropy = -np.sum(proportions * np.log(proportions))
        max_entropy = np.log(len(proportions))
        return entropy / max_entropy if max_entropy > 0 else 0.0

    def format_report(self, report: Dict[str, Any]) -> str:
        """Format report as human-readable string."""
        lines = []
        lines.append("=" * 70)
        lines.append("MULTI-CLASS CLASSIFICATION REPORT")
        lines.append("=" * 70)

        # Summary
        s = report['summary']
        lines.append(f"\nSamples: {s['n_samples']:,} | Classes: {s['n_classes']}")
        lines.append(f"Accuracy: {s['accuracy']:.4f}")
        lines.append(f"Micro-Macro Gap: {s['micro_macro_gap']:+.4f}")
        lines.append(f"Imbalance Ratio: {s['class_balance']['imbalance_ratio']:.2f}")

        # Aggregated metrics
        lines.append("\n" + "-" * 70)
        lines.append(f"{'Averaging':<12} {'Precision':<12} {'Recall':<12} {'F1':<12}")
        lines.append("-" * 70)
        for avg in ['micro', 'macro', 'weighted']:
            m = report[avg]
            lines.append(f"{avg.capitalize():<12} {m['precision']:<12.4f} "
                         f"{m['recall']:<12.4f} {m['f1']:<12.4f}")

        # Per-class
        lines.append("\n" + "-" * 70)
        lines.append(f"{'Class':<20} {'P':<8} {'R':<8} {'F1':<8} {'Support':<10}")
        lines.append("-" * 70)
        for name, m in report['per_class'].items():
            lines.append(f"{name:<20} {m['precision']:<8.4f} {m['recall']:<8.4f} "
                         f"{m['f1']:<8.4f} {m['support']:<10}")

        lines.append("=" * 70)
        return "\n".join(lines)


# Example usage
def demonstrate_evaluator():
    np.random.seed(42)

    # Simulate imbalanced multi-class problem
    y_true = np.concatenate([
        np.zeros(800),
        np.ones(150),
        np.full(50, 2)
    ]).astype(int)

    y_pred = y_true.copy()
    # Add errors
    y_pred[:80] = 1        # Some class 0 predicted as 1
    y_pred[800:820] = 0    # Some class 1 predicted as 0
    y_pred[950:] = 1       # All class 2 predicted as 1

    evaluator = MultiClassEvaluator(
        class_names={0: 'Negative', 1: 'Neutral', 2: 'Positive'}
    )

    report = evaluator.evaluate(y_true, y_pred)
    print(evaluator.format_report(report))

    return report


if __name__ == "__main__":
    demonstrate_evaluator()
```

When a class receives no predictions (TP + FP = 0), its precision is undefined. When a class has no true samples (TP + FN = 0), its recall is undefined. The zero_division parameter controls behavior in these cases:
• zero_division=0: Return 0 (conservative, may lower averages)
• zero_division=1: Return 1 (optimistic, treats undefined as perfect)
• zero_division='warn': Return 0 but emit a warning
For macro averaging, undefined per-class metrics significantly impact the average. Document your choice.
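A small sketch (with assumed toy labels in which one class is never predicted) shows how much the zero_division choice can move a macro average:

```python
import numpy as np
from sklearn.metrics import precision_score

# Class 2 exists in y_true but is never predicted, so its precision is
# undefined (TP + FP = 0); labels are assumed purely for illustration.
y_true = np.array([0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 0, 1])

for zd in (0, 1):
    macro_p = precision_score(y_true, y_pred, average='macro', zero_division=zd)
    print(f"zero_division={zd}: macro precision = {macro_p:.3f}")
# With zero_division=0 the undefined class drags the macro average down (~0.39);
# with zero_division=1 it inflates the average instead (~0.72).
```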
We've explored the two fundamental approaches to aggregating multi-class metrics. Let's consolidate the key insights:
• Macro averaging gives every class equal weight; it exposes poor minority-class performance but can be swayed by very small classes.
• Micro averaging gives every instance equal weight; in single-label classification it equals accuracy and is dominated by majority classes.
• The micro-macro gap is itself a diagnostic: a large gap signals class imbalance combined with uneven per-class performance.
• Neither choice is neutral—each encodes an implicit weighting scheme—so report both, alongside per-class metrics, whenever possible.
What's next:
Micro and macro averaging represent the extremes of uniform weighting (by instance vs. by class). In the next page, we'll explore weighted averaging, which provides a middle ground by allowing explicit, customized importance weights for each class. This approach enables you to align evaluation metrics directly with business priorities.
You now understand the mathematical foundations, practical implications, and decision frameworks for micro and macro averaging. You can analyze when these methods diverge, diagnose what the divergence indicates, and make principled choices about which averaging strategy fits your application context.