When evaluating binary classifiers, metrics like precision, recall, and F1-score have unambiguous definitions. But what happens when your classifier must distinguish between three, ten, or even thousands of classes? How do you summarize model performance across diverse class categories into a single, interpretable number?
This is the fundamental challenge of multi-class evaluation, and the answer lies in averaging strategies. The choice between micro and macro averaging isn't merely technical—it fundamentally changes what your metric measures, whose errors matter most, and what optimization objectives your model ultimately pursues.
By the end of this page, you will understand the mathematical foundations of micro and macro averaging, recognize when each approach is appropriate, diagnose situations where the two methods yield dramatically different results, and make principled decisions about evaluation strategy for real-world multi-class problems.
Consider a sentiment classification system for product reviews that distinguishes between Positive, Neutral, and Negative sentiments. After evaluation on a test set, you obtain the following per-class metrics:
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Positive | 0.92 | 0.88 | 0.90 | 4,500 |
| Neutral | 0.67 | 0.72 | 0.69 | 500 |
| Negative | 0.85 | 0.79 | 0.82 | 1,000 |
Now, your stakeholder asks: "What's the model's overall precision?"
This seemingly simple question has no single correct answer. The three per-class precision values (0.92, 0.67, 0.85) must somehow be combined, but the method of combination encodes assumptions about what matters.
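To make the ambiguity concrete, here is a small sketch that derives several candidate "overall" numbers directly from the table above. The accuracy line assumes the single-label setting, where total correct predictions equal recall × support summed over classes; everything else is simple arithmetic on the reported values.

```python
# Per-class numbers copied from the table above
precision = {"Positive": 0.92, "Neutral": 0.67, "Negative": 0.85}
recall    = {"Positive": 0.88, "Neutral": 0.72, "Negative": 0.79}
support   = {"Positive": 4500, "Neutral": 500,  "Negative": 1000}

# Unweighted mean over classes (macro view: every class counts equally)
macro_precision = sum(precision.values()) / len(precision)

# Support-weighted mean (classes weighted by how often they occur)
total = sum(support.values())
weighted_precision = sum(precision[c] * support[c] for c in precision) / total

# Overall accuracy, recovered from recall * support = true positives per class
accuracy = sum(recall[c] * support[c] for c in recall) / total

print(f"Macro precision:    {macro_precision:.3f}")    # ~0.813
print(f"Weighted precision: {weighted_precision:.3f}")  # ~0.887
print(f"Accuracy:           {accuracy:.3f}")            # ~0.852
```

Three defensible answers to the same question, differing only in the implicit weighting.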
Every averaging strategy embeds an implicit weighting scheme. There is no 'neutral' or 'default' choice—each method prioritizes different aspects of performance. Understanding these implicit weights is crucial for making informed evaluation decisions.
The fundamental tension:
Instance-level view: Each prediction matters equally, regardless of class. A correct prediction on a rare class is worth exactly as much as a correct prediction on a common class.
Class-level view: Each class matters equally, regardless of frequency. Performance on minority classes should influence the overall metric as much as performance on majority classes.
Micro and macro averaging represent these two philosophies respectively, and choosing between them is a modeling decision with real consequences.
Macro averaging treats all classes as equal citizens. It computes the metric independently for each class, then takes a simple (unweighted) arithmetic mean across classes. This approach gives every class exactly the same influence on the final score, regardless of how many samples belong to that class.
For a metric M (such as precision, recall, or F1) across K classes:
Macro-M = (1/K) × Σᵢ Mᵢ
where Mᵢ is the metric computed for class i using only predictions and labels involving class i.
Computing macro-averaged metrics step by step:
Step 1: Compute per-class confusion matrix entries
For each class i, count TPᵢ (samples of class i predicted as i), FPᵢ (samples of other classes predicted as i), and FNᵢ (samples of class i predicted as some other class).
Step 2: Compute per-class metrics
For each class i: Precisionᵢ = TPᵢ / (TPᵢ + FPᵢ), Recallᵢ = TPᵢ / (TPᵢ + FNᵢ), and F1ᵢ = 2 × Precisionᵢ × Recallᵢ / (Precisionᵢ + Recallᵢ).
Step 3: Average across classes
Take the unweighted arithmetic mean of each per-class metric, as in the Macro-M formula above.
```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support
from typing import Tuple, List


def compute_macro_metrics(
    y_true: np.ndarray,
    y_pred: np.ndarray
) -> Tuple[float, float, float]:
    """
    Compute macro-averaged precision, recall, and F1 score.

    Macro averaging gives equal weight to each class, regardless of
    the number of samples in that class. This is particularly useful
    when class balance in the test set doesn't reflect production
    importance, or when minority class performance matters equally.

    Parameters
    ----------
    y_true : array-like of shape (n_samples,)
        Ground truth class labels
    y_pred : array-like of shape (n_samples,)
        Predicted class labels

    Returns
    -------
    macro_precision : float
        Macro-averaged precision
    macro_recall : float
        Macro-averaged recall
    macro_f1 : float
        Macro-averaged F1 score

    Mathematical Foundation
    -----------------------
    For K classes:
        Macro-M = (1/K) * sum(M_i for i in range(K))
    where M_i is the metric computed for class i.
    """
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average='macro', zero_division=0
    )
    return precision, recall, f1


def compute_macro_metrics_manual(
    y_true: np.ndarray,
    y_pred: np.ndarray
) -> Tuple[float, float, float, dict]:
    """
    Manual implementation of macro averaging for educational purposes.

    This function explicitly shows the per-class computation and
    averaging process, making the mathematical structure transparent.
    """
    classes = np.unique(np.concatenate([y_true, y_pred]))
    n_classes = len(classes)

    per_class_metrics = {}

    for cls in classes:
        # Binary mask for current class
        true_positive = np.sum((y_pred == cls) & (y_true == cls))
        false_positive = np.sum((y_pred == cls) & (y_true != cls))
        false_negative = np.sum((y_pred != cls) & (y_true == cls))

        # Compute per-class metrics with zero division handling
        precision = true_positive / (true_positive + false_positive) \
            if (true_positive + false_positive) > 0 else 0.0
        recall = true_positive / (true_positive + false_negative) \
            if (true_positive + false_negative) > 0 else 0.0
        f1 = 2 * precision * recall / (precision + recall) \
            if (precision + recall) > 0 else 0.0

        per_class_metrics[cls] = {
            'precision': precision,
            'recall': recall,
            'f1': f1,
            'support': np.sum(y_true == cls)
        }

    # Simple arithmetic mean across classes
    macro_precision = np.mean([m['precision'] for m in per_class_metrics.values()])
    macro_recall = np.mean([m['recall'] for m in per_class_metrics.values()])
    macro_f1 = np.mean([m['f1'] for m in per_class_metrics.values()])

    return macro_precision, macro_recall, macro_f1, per_class_metrics


# Example: Demonstrating macro averaging
def demonstrate_macro_averaging():
    """
    Demonstration showing how macro averaging handles imbalanced data.
    """
    # Simulated predictions: highly imbalanced classes
    # Class 0: 900 samples, Class 1: 80 samples, Class 2: 20 samples
    np.random.seed(42)

    # Ground truth
    y_true = np.array(
        [0] * 900 + [1] * 80 + [2] * 20
    )

    # Predictions: model performs well on majority, poorly on minority
    y_pred = np.array(
        [0] * 850 + [1] * 30 + [2] * 20 +   # Class 0: 850/900 correct
        [1] * 50 + [0] * 20 + [2] * 10 +    # Class 1: 50/80 correct
        [2] * 5 + [0] * 10 + [1] * 5        # Class 2: 5/20 correct
    )

    # Compute metrics
    macro_p, macro_r, macro_f1, per_class = compute_macro_metrics_manual(
        y_true, y_pred
    )

    print("Per-class metrics:")
    print("-" * 60)
    for cls, metrics in per_class.items():
        print(f"Class {cls}: P={metrics['precision']:.3f}, "
              f"R={metrics['recall']:.3f}, F1={metrics['f1']:.3f}, "
              f"Support={metrics['support']}")
    print("-" * 60)
    print(f"Macro-averaged: P={macro_p:.3f}, R={macro_r:.3f}, F1={macro_f1:.3f}")

    return per_class


if __name__ == "__main__":
    demonstrate_macro_averaging()
```

Use macro averaging when:
• Every class is equally important regardless of frequency (e.g., disease classification where rare diseases matter as much as common ones)
• Test set imbalance doesn't reflect production importance (e.g., your test set over-represents certain classes)
• You want to penalize poor minority class performance explicitly
• Fairness across groups is a priority (e.g., demographic categories in credit scoring)
Micro averaging aggregates contributions from all classes to compute a global metric. Rather than computing per-class metrics and then averaging, it pools all true positives, false positives, and false negatives across classes, then computes the metric on these aggregated counts.
This approach treats each instance equally—a correct prediction is worth exactly the same regardless of which class it belongs to.
For precision, recall, and F1:
Micro-Precision = Σᵢ TPᵢ / Σᵢ (TPᵢ + FPᵢ)
Micro-Recall = Σᵢ TPᵢ / Σᵢ (TPᵢ + FNᵢ)
Micro-F1 = 2 × (Micro-P × Micro-R) / (Micro-P + Micro-R)
where the sums are over all K classes.
Critical insight: Micro-averaged metrics are dominated by majority classes
Because micro averaging pools counts across all classes, classes with more samples contribute proportionally more to the aggregated counts. If 90% of your test set belongs to class A, then class A's performance will account for approximately 90% of the micro-averaged metric.
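A quick back-of-the-envelope example (with assumed counts and recalls, not taken from the tables above) shows the effect:

```python
# Class A: 900 samples at 95% recall; class B: 100 samples at 50% recall
tp_a, n_a = int(0.95 * 900), 900   # 855 correct
tp_b, n_b = int(0.50 * 100), 100   # 50 correct

micro_recall = (tp_a + tp_b) / (n_a + n_b)     # (855 + 50) / 1000 = 0.905
macro_recall = (tp_a / n_a + tp_b / n_b) / 2   # (0.95 + 0.50) / 2 = 0.725

print(micro_recall, macro_recall)
```

The micro figure sits close to class A's performance simply because class A supplies 90% of the pooled counts.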
The equivalence with accuracy:
In multi-class classification (where each sample has exactly one true label and one predicted label), an important result emerges:
Micro-Precision = Micro-Recall = Micro-F1 = Accuracy
This is because, in the single-label setting, every misclassified sample generates exactly one false positive (for the predicted class) and one false negative (for the true class), so Σᵢ FPᵢ = Σᵢ FNᵢ equals the number of errors, while Σᵢ TPᵢ equals the number of correct predictions.
And since each sample contributes exactly once—to the pooled TP count if correct, or to both the pooled FP and FN counts if incorrect—we get Micro-Precision = Micro-Recall = Σᵢ TPᵢ / N = Accuracy, where N is the total number of samples, and therefore Micro-F1 = Accuracy as well.
```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
from typing import Tuple, Dict


def compute_micro_metrics(
    y_true: np.ndarray,
    y_pred: np.ndarray
) -> Tuple[float, float, float]:
    """
    Compute micro-averaged precision, recall, and F1 score.

    Micro averaging pools all samples together, treating each instance
    equally regardless of class. This gives a global measure of model
    performance weighted by class frequency.

    In single-label multi-class classification:
        Micro-Precision = Micro-Recall = Micro-F1 = Accuracy

    Parameters
    ----------
    y_true : array-like of shape (n_samples,)
        Ground truth class labels
    y_pred : array-like of shape (n_samples,)
        Predicted class labels

    Returns
    -------
    micro_precision : float
        Micro-averaged precision
    micro_recall : float
        Micro-averaged recall
    micro_f1 : float
        Micro-averaged F1 score
    """
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average='micro', zero_division=0
    )
    return precision, recall, f1


def compute_micro_metrics_manual(
    y_true: np.ndarray,
    y_pred: np.ndarray
) -> Tuple[float, float, float, Dict]:
    """
    Manual implementation of micro averaging for educational purposes.

    This function explicitly shows the pooling of counts across classes,
    demonstrating why micro metrics are dominated by majority classes.
    """
    classes = np.unique(np.concatenate([y_true, y_pred]))

    # Aggregate counts across all classes
    total_tp = 0
    total_fp = 0
    total_fn = 0

    class_details = {}

    for cls in classes:
        tp = np.sum((y_pred == cls) & (y_true == cls))
        fp = np.sum((y_pred == cls) & (y_true != cls))
        fn = np.sum((y_pred != cls) & (y_true == cls))

        total_tp += tp
        total_fp += fp
        total_fn += fn

        class_details[cls] = {'tp': tp, 'fp': fp, 'fn': fn}

    # Compute metrics from aggregated counts
    micro_precision = total_tp / (total_tp + total_fp) \
        if (total_tp + total_fp) > 0 else 0.0
    micro_recall = total_tp / (total_tp + total_fn) \
        if (total_tp + total_fn) > 0 else 0.0
    micro_f1 = 2 * micro_precision * micro_recall / \
        (micro_precision + micro_recall) \
        if (micro_precision + micro_recall) > 0 else 0.0

    aggregated = {
        'total_tp': total_tp,
        'total_fp': total_fp,
        'total_fn': total_fn,
        'per_class': class_details
    }

    return micro_precision, micro_recall, micro_f1, aggregated


def demonstrate_micro_accuracy_equivalence():
    """
    Demonstrate that micro-F1 equals accuracy in single-label classification.
    """
    np.random.seed(42)

    # Generate random multi-class predictions
    n_samples = 1000
    n_classes = 5

    y_true = np.random.randint(0, n_classes, n_samples)
    y_pred = y_true.copy()

    # Introduce some errors
    error_idx = np.random.choice(n_samples, 200, replace=False)
    y_pred[error_idx] = np.random.randint(0, n_classes, len(error_idx))

    # Compute metrics
    accuracy = accuracy_score(y_true, y_pred)
    micro_p, micro_r, micro_f1, _ = compute_micro_metrics_manual(y_true, y_pred)

    print("Demonstrating Micro-F1 = Accuracy equivalence:")
    print("-" * 50)
    print(f"Accuracy:        {accuracy:.6f}")
    print(f"Micro-Precision: {micro_p:.6f}")
    print(f"Micro-Recall:    {micro_r:.6f}")
    print(f"Micro-F1:        {micro_f1:.6f}")
    print("-" * 50)
    print(f"Difference (Micro-F1 - Accuracy): {abs(micro_f1 - accuracy):.10f}")

    return accuracy, micro_f1


def show_majority_class_dominance():
    """
    Illustrate how micro averaging is dominated by majority classes.
    """
    # Extreme imbalance: 950 samples from class 0, 25 each from classes 1 and 2
    y_true = np.array([0] * 950 + [1] * 25 + [2] * 25)

    # Model: perfect on class 0, terrible on minority classes
    y_pred = np.array(
        [0] * 950 +   # Class 0: 100% correct
        [0] * 25 +    # Class 1: 0% correct (all predicted as 0)
        [0] * 25      # Class 2: 0% correct (all predicted as 0)
    )

    # Compute both averaging methods
    micro_p, micro_r, micro_f1, _ = compute_micro_metrics_manual(y_true, y_pred)
    # Macro metrics via scikit-learn here; the manual macro implementation
    # lives in the earlier macro-averaging example
    macro_p, macro_r, macro_f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average='macro', zero_division=0
    )

    print("Majority Class Dominance in Micro Averaging:")
    print("=" * 60)
    print("Scenario: 95% class 0, 2.5% class 1, 2.5% class 2")
    print("Model: Perfect on class 0, completely fails on classes 1 and 2")
    print("-" * 60)
    print("Micro-averaged metrics (dominated by class 0):")
    print(f"  Precision: {micro_p:.3f}")
    print(f"  Recall:    {micro_r:.3f}")
    print(f"  F1:        {micro_f1:.3f}")
    print("-" * 60)
    print("Macro-averaged metrics (equal weight to all classes):")
    print(f"  Precision: {macro_p:.3f}")
    print(f"  Recall:    {macro_r:.3f}")
    print(f"  F1:        {macro_f1:.3f}")
    print("-" * 60)
    print("Observation: Micro shows 0.95 F1 (looks great!)")
    print("             Macro shows 0.32 F1 (reveals minority class failure)")


if __name__ == "__main__":
    demonstrate_micro_accuracy_equivalence()
    show_majority_class_dominance()
```

Use micro averaging when:
• Class frequency in the test set reflects production importance (e.g., high-volume classifications where common cases matter most)
• Overall throughput accuracy matters (e.g., spam filtering where total correct classifications drive user experience)
• You want an interpretable 'overall accuracy' equivalent for stakeholder communication
• Multi-label classification where samples can have multiple labels (micro averaging is more meaningful here)
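For the multi-label case mentioned in the last bullet, a minimal sketch with toy indicator matrices (all values assumed for illustration) shows that micro and macro F1 remain well defined, while micro-F1 no longer collapses to accuracy because each label decision is pooled individually:

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy multi-label data in indicator format: 4 samples, 3 labels (assumed values)
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0],
                   [0, 0, 1]])

print("Micro F1:", f1_score(y_true, y_pred, average='micro', zero_division=0))  # 0.80
print("Macro F1:", f1_score(y_true, y_pred, average='macro', zero_division=0))  # ~0.78
# Exact-match (subset) accuracy here is only 0.50, so micro-F1 != accuracy
```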
The difference between micro and macro averaging becomes most pronounced when class distributions are imbalanced and model performance varies across classes. Understanding when and why these methods diverge is essential for proper model evaluation.
| Property | Micro Averaging | Macro Averaging |
|---|---|---|
| Weighting scheme | Implicit weight by class frequency | Equal weight for all classes |
| Sensitivity to class imbalance | Dominated by majority classes | Equally sensitive to all classes |
| Relationship to accuracy | Equal to accuracy (single-label) | Not directly related to accuracy |
| Minority class influence | Minimal if minority is small | Equal influence regardless of size |
| Interpretability | 'Overall' performance | 'Average class' performance |
| Use in imbalanced data | May hide minority class failures | Reveals minority class issues |
| Computational complexity | O(n) single pass possible | O(n × K) or O(n) with per-class tracking |
Case study: Divergence in practice
Consider a medical diagnosis system classifying X-ray images into three categories: Normal (the large majority of cases), Condition A (uncommon), and Condition B (rare).
A model achieves excellent per-class F1 on Normal but much weaker F1 on the two rarer conditions, with Condition B detection the weakest—and the aggregated scores reflect this asymmetry very differently:
Micro-averaged F1: Approximately 0.92 (dominated by Normal's excellent performance)
Macro-averaged F1: Approximately 0.65 (reveals poor minority class performance)
Which metric tells the true story? It depends on the application context:
If the goal is to triage patients efficiently (most are healthy), the micro metric suggests the system works well overall.
If missing any disease is critical (false negatives are dangerous), the macro metric reveals that Condition B detection is severely inadequate.
A deployment decision based solely on micro-F1 might approve a dangerous system; macro-F1 would flag the issue.
A model optimizing for micro-averaged metrics can achieve high scores by focusing exclusively on majority classes while ignoring minorities. Always examine per-class metrics alongside aggregated metrics to detect this failure mode.
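In practice, scikit-learn's classification_report shows both views in one call—per-class rows plus micro ("accuracy"), macro, and weighted averages. The sketch below uses made-up predictions for an imbalanced three-class problem:

```python
import numpy as np
from sklearn.metrics import classification_report

# Hypothetical imbalanced ground truth: 900 / 80 / 20 samples per class
y_true = np.array([0] * 900 + [1] * 80 + [2] * 20)

# Model that is strong on class 0, weak on classes 1 and 2 (assumed errors)
y_pred = y_true.copy()
y_pred[880:900] = 1   # some class-0 samples misclassified as 1
y_pred[900:940] = 0   # half of class 1 misclassified as 0
y_pred[980:995] = 0   # most of class 2 misclassified as 0

# Per-class rows make the minority-class failure impossible to miss
print(classification_report(y_true, y_pred, digits=3, zero_division=0))
```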
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_fscore_support


def analyze_metric_divergence(
    max_imbalance_ratio: int = 100,
    majority_performance: float = 0.95,
    minority_performance: float = 0.50,
    n_minority_samples: int = 100
) -> dict:
    """
    Analyze how micro and macro metrics diverge as class imbalance increases.

    This function systematically varies the imbalance ratio while keeping
    per-class performance constant, showing how micro averaging becomes
    increasingly insensitive to minority class failures.

    Parameters
    ----------
    max_imbalance_ratio : int
        Maximum ratio of majority to minority class samples
    majority_performance : float
        Accuracy on majority class (held constant)
    minority_performance : float
        Accuracy on minority class (held constant)
    n_minority_samples : int
        Fixed number of minority class samples

    Returns
    -------
    results : dict
        Contains imbalance ratios and corresponding metric values
    """
    imbalance_ratios = list(range(1, max_imbalance_ratio + 1, 5))
    micro_f1_scores = []
    macro_f1_scores = []

    for ratio in imbalance_ratios:
        n_majority = n_minority_samples * ratio

        # Generate ground truth
        y_true = np.array([0] * n_majority + [1] * n_minority_samples)

        # Generate predictions with fixed per-class performance
        n_correct_majority = int(n_majority * majority_performance)
        n_correct_minority = int(n_minority_samples * minority_performance)

        y_pred_majority = np.array(
            [0] * n_correct_majority + [1] * (n_majority - n_correct_majority)
        )
        y_pred_minority = np.array(
            [1] * n_correct_minority + [0] * (n_minority_samples - n_correct_minority)
        )
        y_pred = np.concatenate([y_pred_majority, y_pred_minority])

        # Compute metrics
        _, _, micro_f1, _ = precision_recall_fscore_support(
            y_true, y_pred, average='micro', zero_division=0
        )
        _, _, macro_f1, _ = precision_recall_fscore_support(
            y_true, y_pred, average='macro', zero_division=0
        )

        micro_f1_scores.append(micro_f1)
        macro_f1_scores.append(macro_f1)

    results = {
        'imbalance_ratios': imbalance_ratios,
        'micro_f1': micro_f1_scores,
        'macro_f1': macro_f1_scores,
        'divergence': [m - M for m, M in zip(micro_f1_scores, macro_f1_scores)]
    }

    return results


def visualize_divergence(results: dict):
    """
    Visualize the divergence between micro and macro averaging.
    """
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Plot 1: Metric values
    ax1 = axes[0]
    ax1.plot(results['imbalance_ratios'], results['micro_f1'],
             'b-', linewidth=2, label='Micro F1')
    ax1.plot(results['imbalance_ratios'], results['macro_f1'],
             'r-', linewidth=2, label='Macro F1')
    ax1.set_xlabel('Imbalance Ratio (Majority/Minority)', fontsize=12)
    ax1.set_ylabel('F1 Score', fontsize=12)
    ax1.set_title('Metric Values vs Class Imbalance', fontsize=14)
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    ax1.set_ylim(0, 1)

    # Plot 2: Divergence
    ax2 = axes[1]
    ax2.fill_between(results['imbalance_ratios'], 0, results['divergence'],
                     alpha=0.3, color='purple')
    ax2.plot(results['imbalance_ratios'], results['divergence'],
             'purple', linewidth=2)
    ax2.set_xlabel('Imbalance Ratio (Majority/Minority)', fontsize=12)
    ax2.set_ylabel('Micro F1 - Macro F1', fontsize=12)
    ax2.set_title('Divergence Between Averaging Methods', fontsize=14)
    ax2.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.savefig('metric_divergence.png', dpi=150, bbox_inches='tight')
    plt.show()

    return fig


if __name__ == "__main__":
    results = analyze_metric_divergence(
        max_imbalance_ratio=100,
        majority_performance=0.95,
        minority_performance=0.50
    )

    print("Divergence Analysis Results:")
    print("-" * 50)
    for ratio, micro, macro in zip(
        results['imbalance_ratios'][:10],
        results['micro_f1'][:10],
        results['macro_f1'][:10]
    ):
        print(f"Ratio {ratio:3d}:1 | Micro: {micro:.3f} | Macro: {macro:.3f} | "
              f"Gap: {micro - macro:.3f}")

    visualize_divergence(results)
```

Understanding the mathematical relationships between micro and macro averaging provides deeper insight into their behavior and helps predict when they will diverge significantly.
Bound relationship: Macro-averaged metrics are bounded by the range of per-class metrics:
min(Mᵢ) ≤ Macro-M ≤ max(Mᵢ)
Weighted representation: Micro averaging can be expressed as a weighted macro average:
Micro-M = Σᵢ wᵢ × Mᵢ
where wᵢ = (TPᵢ + FPᵢ) / Σⱼ(TPⱼ + FPⱼ) for precision, and wᵢ = (TPᵢ + FNᵢ) / Σⱼ(TPⱼ + FNⱼ) for recall.
Equality condition: Micro = Macro when every class has the same per-class metric value, or when the implicit micro weights are uniform (wᵢ = 1/K for all i)—which for recall corresponds to equal class supports, and for precision to equal numbers of predictions per class.
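The weighted representation is easy to verify numerically. The following sketch, using assumed toy counts, recovers micro precision as a prediction-count-weighted sum of per-class precisions:

```python
import numpy as np
from sklearn.metrics import precision_score

# Toy 3-class predictions (assumed data) to check the identity numerically
y_true = np.array([0] * 50 + [1] * 30 + [2] * 20)
y_pred = np.concatenate([
    np.array([0] * 40 + [1] * 6 + [2] * 4),    # samples whose true class is 0
    np.array([1] * 21 + [0] * 6 + [2] * 3),    # samples whose true class is 1
    np.array([2] * 12 + [0] * 5 + [1] * 3),    # samples whose true class is 2
])

classes = np.unique(np.concatenate([y_true, y_pred]))
per_class_p = precision_score(y_true, y_pred, average=None, zero_division=0)

# Implicit micro weights for precision: each class's share of the *predictions*
pred_counts = np.array([(y_pred == c).sum() for c in classes])
w = pred_counts / pred_counts.sum()

micro_p = precision_score(y_true, y_pred, average='micro', zero_division=0)
print("Weighted sum of per-class precisions:", np.sum(w * per_class_p))
print("Micro precision:                     ", micro_p)  # identical values
```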
Analyzing the deviation:
The gap between micro and macro averaging can be decomposed:
Micro - Macro = Σᵢ (wᵢ - 1/K) × Mᵢ
where:
• wᵢ is the implicit micro weight of class i (its share of the pooled counts)
• 1/K is the uniform weight that macro averaging assigns to every class
• Mᵢ is the metric value for class i
This reveals that the gap depends on two factors: how far the class weights deviate from uniform (the degree of imbalance), and whether the heavily weighted classes perform better or worse than the lightly weighted ones.
Maximum divergence occurs when: classes are highly imbalanced and performance is strongly correlated with class size—for example, a dominant class with excellent metrics alongside small classes with poor metrics.
Minimum divergence occurs when: classes are roughly balanced (all wᵢ ≈ 1/K) or per-class performance is nearly uniform, so the choice of weights barely matters.
```python
import numpy as np
from typing import List, Tuple


def decompose_micro_macro_gap(
    per_class_metrics: List[float],
    per_class_weights: List[float]
) -> Tuple[float, float, float, np.ndarray]:
    """
    Decompose the gap between micro and macro averaging.

    The gap can be written as:
        Micro - Macro = sum_i (w_i - 1/K) * M_i

    This function analyzes the contribution of each class to the gap.

    Parameters
    ----------
    per_class_metrics : list of float
        Metric value for each class (e.g., precision or F1)
    per_class_weights : list of float
        Implicit weight of each class in micro averaging

    Returns
    -------
    micro : float
        Micro-averaged metric
    macro : float
        Macro-averaged metric
    gap : float
        Difference (micro - macro)
    contributions : ndarray
        Per-class contribution to the gap
    """
    K = len(per_class_metrics)
    M = np.array(per_class_metrics)
    W = np.array(per_class_weights)

    # Normalize weights to sum to 1
    W = W / W.sum()

    # Compute averages
    micro = np.sum(W * M)
    macro = np.mean(M)

    # Decompose gap
    weight_deviations = W - 1 / K
    contributions = weight_deviations * M
    gap = np.sum(contributions)

    return micro, macro, gap, contributions


def analyze_gap_contributors():
    """
    Illustrate which classes contribute to micro-macro divergence.
    """
    # Example: 5 classes with varying sizes and performances
    per_class_f1 = [0.95, 0.88, 0.72, 0.55, 0.40]     # Performance
    per_class_size = [10000, 5000, 2000, 500, 100]    # Class sizes

    # Weights are proportional to size (simplified for F1)
    weights = np.array(per_class_size, dtype=float)

    micro, macro, gap, contributions = decompose_micro_macro_gap(
        per_class_f1, weights
    )

    print("Gap Decomposition Analysis:")
    print("=" * 65)
    print(f"{'Class':<8} {'Size':<10} {'Weight':<10} {'F1':<8} {'Contribution':<12}")
    print("-" * 65)

    norm_weights = weights / weights.sum()
    for i, (f1, size, w, c) in enumerate(zip(
        per_class_f1, per_class_size, norm_weights, contributions
    )):
        print(f"{i:<8} {size:<10} {w:<10.4f} {f1:<8.3f} {c:+.4f}")

    print("-" * 65)
    print(f"Micro F1: {micro:.4f}")
    print(f"Macro F1: {macro:.4f}")
    print(f"Gap (Micro - Macro): {gap:+.4f}")
    print("-" * 65)
    print("\nInterpretation:")
    print(f"  - Largest positive contribution comes from class {np.argmax(contributions)}")
    print(f"  - Largest negative contribution comes from class {np.argmin(contributions)}")
    print("  - Micro > Macro because large classes have high performance")


def prove_micro_bounds():
    """
    Demonstrate that micro averaging is bounded by weighted extremes.
    """
    # For any set of per-class metrics and weights
    np.random.seed(42)

    n_trials = 1000
    bounds_satisfied = 0

    for _ in range(n_trials):
        K = np.random.randint(3, 10)
        weights = np.random.dirichlet(np.ones(K))
        metrics = np.random.uniform(0, 1, K)

        micro = np.sum(weights * metrics)

        # Check bounds
        lower_bound = np.min(metrics)
        upper_bound = np.max(metrics)

        if lower_bound <= micro <= upper_bound:
            bounds_satisfied += 1

    print(f"\nBounds verification ({n_trials} trials):")
    print(f"  Micro within [min(M_i), max(M_i)]: {bounds_satisfied}/{n_trials} "
          f"({100 * bounds_satisfied / n_trials:.1f}%)")

    return bounds_satisfied == n_trials


if __name__ == "__main__":
    analyze_gap_contributors()
    prove_micro_bounds()
```

Choosing between micro and macro averaging requires understanding your domain context, stakeholder requirements, and the specific failure modes you need to detect. Here's a structured framework for making this decision.
In academic papers and internal reports, always report both micro and macro metrics. The gap between them is itself informative—a large gap indicates either class imbalance, performance variation across classes, or both. Report per-class metrics for complete transparency.
| Use Case | Recommended Method | Rationale |
|---|---|---|
| Spam detection | Micro | Total throughput matters; rare spam is less critical |
| Medical diagnosis | Macro | Each disease equally important regardless of prevalence |
| Document classification | Depends | Topic distribution may or may not match importance |
| Sentiment analysis | Macro | Usually want balanced detection of all sentiments |
| Multi-label tagging | Micro | Per-label accuracy is the natural unit |
| Fraud detection | Macro or weighted | Rare fraud class must not be ignored |
| Named entity recognition | Macro | All entity types should be detected well |
| Image classification (balanced) | Either | When balanced, they approximate each other |
Correct implementation of averaging strategies requires attention to edge cases, proper handling of undefined metrics, and integration with existing evaluation pipelines.
```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support
from typing import Dict, Any, Optional


class MultiClassEvaluator:
    """
    Production-grade multi-class evaluation with comprehensive averaging.

    This class provides a complete evaluation framework that computes all
    averaging methods, handles edge cases, and produces stakeholder-
    friendly reports.
    """

    def __init__(
        self,
        class_names: Optional[Dict[int, str]] = None,
        zero_division: float = 0.0
    ):
        """
        Initialize the evaluator.

        Parameters
        ----------
        class_names : dict, optional
            Mapping from class index to human-readable name
        zero_division : float
            Value to return when a metric is undefined (division by zero)
        """
        self.class_names = class_names or {}
        self.zero_division = zero_division

    def evaluate(
        self,
        y_true: np.ndarray,
        y_pred: np.ndarray,
        y_prob: Optional[np.ndarray] = None
    ) -> Dict[str, Any]:
        """
        Perform comprehensive multi-class evaluation.

        Returns metrics at all levels: per-class, micro, macro, and weighted.

        Parameters
        ----------
        y_true : ndarray
            Ground truth labels
        y_pred : ndarray
            Predicted labels
        y_prob : ndarray, optional
            Predicted probabilities for each class

        Returns
        -------
        report : dict
            Complete evaluation report
        """
        classes = np.unique(np.concatenate([y_true, y_pred]))
        n_classes = len(classes)
        n_samples = len(y_true)

        # Per-class metrics
        per_class = {}
        for cls in classes:
            name = self.class_names.get(cls, f"Class_{cls}")

            tp = np.sum((y_pred == cls) & (y_true == cls))
            fp = np.sum((y_pred == cls) & (y_true != cls))
            fn = np.sum((y_pred != cls) & (y_true == cls))
            support = np.sum(y_true == cls)

            precision = self._safe_divide(tp, tp + fp)
            recall = self._safe_divide(tp, tp + fn)
            f1 = self._safe_divide(2 * precision * recall, precision + recall)

            per_class[name] = {
                'precision': precision,
                'recall': recall,
                'f1': f1,
                'support': int(support),
                'true_positives': int(tp),
                'false_positives': int(fp),
                'false_negatives': int(fn)
            }

        # Aggregated metrics
        micro_p, micro_r, micro_f1, _ = precision_recall_fscore_support(
            y_true, y_pred, average='micro', zero_division=self.zero_division
        )
        macro_p, macro_r, macro_f1, _ = precision_recall_fscore_support(
            y_true, y_pred, average='macro', zero_division=self.zero_division
        )
        weighted_p, weighted_r, weighted_f1, _ = precision_recall_fscore_support(
            y_true, y_pred, average='weighted', zero_division=self.zero_division
        )

        accuracy = np.mean(y_true == y_pred)

        # Compute gap as diagnostic
        micro_macro_gap = micro_f1 - macro_f1

        report = {
            'summary': {
                'n_samples': n_samples,
                'n_classes': n_classes,
                'accuracy': float(accuracy),
                'micro_macro_gap': float(micro_macro_gap),
                'class_balance': self._compute_balance_metrics(y_true, classes)
            },
            'micro': {
                'precision': float(micro_p),
                'recall': float(micro_r),
                'f1': float(micro_f1)
            },
            'macro': {
                'precision': float(macro_p),
                'recall': float(macro_r),
                'f1': float(macro_f1)
            },
            'weighted': {
                'precision': float(weighted_p),
                'recall': float(weighted_r),
                'f1': float(weighted_f1)
            },
            'per_class': per_class
        }

        return report

    def _safe_divide(self, numerator: float, denominator: float) -> float:
        """Safe division with zero handling."""
        return numerator / denominator if denominator > 0 else self.zero_division

    def _compute_balance_metrics(
        self,
        y_true: np.ndarray,
        classes: np.ndarray
    ) -> Dict[str, float]:
        """Compute class balance statistics."""
        counts = [np.sum(y_true == c) for c in classes]
        total = len(y_true)

        imbalance_ratio = max(counts) / min(counts) if min(counts) > 0 else float('inf')

        # Gini impurity of the class distribution (1 - sum of squared proportions)
        proportions = np.array(counts) / total
        gini = 1 - np.sum(proportions ** 2)

        return {
            'imbalance_ratio': float(imbalance_ratio),
            'normalized_entropy': float(self._normalized_entropy(proportions)),
            'gini_coefficient': float(gini)
        }

    def _normalized_entropy(self, proportions: np.ndarray) -> float:
        """Compute normalized entropy of class distribution."""
        proportions = proportions[proportions > 0]  # Avoid log(0)
        entropy = -np.sum(proportions * np.log(proportions))
        max_entropy = np.log(len(proportions))
        return entropy / max_entropy if max_entropy > 0 else 0.0

    def format_report(self, report: Dict[str, Any]) -> str:
        """Format report as human-readable string."""
        lines = []
        lines.append("=" * 70)
        lines.append("MULTI-CLASS CLASSIFICATION REPORT")
        lines.append("=" * 70)

        # Summary
        s = report['summary']
        lines.append(f"\nSamples: {s['n_samples']:,} | Classes: {s['n_classes']}")
        lines.append(f"Accuracy: {s['accuracy']:.4f}")
        lines.append(f"Micro-Macro Gap: {s['micro_macro_gap']:+.4f}")
        lines.append(f"Imbalance Ratio: {s['class_balance']['imbalance_ratio']:.2f}")

        # Aggregated metrics
        lines.append("\n" + "-" * 70)
        lines.append(f"{'Averaging':<12} {'Precision':<12} {'Recall':<12} {'F1':<12}")
        lines.append("-" * 70)
        for avg in ['micro', 'macro', 'weighted']:
            m = report[avg]
            lines.append(f"{avg.capitalize():<12} {m['precision']:<12.4f} "
                         f"{m['recall']:<12.4f} {m['f1']:<12.4f}")

        # Per-class
        lines.append("\n" + "-" * 70)
        lines.append(f"{'Class':<20} {'P':<8} {'R':<8} {'F1':<8} {'Support':<10}")
        lines.append("-" * 70)
        for name, m in report['per_class'].items():
            lines.append(f"{name:<20} {m['precision']:<8.4f} {m['recall']:<8.4f} "
                         f"{m['f1']:<8.4f} {m['support']:<10}")

        lines.append("=" * 70)
        return "\n".join(lines)


# Example usage
def demonstrate_evaluator():
    np.random.seed(42)

    # Simulate imbalanced multi-class problem
    y_true = np.concatenate([
        np.zeros(800),
        np.ones(150),
        np.full(50, 2)
    ]).astype(int)

    y_pred = y_true.copy()
    # Add errors
    y_pred[:80] = 1        # Some class 0 predicted as 1
    y_pred[800:820] = 0    # Some class 1 predicted as 0
    y_pred[950:] = 1       # All class 2 predicted as 1

    evaluator = MultiClassEvaluator(
        class_names={0: 'Negative', 1: 'Neutral', 2: 'Positive'}
    )

    report = evaluator.evaluate(y_true, y_pred)
    print(evaluator.format_report(report))

    return report


if __name__ == "__main__":
    demonstrate_evaluator()
```

When a class receives no predictions (TP + FP = 0), its precision is undefined. When a class has no true samples (TP + FN = 0), its recall is undefined. The zero_division parameter controls behavior in these cases:
• zero_division=0: Return 0 (conservative, may lower averages)
• zero_division=1: Return 1 (optimistic, treats undefined as perfect)
• zero_division='warn': Return 0 but emit a warning
For macro averaging, undefined per-class metrics significantly impact the average. Document your choice.
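A small sketch (with assumed toy labels in which one class is never predicted) shows how much the zero_division choice can move a macro average:

```python
import numpy as np
from sklearn.metrics import precision_score

# Class 2 exists in y_true but is never predicted, so its precision is
# undefined (TP + FP = 0); labels are assumed purely for illustration.
y_true = np.array([0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 0, 1])

for zd in (0, 1):
    macro_p = precision_score(y_true, y_pred, average='macro', zero_division=zd)
    print(f"zero_division={zd}: macro precision = {macro_p:.3f}")
# With zero_division=0 the undefined class drags the macro average down (~0.39);
# with zero_division=1 it inflates the average instead (~0.72).
```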
We've explored the two fundamental approaches to aggregating multi-class metrics. Let's consolidate the key insights:
• Macro averaging gives every class equal weight; it exposes poor minority-class performance but can be swayed by very small classes.
• Micro averaging gives every instance equal weight; in single-label classification it equals accuracy and is dominated by majority classes.
• The micro-macro gap is itself a diagnostic: a large gap signals class imbalance combined with uneven per-class performance.
• Neither choice is neutral—each encodes an implicit weighting scheme—so report both, alongside per-class metrics, whenever possible.
What's next:
Micro and macro averaging represent the extremes of uniform weighting (by instance vs. by class). In the next page, we'll explore weighted averaging, which provides a middle ground by allowing explicit, customized importance weights for each class. This approach enables you to align evaluation metrics directly with business priorities.
You now understand the mathematical foundations, practical implications, and decision frameworks for micro and macro averaging. You can analyze when these methods diverge, diagnose what the divergence indicates, and make principled choices about which averaging strategy fits your application context.