Micro averaging weights classes by sample frequency; macro averaging assigns equal weights. But what if neither reflects your true priorities?
In real-world applications, class importance rarely aligns with either extreme. A fraud detection system might need to weight rare fraud cases 10× more heavily than legitimate transactions: far more than their frequency warrants, but still short of the equal weighting that macro averaging would impose. A medical diagnostic might weight critical conditions more heavily than benign findings, with weights that reflect clinical severity rather than prevalence.
Weighted averaging provides the flexibility to align evaluation metrics directly with business priorities, enabling principled evaluation that reflects actual deployment requirements.
By the end of this page, you will understand support-based weighted averaging, design custom weighting schemes for domain-specific priorities, analyze how weight choices affect metric behavior, and implement production-ready weighted evaluation pipelines.
The most common form of weighted averaging is support-weighted averaging, where each class is weighted by its number of samples in the evaluation set. This represents a middle ground between micro and macro averaging.
For a metric M across K classes with supports sᵢ:
Weighted-M = Σᵢ (sᵢ / Σⱼ sⱼ) × Mᵢ
where sᵢ = |{samples with true label = i}| is the support for class i.
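As a quick illustration with made-up per-class scores: suppose the supports are s = (800, 150, 50), giving normalized weights of 0.80, 0.15, and 0.05, and the per-class F1 scores are 0.90, 0.70, and 0.40. Then Weighted-F1 = 0.80 × 0.90 + 0.15 × 0.70 + 0.05 × 0.40 = 0.845, with the dominant class contributing most of the score.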
This differs from micro averaging, which weights by prediction-based counts rather than ground-truth counts.
Key distinction from micro averaging:
While both methods give more weight to larger classes, they use different weighting bases:
• Weighted averaging weights each class by its true support sᵢ, the number of ground-truth samples of that class.
• Micro averaging aggregates counts globally, which for precision amounts to weighting each class by how many predictions the model makes for it (TPᵢ + FPᵢ).
For recall, micro and weighted are identical. For precision, they can differ when the model's prediction distribution doesn't match the true class distribution.
When they diverge:
If a model over-predicts class A and under-predicts class B:
• Micro precision is pulled toward class A's precision, because a larger share of the model's predictions now fall in class A.
• Weighted precision keeps its weights fixed at the true class proportions, regardless of how the model distributes its predictions.
This makes weighted averaging more stable when comparing models with different prediction biases.
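The following minimal sketch illustrates the divergence on a hypothetical two-class problem where the model under-predicts class 0 and over-predicts class 1; the counts are invented for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical data: 90 samples of class 0, 10 of class 1.
y_true = np.array([0] * 90 + [1] * 10)
# The model makes only 85 class-0 predictions but 15 class-1 predictions.
y_pred = np.array([0] * 80 + [1] * 10 +   # true class 0: 80 correct, 10 mistaken as class 1
                  [1] * 5 + [0] * 5)      # true class 1: 5 correct, 5 mistaken as class 0

for avg in ('micro', 'weighted'):
    p, r, _, _ = precision_recall_fscore_support(y_true, y_pred, average=avg)
    print(f"{avg:>8}: precision={p:.3f}, recall={r:.3f}")
# Output:
#    micro: precision=0.850, recall=0.850
# weighted: precision=0.880, recall=0.850
```

The recalls match, but the precisions diverge: weighted precision stays anchored to the true 90/10 class split, while micro precision reflects the skewed 85/15 prediction split.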
```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support
from typing import Dict, List, Tuple, Optional


def compute_weighted_metrics(
    y_true: np.ndarray,
    y_pred: np.ndarray,
    weights: Optional[Dict[int, float]] = None
) -> Tuple[float, float, float]:
    """
    Compute weighted-averaged precision, recall, and F1.

    Parameters
    ----------
    y_true : array-like
        Ground truth labels
    y_pred : array-like
        Predicted labels
    weights : dict, optional
        Custom weights per class. If None, uses support-based weights.

    Returns
    -------
    precision, recall, f1 : floats
        Weighted-averaged metrics
    """
    if weights is None:
        # Use sklearn's built-in support-weighted averaging
        p, r, f1, _ = precision_recall_fscore_support(
            y_true, y_pred, average='weighted', zero_division=0
        )
        return p, r, f1

    # Custom weights implementation
    classes = np.unique(np.concatenate([y_true, y_pred]))
    per_class_metrics = []
    class_weights = []

    for cls in classes:
        tp = np.sum((y_pred == cls) & (y_true == cls))
        fp = np.sum((y_pred == cls) & (y_true != cls))
        fn = np.sum((y_pred != cls) & (y_true == cls))

        p = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        r = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0

        per_class_metrics.append({'precision': p, 'recall': r, 'f1': f1})
        class_weights.append(weights.get(cls, 1.0))

    # Normalize weights
    total_weight = sum(class_weights)
    norm_weights = [w / total_weight for w in class_weights]

    weighted_p = sum(w * m['precision'] for w, m in zip(norm_weights, per_class_metrics))
    weighted_r = sum(w * m['recall'] for w, m in zip(norm_weights, per_class_metrics))
    weighted_f1 = sum(w * m['f1'] for w, m in zip(norm_weights, per_class_metrics))

    return weighted_p, weighted_r, weighted_f1


def compare_averaging_methods(y_true: np.ndarray, y_pred: np.ndarray):
    """Compare micro, macro, and weighted averaging."""
    micro = precision_recall_fscore_support(y_true, y_pred, average='micro')
    macro = precision_recall_fscore_support(y_true, y_pred, average='macro')
    weighted = precision_recall_fscore_support(y_true, y_pred, average='weighted')

    print("Averaging Method Comparison:")
    print("-" * 50)
    print(f"{'Method':<12} {'Precision':<12} {'Recall':<12} {'F1':<12}")
    print("-" * 50)
    print(f"{'Micro':<12} {micro[0]:<12.4f} {micro[1]:<12.4f} {micro[2]:<12.4f}")
    print(f"{'Macro':<12} {macro[0]:<12.4f} {macro[1]:<12.4f} {macro[2]:<12.4f}")
    print(f"{'Weighted':<12} {weighted[0]:<12.4f} {weighted[1]:<12.4f} {weighted[2]:<12.4f}")
```

Beyond support-based weighting, you can design custom weight schemes that align with specific business requirements. Common strategies include:
| Strategy | Weight Formula | Use Case |
|---|---|---|
| Inverse frequency | wᵢ = 1/sᵢ | Balances toward minority classes (similar to macro) |
| Square root inverse | wᵢ = 1/√sᵢ | Moderate boost for minorities, less extreme |
| Cost-based | wᵢ = misclassification_costᵢ | Aligns with business loss function |
| Log inverse | wᵢ = log(N/sᵢ) | TF-IDF style, smooth minority boost |
| Binary importance | wᵢ ∈ {1, c} | Focus on specific critical classes |
| Hierarchical | wᵢ = f(depth in taxonomy) | Taxonomy-aware evaluation |
```python
import numpy as np
from typing import Dict, Callable


class WeightingStrategy:
    """Factory for common weighting strategies."""

    @staticmethod
    def support_weights(supports: Dict[int, int]) -> Dict[int, float]:
        """Standard support-based weights."""
        total = sum(supports.values())
        return {k: v / total for k, v in supports.items()}

    @staticmethod
    def inverse_frequency(supports: Dict[int, int]) -> Dict[int, float]:
        """Inverse frequency weighting (approximates macro)."""
        inv = {k: 1.0 / v for k, v in supports.items()}
        total = sum(inv.values())
        return {k: v / total for k, v in inv.items()}

    @staticmethod
    def sqrt_inverse(supports: Dict[int, int]) -> Dict[int, float]:
        """Square root inverse - moderate minority boost."""
        inv = {k: 1.0 / np.sqrt(v) for k, v in supports.items()}
        total = sum(inv.values())
        return {k: v / total for k, v in inv.items()}

    @staticmethod
    def cost_based(costs: Dict[int, float]) -> Dict[int, float]:
        """Weights proportional to per-class misclassification costs."""
        total = sum(costs.values())
        return {k: v / total for k, v in costs.items()}

    @staticmethod
    def focal_weights(supports: Dict[int, int], gamma: float = 2.0) -> Dict[int, float]:
        """Focal-loss inspired weighting."""
        total_samples = sum(supports.values())
        weights = {}
        for k, v in supports.items():
            freq = v / total_samples
            weights[k] = (1 - freq) ** gamma
        total = sum(weights.values())
        return {k: v / total for k, v in weights.items()}


def demonstrate_weight_impact():
    """Show how different weights change evaluation results."""
    np.random.seed(42)

    # Imbalanced data
    y_true = np.array([0]*800 + [1]*150 + [2]*50)
    y_pred = y_true.copy()
    y_pred[:100] = 1       # Class 0 errors
    y_pred[800:830] = 0    # Class 1 errors
    y_pred[950:] = 0       # Class 2 completely wrong

    supports = {0: 800, 1: 150, 2: 50}

    strategies = {
        'Support': WeightingStrategy.support_weights(supports),
        'Inverse': WeightingStrategy.inverse_frequency(supports),
        'Sqrt-Inv': WeightingStrategy.sqrt_inverse(supports),
        'Focal(γ=2)': WeightingStrategy.focal_weights(supports, gamma=2.0),
    }

    print("Weight Strategy Comparison:")
    print("=" * 60)
    for name, weights in strategies.items():
        p, r, f1 = compute_weighted_metrics(y_true, y_pred, weights)
        print(f"{name:<12}: P={p:.3f}, R={r:.3f}, F1={f1:.3f}")
        print(f"  Weights: {weights}")
```

The best weighting strategy depends on your objectives:
• Maximize user-perceived accuracy → Support weights
• Ensure fairness across groups → Inverse frequency
• Minimize financial loss → Cost-based weights
• Focus on hard cases → Focal weights
Document your choice—it's a design decision that affects model selection.
The most principled approach to weighted evaluation derives weights from a cost matrix that quantifies the real-world impact of each type of error. This creates direct alignment between your evaluation metric and business outcomes.
The cost matrix formulation:
Define a K×K cost matrix C where Cᵢⱼ is the cost of predicting class j when the true class is i:
| | Pred: 0 | Pred: 1 | Pred: 2 |
|---|---|---|---|
| True: 0 | 0 | 10 | 5 |
| True: 1 | 100 | 0 | 20 |
| True: 2 | 50 | 15 | 0 |
In this example:
• Predicting class 0 when the true class is 1 is the most expensive error (cost 100).
• Errors on true class 2 are moderately costly (50 when predicted as class 0, 15 when predicted as class 1).
• Errors on true class 0 are comparatively cheap (10 or 5).
From cost matrix to class weights:
Class importance can be derived from the cost matrix as:
wᵢ = Σⱼ Cᵢⱼ (total cost of misclassifying class i)
Alternatively, you can use more sophisticated formulations that consider both directions of error, counting not only how costly it is to misclassify class i but also how costly it is when other classes are mistaken for it; one such variant is sketched below.
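To make this concrete, here is a small sketch using the example cost matrix above. The row-sum weights follow the formula just given; the "both directions" variant is only one illustrative possibility, not a standard formula.

```python
import numpy as np

# The example cost matrix from the table above (rows = true class, columns = predicted class).
C = np.array([
    [0,  10,  5],
    [100, 0, 20],
    [50, 15,  0],
])

# Row-based weights: total cost of misclassifying class i (the diagonal is zero, so it adds nothing).
row_weights = C.sum(axis=1)                     # [15, 120, 65]
print(row_weights / row_weights.sum())          # ≈ [0.075, 0.600, 0.325]

# Illustrative "both directions" variant: also count the cost of other classes
# being mistaken FOR class i (column sums).
bidir_weights = C.sum(axis=1) + C.sum(axis=0)   # [165, 145, 90]
print(bidir_weights / bidir_weights.sum())      # ≈ [0.413, 0.363, 0.225]
```

Under the row-based formula, class 1 dominates because missing it is so expensive; the bidirectional variant also boosts class 0, since mistaking other classes for it is costly too.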
```python
import numpy as np
from typing import Dict


def cost_matrix_to_weights(cost_matrix: np.ndarray) -> Dict[int, float]:
    """
    Derive class weights from a cost matrix.

    Weight for each class is the average cost of misclassifying
    samples from that class.
    """
    K = cost_matrix.shape[0]
    weights = {}

    for i in range(K):
        # Sum of costs when true class is i (row i, excluding diagonal)
        misclass_costs = np.sum(cost_matrix[i, :]) - cost_matrix[i, i]
        weights[i] = misclass_costs / (K - 1)  # Average over wrong predictions

    # Normalize
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}


def compute_expected_cost(
    y_true: np.ndarray,
    y_pred: np.ndarray,
    cost_matrix: np.ndarray
) -> float:
    """
    Compute total expected cost of predictions.

    This is the most direct business-aligned metric.
    """
    total_cost = 0.0
    for true_label, pred_label in zip(y_true, y_pred):
        total_cost += cost_matrix[true_label, pred_label]
    return total_cost / len(y_true)  # Average cost per sample


# Example: Fraud detection cost matrix
fraud_costs = np.array([
    [0, 1],      # True legitimate: predict legit=0, predict fraud=1 (false positive)
    [100, 0]     # True fraud: predict legit=100, predict fraud=0 (caught)
])

print("Fraud Detection Costs:")
print(f"  False Positive (flag legit as fraud): $1")
print(f"  False Negative (miss fraud): $100")
print(f"  Derived weights: {cost_matrix_to_weights(fraud_costs)}")
```

Understanding how metrics change with weight variations is crucial for robust evaluation. Sensitivity analysis helps identify whether your conclusions depend critically on specific weight choices.
```python
import numpy as np
import matplotlib.pyplot as plt
from typing import List, Tuple


def weight_sensitivity_analysis(
    y_true: np.ndarray,
    y_pred: np.ndarray,
    class_to_vary: int,
    weight_range: Tuple[float, float] = (0.1, 10.0),
    n_points: int = 50
) -> dict:
    """
    Analyze how weighted F1 changes as one class's weight varies.

    Returns data for sensitivity visualization.
    """
    weight_values = np.linspace(weight_range[0], weight_range[1], n_points)
    f1_values = []

    classes = np.unique(y_true)
    base_weights = {c: 1.0 for c in classes}

    for w in weight_values:
        weights = base_weights.copy()
        weights[class_to_vary] = w
        _, _, f1 = compute_weighted_metrics(y_true, y_pred, weights)
        f1_values.append(f1)

    # Compute sensitivity (derivative)
    sensitivity = np.gradient(f1_values, weight_values)

    return {
        'weights': weight_values,
        'f1': f1_values,
        'sensitivity': sensitivity,
        'max_sensitivity_at': weight_values[np.argmax(np.abs(sensitivity))]
    }


def robustness_check(
    y_true: np.ndarray,
    y_pred: np.ndarray,
    base_weights: dict,
    perturbation: float = 0.1,
    n_samples: int = 100
) -> dict:
    """
    Check robustness of weighted F1 to small weight perturbations.

    Samples random perturbations and reports F1 distribution.
    """
    f1_samples = []
    for _ in range(n_samples):
        perturbed = {}
        for k, v in base_weights.items():
            noise = np.random.uniform(-perturbation, perturbation)
            perturbed[k] = max(0.01, v * (1 + noise))
        _, _, f1 = compute_weighted_metrics(y_true, y_pred, perturbed)
        f1_samples.append(f1)

    return {
        'mean_f1': np.mean(f1_samples),
        'std_f1': np.std(f1_samples),
        'range': (np.min(f1_samples), np.max(f1_samples)),
        'is_robust': np.std(f1_samples) < 0.01  # Arbitrary threshold
    }
```

If small weight changes flip which model appears best, your evaluation is unstable. Either refine your weights with stakeholder input, or report results across a range of plausible weight vectors.
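A short usage sketch, assuming the two functions above (and compute_weighted_metrics from earlier) are in scope; the labels, noise pattern, and base weights are invented for illustration.

```python
import numpy as np

# Hypothetical imbalanced labels with some random prediction errors.
rng = np.random.default_rng(0)
y_true = np.array([0] * 800 + [1] * 150 + [2] * 50)
y_pred = y_true.copy()
flip = rng.choice(len(y_true), size=120, replace=False)
y_pred[flip] = rng.integers(0, 3, size=120)

# How much does weighted F1 move as the weight on the rare class varies?
sens = weight_sensitivity_analysis(y_true, y_pred, class_to_vary=2)
print(f"F1 range as class-2 weight varies: {min(sens['f1']):.3f} to {max(sens['f1']):.3f}")

# Is the score stable under ±10% noise on an assumed business weighting?
robust = robustness_check(y_true, y_pred, base_weights={0: 1.0, 1: 2.0, 2: 5.0})
print(f"F1 = {robust['mean_f1']:.3f} ± {robust['std_f1']:.3f}, robust: {robust['is_robust']}")
```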
You now understand weighted averaging as a flexible middle ground between micro and macro approaches. Next, we'll explore Cohen's kappa—a metric that accounts for chance agreement and provides a more robust measure of classifier quality.