Micro averaging weights classes by sample frequency; macro averaging assigns equal weights. But what if neither reflects your true priorities?
In real-world applications, class importance rarely aligns with either extreme. A fraud detection system might need to weight rare fraud cases 10× more heavily than legitimate transactions: far more than their frequency warrants, but still short of the equal weighting that macro averaging would impose. A medical diagnostic might weight critical conditions more heavily than benign findings, with weights that reflect clinical severity rather than prevalence.
Weighted averaging provides the flexibility to align evaluation metrics directly with business priorities, enabling principled evaluation that reflects actual deployment requirements.
By the end of this page, you will understand support-based weighted averaging, design custom weighting schemes for domain-specific priorities, analyze how weight choices affect metric behavior, and implement production-ready weighted evaluation pipelines.
The most common form of weighted averaging is support-weighted averaging, where each class is weighted by its number of samples in the evaluation set. This represents a middle ground between micro and macro averaging.
For a metric M across K classes with supports sᵢ:
Weighted-M = Σᵢ (sᵢ / Σⱼ sⱼ) × Mᵢ
where sᵢ = |{samples with true label = i}| is the support for class i.
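As a quick illustration with made-up per-class scores: suppose the supports are s = (800, 150, 50), giving normalized weights of 0.80, 0.15, and 0.05, and the per-class F1 scores are 0.90, 0.70, and 0.40. Then Weighted-F1 = 0.80 × 0.90 + 0.15 × 0.70 + 0.05 × 0.40 = 0.845, with the dominant class contributing most of the score.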
This differs from micro averaging, which weights by prediction-based counts rather than ground-truth counts.
Key distinction from micro averaging:
While both methods give more weight to larger classes, they use different weighting bases:
• Weighted averaging weights each class by its true support sᵢ, the number of ground-truth samples of that class.
• Micro averaging aggregates counts globally, which for precision amounts to weighting each class by how many predictions the model makes for it (TPᵢ + FPᵢ).
For recall, micro and weighted are identical. For precision, they can differ when the model's prediction distribution doesn't match the true class distribution.
When they diverge:
If a model over-predicts class A and under-predicts class B:
• Micro precision is pulled toward class A's precision, because a larger share of the model's predictions now fall in class A.
• Weighted precision keeps its weights fixed at the true class proportions, regardless of how the model distributes its predictions.
This makes weighted averaging more stable when comparing models with different prediction biases.
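The following minimal sketch illustrates the divergence on a hypothetical two-class problem where the model under-predicts class 0 and over-predicts class 1; the counts are invented for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical data: 90 samples of class 0, 10 of class 1.
y_true = np.array([0] * 90 + [1] * 10)
# The model makes only 85 class-0 predictions but 15 class-1 predictions.
y_pred = np.array([0] * 80 + [1] * 10 +   # true class 0: 80 correct, 10 mistaken as class 1
                  [1] * 5 + [0] * 5)      # true class 1: 5 correct, 5 mistaken as class 0

for avg in ('micro', 'weighted'):
    p, r, _, _ = precision_recall_fscore_support(y_true, y_pred, average=avg)
    print(f"{avg:>8}: precision={p:.3f}, recall={r:.3f}")
# Output:
#    micro: precision=0.850, recall=0.850
# weighted: precision=0.880, recall=0.850
```

The recalls match, but the precisions diverge: weighted precision stays anchored to the true 90/10 class split, while micro precision reflects the skewed 85/15 prediction split.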
```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support
from typing import Dict, List, Tuple, Optional


def compute_weighted_metrics(
    y_true: np.ndarray,
    y_pred: np.ndarray,
    weights: Optional[Dict[int, float]] = None
) -> Tuple[float, float, float]:
    """
    Compute weighted-averaged precision, recall, and F1.

    Parameters
    ----------
    y_true : array-like
        Ground truth labels
    y_pred : array-like
        Predicted labels
    weights : dict, optional
        Custom weights per class. If None, uses support-based weights.

    Returns
    -------
    precision, recall, f1 : floats
        Weighted-averaged metrics
    """
    if weights is None:
        # Use sklearn's built-in support-weighted averaging
        p, r, f1, _ = precision_recall_fscore_support(
            y_true, y_pred, average='weighted', zero_division=0
        )
        return p, r, f1

    # Custom weights implementation
    classes = np.unique(np.concatenate([y_true, y_pred]))
    per_class_metrics = []
    class_weights = []

    for cls in classes:
        tp = np.sum((y_pred == cls) & (y_true == cls))
        fp = np.sum((y_pred == cls) & (y_true != cls))
        fn = np.sum((y_pred != cls) & (y_true == cls))

        p = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        r = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0

        per_class_metrics.append({'precision': p, 'recall': r, 'f1': f1})
        class_weights.append(weights.get(cls, 1.0))

    # Normalize weights
    total_weight = sum(class_weights)
    norm_weights = [w / total_weight for w in class_weights]

    weighted_p = sum(w * m['precision'] for w, m in zip(norm_weights, per_class_metrics))
    weighted_r = sum(w * m['recall'] for w, m in zip(norm_weights, per_class_metrics))
    weighted_f1 = sum(w * m['f1'] for w, m in zip(norm_weights, per_class_metrics))

    return weighted_p, weighted_r, weighted_f1


def compare_averaging_methods(y_true: np.ndarray, y_pred: np.ndarray):
    """Compare micro, macro, and weighted averaging."""
    micro = precision_recall_fscore_support(y_true, y_pred, average='micro')
    macro = precision_recall_fscore_support(y_true, y_pred, average='macro')
    weighted = precision_recall_fscore_support(y_true, y_pred, average='weighted')

    print("Averaging Method Comparison:")
    print("-" * 50)
    print(f"{'Method':<12} {'Precision':<12} {'Recall':<12} {'F1':<12}")
    print("-" * 50)
    print(f"{'Micro':<12} {micro[0]:<12.4f} {micro[1]:<12.4f} {micro[2]:<12.4f}")
    print(f"{'Macro':<12} {macro[0]:<12.4f} {macro[1]:<12.4f} {macro[2]:<12.4f}")
    print(f"{'Weighted':<12} {weighted[0]:<12.4f} {weighted[1]:<12.4f} {weighted[2]:<12.4f}")
```

Beyond support-based weighting, you can design custom weight schemes that align with specific business requirements. Common strategies include:
| Strategy | Weight Formula | Use Case |
|---|---|---|
| Inverse frequency | wᵢ = 1/sᵢ | Balances toward minority classes (similar to macro) |
| Square root inverse | wᵢ = 1/√sᵢ | Moderate boost for minorities, less extreme |
| Cost-based | wᵢ = misclassification_costᵢ | Aligns with business loss function |
| Log inverse | wᵢ = log(N/sᵢ) | TF-IDF style, smooth minority boost |
| Binary importance | wᵢ ∈ {1, c} | Focus on specific critical classes |
| Hierarchical | wᵢ = f(depth in taxonomy) | Taxonomy-aware evaluation |
```python
import numpy as np
from typing import Dict, Callable


class WeightingStrategy:
    """Factory for common weighting strategies."""

    @staticmethod
    def support_weights(supports: Dict[int, int]) -> Dict[int, float]:
        """Standard support-based weights."""
        total = sum(supports.values())
        return {k: v / total for k, v in supports.items()}

    @staticmethod
    def inverse_frequency(supports: Dict[int, int]) -> Dict[int, float]:
        """Inverse frequency weighting (approximates macro)."""
        inv = {k: 1.0 / v for k, v in supports.items()}
        total = sum(inv.values())
        return {k: v / total for k, v in inv.items()}

    @staticmethod
    def sqrt_inverse(supports: Dict[int, int]) -> Dict[int, float]:
        """Square root inverse - moderate minority boost."""
        inv = {k: 1.0 / np.sqrt(v) for k, v in supports.items()}
        total = sum(inv.values())
        return {k: v / total for k, v in inv.items()}

    @staticmethod
    def cost_based(costs: Dict[int, float]) -> Dict[int, float]:
        """Weights proportional to per-class misclassification costs."""
        total = sum(costs.values())
        return {k: v / total for k, v in costs.items()}

    @staticmethod
    def focal_weights(supports: Dict[int, int], gamma: float = 2.0) -> Dict[int, float]:
        """Focal-loss inspired weighting."""
        total_samples = sum(supports.values())
        weights = {}
        for k, v in supports.items():
            freq = v / total_samples
            weights[k] = (1 - freq) ** gamma
        total = sum(weights.values())
        return {k: v / total for k, v in weights.items()}


def demonstrate_weight_impact():
    """Show how different weights change evaluation results."""
    np.random.seed(42)

    # Imbalanced data
    y_true = np.array([0]*800 + [1]*150 + [2]*50)
    y_pred = y_true.copy()
    y_pred[:100] = 1       # Class 0 errors
    y_pred[800:830] = 0    # Class 1 errors
    y_pred[950:] = 0       # Class 2 completely wrong

    supports = {0: 800, 1: 150, 2: 50}

    strategies = {
        'Support': WeightingStrategy.support_weights(supports),
        'Inverse': WeightingStrategy.inverse_frequency(supports),
        'Sqrt-Inv': WeightingStrategy.sqrt_inverse(supports),
        'Focal(γ=2)': WeightingStrategy.focal_weights(supports, gamma=2.0),
    }

    print("Weight Strategy Comparison:")
    print("=" * 60)
    for name, weights in strategies.items():
        p, r, f1 = compute_weighted_metrics(y_true, y_pred, weights)
        print(f"{name:<12}: P={p:.3f}, R={r:.3f}, F1={f1:.3f}")
        print(f"  Weights: {weights}")
```

The best weighting strategy depends on your objectives:
• Maximize user-perceived accuracy → Support weights
• Ensure fairness across groups → Inverse frequency
• Minimize financial loss → Cost-based weights
• Focus on hard cases → Focal weights
Document your choice—it's a design decision that affects model selection.
The most principled approach to weighted evaluation derives weights from a cost matrix that quantifies the real-world impact of each type of error. This creates direct alignment between your evaluation metric and business outcomes.
The cost matrix formulation:
Define a K×K cost matrix C where Cᵢⱼ is the cost of predicting class j when the true class is i:
| | Pred: 0 | Pred: 1 | Pred: 2 |
|---|---|---|---|
| True: 0 | 0 | 10 | 5 |
| True: 1 | 100 | 0 | 20 |
| True: 2 | 50 | 15 | 0 |
In this example:
• Predicting class 0 when the true class is 1 is the most expensive error (cost 100).
• Errors on true class 2 are moderately costly (50 when predicted as class 0, 15 when predicted as class 1).
• Errors on true class 0 are comparatively cheap (10 or 5).
From cost matrix to class weights:
Class importance can be derived from the cost matrix as:
wᵢ = Σⱼ Cᵢⱼ (total cost of misclassifying class i)
Alternatively, you can use more sophisticated formulations that consider both directions of error, counting not only how costly it is to misclassify class i but also how costly it is when other classes are mistaken for it; one such variant is sketched below.
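To make this concrete, here is a small sketch using the example cost matrix above. The row-sum weights follow the formula just given; the "both directions" variant is only one illustrative possibility, not a standard formula.

```python
import numpy as np

# The example cost matrix from the table above (rows = true class, columns = predicted class).
C = np.array([
    [0,  10,  5],
    [100, 0, 20],
    [50, 15,  0],
])

# Row-based weights: total cost of misclassifying class i (the diagonal is zero, so it adds nothing).
row_weights = C.sum(axis=1)                     # [15, 120, 65]
print(row_weights / row_weights.sum())          # ≈ [0.075, 0.600, 0.325]

# Illustrative "both directions" variant: also count the cost of other classes
# being mistaken FOR class i (column sums).
bidir_weights = C.sum(axis=1) + C.sum(axis=0)   # [165, 145, 90]
print(bidir_weights / bidir_weights.sum())      # ≈ [0.413, 0.363, 0.225]
```

Under the row-based formula, class 1 dominates because missing it is so expensive; the bidirectional variant also boosts class 0, since mistaking other classes for it is costly too.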
```python
import numpy as np
from typing import Dict


def cost_matrix_to_weights(cost_matrix: np.ndarray) -> Dict[int, float]:
    """
    Derive class weights from a cost matrix.

    Weight for each class is the average cost of misclassifying
    samples from that class.
    """
    K = cost_matrix.shape[0]
    weights = {}

    for i in range(K):
        # Sum of costs when true class is i (row i, excluding diagonal)
        misclass_costs = np.sum(cost_matrix[i, :]) - cost_matrix[i, i]
        weights[i] = misclass_costs / (K - 1)  # Average over wrong predictions

    # Normalize
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}


def compute_expected_cost(
    y_true: np.ndarray,
    y_pred: np.ndarray,
    cost_matrix: np.ndarray
) -> float:
    """
    Compute total expected cost of predictions.

    This is the most direct business-aligned metric.
    """
    total_cost = 0.0
    for true_label, pred_label in zip(y_true, y_pred):
        total_cost += cost_matrix[true_label, pred_label]
    return total_cost / len(y_true)  # Average cost per sample


# Example: Fraud detection cost matrix
fraud_costs = np.array([
    [0, 1],      # True legitimate: predict legit=0, predict fraud=1 (false positive)
    [100, 0]     # True fraud: predict legit=100, predict fraud=0 (caught)
])

print("Fraud Detection Costs:")
print(f"  False Positive (flag legit as fraud): $1")
print(f"  False Negative (miss fraud): $100")
print(f"  Derived weights: {cost_matrix_to_weights(fraud_costs)}")
```

Understanding how metrics change with weight variations is crucial for robust evaluation. Sensitivity analysis helps identify whether your conclusions depend critically on specific weight choices.
```python
import numpy as np
import matplotlib.pyplot as plt
from typing import List, Tuple


def weight_sensitivity_analysis(
    y_true: np.ndarray,
    y_pred: np.ndarray,
    class_to_vary: int,
    weight_range: Tuple[float, float] = (0.1, 10.0),
    n_points: int = 50
) -> dict:
    """
    Analyze how weighted F1 changes as one class's weight varies.

    Returns data for sensitivity visualization.
    """
    weight_values = np.linspace(weight_range[0], weight_range[1], n_points)
    f1_values = []

    classes = np.unique(y_true)
    base_weights = {c: 1.0 for c in classes}

    for w in weight_values:
        weights = base_weights.copy()
        weights[class_to_vary] = w
        _, _, f1 = compute_weighted_metrics(y_true, y_pred, weights)
        f1_values.append(f1)

    # Compute sensitivity (derivative)
    sensitivity = np.gradient(f1_values, weight_values)

    return {
        'weights': weight_values,
        'f1': f1_values,
        'sensitivity': sensitivity,
        'max_sensitivity_at': weight_values[np.argmax(np.abs(sensitivity))]
    }


def robustness_check(
    y_true: np.ndarray,
    y_pred: np.ndarray,
    base_weights: dict,
    perturbation: float = 0.1,
    n_samples: int = 100
) -> dict:
    """
    Check robustness of weighted F1 to small weight perturbations.

    Samples random perturbations and reports F1 distribution.
    """
    f1_samples = []
    for _ in range(n_samples):
        perturbed = {}
        for k, v in base_weights.items():
            noise = np.random.uniform(-perturbation, perturbation)
            perturbed[k] = max(0.01, v * (1 + noise))
        _, _, f1 = compute_weighted_metrics(y_true, y_pred, perturbed)
        f1_samples.append(f1)

    return {
        'mean_f1': np.mean(f1_samples),
        'std_f1': np.std(f1_samples),
        'range': (np.min(f1_samples), np.max(f1_samples)),
        'is_robust': np.std(f1_samples) < 0.01  # Arbitrary threshold
    }
```

If small weight changes flip which model appears best, your evaluation is unstable. Either refine your weights with stakeholder input, or report results across a range of plausible weight vectors.
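A short usage sketch, assuming the two functions above (and compute_weighted_metrics from earlier) are in scope; the labels, noise pattern, and base weights are invented for illustration.

```python
import numpy as np

# Hypothetical imbalanced labels with some random prediction errors.
rng = np.random.default_rng(0)
y_true = np.array([0] * 800 + [1] * 150 + [2] * 50)
y_pred = y_true.copy()
flip = rng.choice(len(y_true), size=120, replace=False)
y_pred[flip] = rng.integers(0, 3, size=120)

# How much does weighted F1 move as the weight on the rare class varies?
sens = weight_sensitivity_analysis(y_true, y_pred, class_to_vary=2)
print(f"F1 range as class-2 weight varies: {min(sens['f1']):.3f} to {max(sens['f1']):.3f}")

# Is the score stable under ±10% noise on an assumed business weighting?
robust = robustness_check(y_true, y_pred, base_weights={0: 1.0, 1: 2.0, 2: 5.0})
print(f"F1 = {robust['mean_f1']:.3f} ± {robust['std_f1']:.3f}, robust: {robust['is_robust']}")
```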
You now understand weighted averaging as a flexible middle ground between micro and macro approaches. Next, we'll explore Cohen's kappa—a metric that accounts for chance agreement and provides a more robust measure of classifier quality.