In the rapidly evolving landscape of Large Language Model (LLM) development, comparing model outputs is a critical task for researchers, engineers, and product teams. Whether you're selecting a production model, tuning hyperparameters, or validating model improvements, you need a systematic way to determine which model performs better.
Head-to-head evaluation is a widely adopted paradigm where two model responses are compared against each other across multiple quality dimensions. Human annotators or automated evaluation systems assign scores to each response based on predefined criteria, and a winner is determined based on weighted aggregation of these scores.
Your task is to implement a weighted model response evaluator that analyzes comparison data between two AI models (Model A and Model B) and determines winners, overall statistics, and performance margins.
Computing the Weighted Score:
For each response, the weighted score is computed by:
$$\text{weighted\_score} = \sum_{c \in \text{criteria}} \frac{w_c}{\sum_{k \in \text{criteria}} w_k} \times \text{score}_c$$
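A minimal sketch of this formula in Python (the function name `weighted_score` is illustrative, not part of any required interface):

```python
def weighted_score(scores, weights):
    """Normalize the weights to sum to 1, then take the weighted sum of the scores."""
    total = sum(weights.values())
    return sum((w / total) * scores[c] for c, w in weights.items())

# Scores from Example 1, Comparison 1; these weights already sum to 1.0
score_a = weighted_score({'quality': 8, 'relevance': 7}, {'quality': 0.6, 'relevance': 0.4})
score_b = weighted_score({'quality': 6, 'relevance': 8}, {'quality': 0.6, 'relevance': 0.4})
```

Because the weights are divided by their total, the result is unchanged if all weights are scaled by the same factor.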
Determining the Winner:
• diff = weighted_score_a - weighted_score_b
• margin = |diff| (absolute value)
• If margin ≤ tie_threshold, the result is a tie.
• Otherwise, if diff > 0, Model A wins; if diff < 0, Model B wins.

Aggregating Statistics:
After processing all comparisons, compute:
• win_rate_a: fraction of comparisons won by Model A
• win_rate_b: fraction of comparisons won by Model B
• tie_rate: fraction of comparisons that ended in a tie
• avg_margin: average of the margins across all comparisons
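Putting the scoring, winner, and aggregation rules together, here is a minimal sketch of the whole evaluator (the function name `evaluate_comparisons` is an assumption; margins are rounded only to tidy floating-point noise):

```python
def evaluate_comparisons(comparisons, criteria_weights, tie_threshold):
    """Score every comparison, pick winners, and aggregate overall statistics."""
    total_w = sum(criteria_weights.values())

    def weighted(scores):
        # Normalize weights so they sum to 1, then take the weighted sum.
        return sum((w / total_w) * scores[c] for c, w in criteria_weights.items())

    results, margins = [], []
    wins_a = wins_b = ties = 0
    for comp in comparisons:
        diff = weighted(comp['scores_a']) - weighted(comp['scores_b'])
        margin = round(abs(diff), 6)  # rounding only smooths float noise
        if margin <= tie_threshold:
            winner, ties = 'tie', ties + 1
        elif diff > 0:
            winner, wins_a = 'A', wins_a + 1
        else:
            winner, wins_b = 'B', wins_b + 1
        results.append({'id': comp['id'], 'winner': winner, 'margin': margin})
        margins.append(margin)

    n = len(comparisons)  # behavior for n == 0 is left to the Edge Cases section
    return {'results': results,
            'win_rate_a': wins_a / n,
            'win_rate_b': wins_b / n,
            'tie_rate': ties / n,
            'avg_margin': round(sum(margins) / n, 6)}
```

Note that an empty `comparisons` list would divide by zero here; how that case should be handled depends on the edge-case rules below.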
Edge Cases:
Input:
comparisons = [
{'id': 1, 'scores_a': {'quality': 8, 'relevance': 7}, 'scores_b': {'quality': 6, 'relevance': 8}},
{'id': 2, 'scores_a': {'quality': 5, 'relevance': 6}, 'scores_b': {'quality': 7, 'relevance': 8}}
]
criteria_weights = {'quality': 0.6, 'relevance': 0.4}
tie_threshold = 0.5

Output:
{'results': [{'id': 1, 'winner': 'A', 'margin': 0.8}, {'id': 2, 'winner': 'B', 'margin': 2.0}], 'win_rate_a': 0.5, 'win_rate_b': 0.5, 'tie_rate': 0.0, 'avg_margin': 1.4}

Step-by-step computation:
Normalize weights: Total weight = 0.6 + 0.4 = 1.0 (already normalized)
Comparison 1:
• Model A weighted score = (8 × 0.6) + (7 × 0.4) = 4.8 + 2.8 = 7.6
• Model B weighted score = (6 × 0.6) + (8 × 0.4) = 3.6 + 3.2 = 6.8
• Difference = 7.6 - 6.8 = 0.8 > 0.5 → Model A wins with margin 0.8

Comparison 2:
• Model A weighted score = (5 × 0.6) + (6 × 0.4) = 3.0 + 2.4 = 5.4
• Model B weighted score = (7 × 0.6) + (8 × 0.4) = 4.2 + 3.2 = 7.4
• Difference = 5.4 - 7.4 = -2.0 → Model B wins with margin 2.0

Overall Statistics:
• Win Rate A = 1/2 = 0.5
• Win Rate B = 1/2 = 0.5
• Tie Rate = 0/2 = 0.0
• Average Margin = (0.8 + 2.0) / 2 = 1.4
Input:
comparisons = [
{'id': 1, 'scores_a': {'quality': 7, 'fluency': 8}, 'scores_b': {'quality': 8, 'fluency': 7}}
]
criteria_weights = {'quality': 0.5, 'fluency': 0.5}
tie_threshold = 0.5

Output:
{'results': [{'id': 1, 'winner': 'tie', 'margin': 0.0}], 'win_rate_a': 0.0, 'win_rate_b': 0.0, 'tie_rate': 1.0, 'avg_margin': 0.0}

Comparison 1:
• Model A weighted score = (7 × 0.5) + (8 × 0.5) = 3.5 + 4.0 = 7.5
• Model B weighted score = (8 × 0.5) + (7 × 0.5) = 4.0 + 3.5 = 7.5
• Difference = 7.5 - 7.5 = 0.0
Since the margin (0.0) ≤ tie_threshold (0.5), this is declared a tie.
Overall Statistics:
• Win Rate A = 0.0, Win Rate B = 0.0, Tie Rate = 1.0
• Average Margin = 0.0
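The tie rule can be checked directly with a standalone snippet (variable names here are illustrative); since both weighted scores come out to exactly 7.5, the margin is 0.0:

```python
# Example 2: weights {'quality': 0.5, 'fluency': 0.5} already sum to 1.0
score_a = 7 * 0.5 + 8 * 0.5   # 3.5 + 4.0 = 7.5
score_b = 8 * 0.5 + 7 * 0.5   # 4.0 + 3.5 = 7.5
diff = score_a - score_b
margin = abs(diff)
# margin (0.0) <= tie_threshold (0.5), so the comparison is a tie
winner = 'tie' if margin <= 0.5 else ('A' if diff > 0 else 'B')
```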
Input:
comparisons = [
{'id': 1, 'scores_a': {'accuracy': 9, 'clarity': 8, 'helpfulness': 9},
'scores_b': {'accuracy': 5, 'clarity': 6, 'helpfulness': 5}}
]
criteria_weights = {'accuracy': 0.4, 'clarity': 0.3, 'helpfulness': 0.3}
tie_threshold = 0.1

Output:
{'results': [{'id': 1, 'winner': 'A', 'margin': 3.4}], 'win_rate_a': 1.0, 'win_rate_b': 0.0, 'tie_rate': 0.0, 'avg_margin': 3.4}

Multi-criteria comparison with three evaluation dimensions:
Comparison 1:
• Model A weighted score = (9 × 0.4) + (8 × 0.3) + (9 × 0.3) = 3.6 + 2.4 + 2.7 = 8.7
• Model B weighted score = (5 × 0.4) + (6 × 0.3) + (5 × 0.3) = 2.0 + 1.8 + 1.5 = 5.3
• Difference = 8.7 - 5.3 = 3.4
Since 3.4 far exceeds the tie threshold of 0.1, Model A wins decisively with margin 3.4.
This example demonstrates a scenario where one model significantly outperforms the other across all quality dimensions.
Constraints