In the rapidly evolving landscape of Large Language Model (LLM) development, comparing model outputs is a critical task for researchers, engineers, and product teams. Whether you're selecting a production model, tuning hyperparameters, or validating model improvements, you need a systematic way to determine which model performs better.
Head-to-head evaluation is a widely adopted paradigm where two model responses are compared against each other across multiple quality dimensions. Human annotators or automated evaluation systems assign scores to each response based on predefined criteria, and a winner is determined based on weighted aggregation of these scores.
Your task is to implement a weighted model response evaluator that analyzes comparison data between two AI models (Model A and Model B) and determines winners, overall statistics, and performance margins.
Computing the Weighted Score:
For each response, the weighted score is computed by:
$$\text{weighted\_score} = \sum_{c \in \text{criteria}} \frac{w_c}{\sum_{k \in \text{criteria}} w_k} \times \text{score}_c$$
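A minimal sketch of this formula in Python (the function name `weighted_score` is illustrative, not part of any required interface):

```python
def weighted_score(scores, weights):
    """Normalize the weights to sum to 1, then take the weighted sum of the scores."""
    total = sum(weights.values())
    return sum((w / total) * scores[c] for c, w in weights.items())

# Scores from Example 1, Comparison 1; these weights already sum to 1.0
score_a = weighted_score({'quality': 8, 'relevance': 7}, {'quality': 0.6, 'relevance': 0.4})
score_b = weighted_score({'quality': 6, 'relevance': 8}, {'quality': 0.6, 'relevance': 0.4})
```

Because the weights are divided by their total, the result is unchanged if all weights are scaled by the same factor.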
Determining the Winner:
• diff = weighted_score_a - weighted_score_b
• margin = |diff| (absolute value)
• If margin ≤ tie_threshold, the result is a tie.
• Otherwise, if diff > 0, Model A wins; if diff < 0, Model B wins.

Aggregating Statistics:
After processing all comparisons, compute:
• win_rate_a: fraction of comparisons won by Model A
• win_rate_b: fraction of comparisons won by Model B
• tie_rate: fraction of comparisons that ended in a tie
• avg_margin: average of the margins across all comparisons
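Putting the scoring, winner, and aggregation rules together, here is a minimal sketch of the whole evaluator (the function name `evaluate_comparisons` is an assumption; margins are rounded only to tidy floating-point noise):

```python
def evaluate_comparisons(comparisons, criteria_weights, tie_threshold):
    """Score every comparison, pick winners, and aggregate overall statistics."""
    total_w = sum(criteria_weights.values())

    def weighted(scores):
        # Normalize weights so they sum to 1, then take the weighted sum.
        return sum((w / total_w) * scores[c] for c, w in criteria_weights.items())

    results, margins = [], []
    wins_a = wins_b = ties = 0
    for comp in comparisons:
        diff = weighted(comp['scores_a']) - weighted(comp['scores_b'])
        margin = round(abs(diff), 6)  # rounding only smooths float noise
        if margin <= tie_threshold:
            winner, ties = 'tie', ties + 1
        elif diff > 0:
            winner, wins_a = 'A', wins_a + 1
        else:
            winner, wins_b = 'B', wins_b + 1
        results.append({'id': comp['id'], 'winner': winner, 'margin': margin})
        margins.append(margin)

    n = len(comparisons)  # behavior for n == 0 is left to the Edge Cases section
    return {'results': results,
            'win_rate_a': wins_a / n,
            'win_rate_b': wins_b / n,
            'tie_rate': ties / n,
            'avg_margin': round(sum(margins) / n, 6)}
```

Note that an empty `comparisons` list would divide by zero here; how that case should be handled depends on the edge-case rules below.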
Edge Cases:
Input:
comparisons = [
{'id': 1, 'scores_a': {'quality': 8, 'relevance': 7}, 'scores_b': {'quality': 6, 'relevance': 8}},
{'id': 2, 'scores_a': {'quality': 5, 'relevance': 6}, 'scores_b': {'quality': 7, 'relevance': 8}}
]
criteria_weights = {'quality': 0.6, 'relevance': 0.4}
tie_threshold = 0.5

Output:
{'results': [{'id': 1, 'winner': 'A', 'margin': 0.8}, {'id': 2, 'winner': 'B', 'margin': 2.0}], 'win_rate_a': 0.5, 'win_rate_b': 0.5, 'tie_rate': 0.0, 'avg_margin': 1.4}

Step-by-step computation:
Normalize weights: Total weight = 0.6 + 0.4 = 1.0 (already normalized)
Comparison 1:
• Model A weighted score = (8 × 0.6) + (7 × 0.4) = 4.8 + 2.8 = 7.6
• Model B weighted score = (6 × 0.6) + (8 × 0.4) = 3.6 + 3.2 = 6.8
• Difference = 7.6 - 6.8 = 0.8 > 0.5 → Model A wins with margin 0.8

Comparison 2:
• Model A weighted score = (5 × 0.6) + (6 × 0.4) = 3.0 + 2.4 = 5.4
• Model B weighted score = (7 × 0.6) + (8 × 0.4) = 4.2 + 3.2 = 7.4
• Difference = 5.4 - 7.4 = -2.0 → Model B wins with margin 2.0

Overall Statistics:
• Win Rate A = 1/2 = 0.5
• Win Rate B = 1/2 = 0.5
• Tie Rate = 0/2 = 0.0
• Average Margin = (0.8 + 2.0) / 2 = 1.4
Input:
comparisons = [
{'id': 1, 'scores_a': {'quality': 7, 'fluency': 8}, 'scores_b': {'quality': 8, 'fluency': 7}}
]
criteria_weights = {'quality': 0.5, 'fluency': 0.5}
tie_threshold = 0.5

Output:
{'results': [{'id': 1, 'winner': 'tie', 'margin': 0.0}], 'win_rate_a': 0.0, 'win_rate_b': 0.0, 'tie_rate': 1.0, 'avg_margin': 0.0}

Comparison 1:
• Model A weighted score = (7 × 0.5) + (8 × 0.5) = 3.5 + 4.0 = 7.5
• Model B weighted score = (8 × 0.5) + (7 × 0.5) = 4.0 + 3.5 = 7.5
• Difference = 7.5 - 7.5 = 0.0
Since the margin (0.0) ≤ tie_threshold (0.5), this is declared a tie.
Overall Statistics:
• Win Rate A = 0.0, Win Rate B = 0.0, Tie Rate = 1.0
• Average Margin = 0.0
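The tie rule can be checked directly with a standalone snippet (variable names here are illustrative); since both weighted scores come out to exactly 7.5, the margin is 0.0:

```python
# Example 2: weights {'quality': 0.5, 'fluency': 0.5} already sum to 1.0
score_a = 7 * 0.5 + 8 * 0.5   # 3.5 + 4.0 = 7.5
score_b = 8 * 0.5 + 7 * 0.5   # 4.0 + 3.5 = 7.5
diff = score_a - score_b
margin = abs(diff)
# margin (0.0) <= tie_threshold (0.5), so the comparison is a tie
winner = 'tie' if margin <= 0.5 else ('A' if diff > 0 else 'B')
```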
Input:
comparisons = [
{'id': 1, 'scores_a': {'accuracy': 9, 'clarity': 8, 'helpfulness': 9},
'scores_b': {'accuracy': 5, 'clarity': 6, 'helpfulness': 5}}
]
criteria_weights = {'accuracy': 0.4, 'clarity': 0.3, 'helpfulness': 0.3}
tie_threshold = 0.1

Output:
{'results': [{'id': 1, 'winner': 'A', 'margin': 3.4}], 'win_rate_a': 1.0, 'win_rate_b': 0.0, 'tie_rate': 0.0, 'avg_margin': 3.4}

Multi-criteria comparison with three evaluation dimensions:
Comparison 1:
• Model A weighted score = (9 × 0.4) + (8 × 0.3) + (9 × 0.3) = 3.6 + 2.4 + 2.7 = 8.7
• Model B weighted score = (5 × 0.4) + (6 × 0.3) + (5 × 0.3) = 2.0 + 1.8 + 1.5 = 5.3
• Difference = 8.7 - 5.3 = 3.4
Since 3.4 far exceeds the tie threshold of 0.1, Model A wins decisively with margin 3.4.
This example demonstrates a scenario where one model significantly outperforms the other across all quality dimensions.
Constraints