In modern AI evaluation pipelines, particularly when assessing the quality of generative model outputs, a common and robust approach is to aggregate ratings from multiple evaluators across weighted quality dimensions. This technique, often employed in LLM-as-a-judge frameworks, provides a more reliable assessment than single evaluator ratings by capturing consensus and measuring agreement.
Consider a scenario where you have N evaluators (which could be human annotators, different AI models, or specialized evaluation modules) each rating a response across K distinct quality criteria. Each criterion might represent dimensions such as accuracy, coherence, relevance, style, or safety. Since not all criteria are equally important for a given use case, each dimension is assigned a weight that reflects its relative significance.
The Evaluation Model:
Given:
• judge_scores[i][j]: the score (0 to max_score) given by evaluator i for criterion j
• criteria_weights: a list where criteria_weights[j] is the importance weight for criterion j (all weights sum to 1)
• passing_threshold: the minimum normalized score required for approval
• max_score: the maximum achievable score per criterion

Your function should compute:
Criterion Scores: For each criterion j, compute the average score across all evaluators:
$$\text{criterion_score}_j = \frac{1}{N} \sum_{i=1}^{N} \text{judge_scores}[i][j]$$
Weighted Score: Combine the criterion scores using their weights: $$\text{weighted_score} = \sum_{j=1}^{K} \text{criteria_weights}[j] \times \text{criterion_score}_j$$
Normalized Score: Scale the weighted score to a 0-1 range: $$\text{normalized_score} = \frac{\text{weighted_score}}{\text{max_score}}$$
Pass Status: Determine if the normalized score meets or exceeds the passing threshold.
Evaluator Agreement: Calculate a metric (0 to 1) indicating how much evaluators agree with each other. Use a standard deviation-based approach: compute the population standard deviation of the evaluators' scores for each criterion, average these values across criteria, and compare against the maximum possible standard deviation. Higher agreement (lower variance) yields values closer to 1.
$$\text{agreement} = 1 - \frac{\text{mean_std}}{\text{max_possible_std}}$$
Where max_possible_std is the standard deviation when scores are maximally dispersed (half at 0, half at max_score).
Your Task: Implement a function that takes the evaluator scores, criteria weights, passing threshold, and maximum score, then returns a comprehensive evaluation dictionary with all computed metrics.
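One way to assemble all five steps is sketched below. It assumes population standard deviation for the agreement metric and rounds results to four decimals, consistent with the worked examples that follow; the function and variable names are illustrative, not prescribed:

```python
# Sketch of the five-step aggregation, assuming population standard
# deviation (ddof = 0) and max_possible_std = max_score / 2.
from statistics import mean, pstdev

def aggregate_evaluation(judge_scores, criteria_weights, passing_threshold, max_score):
    n_criteria = len(criteria_weights)
    # Step 1: average each criterion's scores across all evaluators
    criterion_scores = [mean(row[j] for row in judge_scores) for j in range(n_criteria)]
    # Step 2: combine criterion averages using their weights
    weighted_score = sum(w * s for w, s in zip(criteria_weights, criterion_scores))
    # Step 3: scale the weighted score to the 0-1 range
    normalized_score = weighted_score / max_score
    # Step 4: compare against the passing threshold
    pass_status = normalized_score >= passing_threshold
    # Step 5: agreement from per-criterion population standard deviations
    mean_std = mean(pstdev(row[j] for row in judge_scores) for j in range(n_criteria))
    max_possible_std = max_score / 2  # half at 0, half at max_score
    judge_agreement = 1 - mean_std / max_possible_std
    return {
        'weighted_score': round(weighted_score, 4),
        'normalized_score': round(normalized_score, 4),
        'criterion_scores': [round(s, 4) for s in criterion_scores],
        'pass_status': pass_status,
        'judge_agreement': round(judge_agreement, 4),
    }
```

Applied to the first example below, this sketch reproduces the expected dictionary exactly.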
Example 1:
judge_scores = [[4, 5, 3], [4, 4, 4], [5, 5, 4]]
criteria_weights = [0.3, 0.5, 0.2]
passing_threshold = 0.6
max_score = 5.0

Output: {'weighted_score': 4.3667, 'normalized_score': 0.8733, 'criterion_scores': [4.3333, 4.6667, 3.6667], 'pass_status': True, 'judge_agreement': 0.8114}

Three evaluators assess a response across 3 quality criteria.
Step 1 - Criterion Scores:
• Criterion 0 (weight 0.3): Scores [4, 4, 5] → Average = 4.3333
• Criterion 1 (weight 0.5): Scores [5, 4, 5] → Average = 4.6667
• Criterion 2 (weight 0.2): Scores [3, 4, 4] → Average = 3.6667
Step 2 - Weighted Score: (4.3333 × 0.3) + (4.6667 × 0.5) + (3.6667 × 0.2) = 1.3 + 2.3333 + 0.7333 = 4.3667
Step 3 - Normalized Score: 4.3667 / 5.0 = 0.8733
Step 4 - Pass Status: 0.8733 ≥ 0.6 → True (passes the threshold)
Step 5 - Agreement: Standard deviations: [0.4714, 0.4714, 0.4714], so mean_std = 0.4714. With max_possible_std = 5.0 / 2 = 2.5, agreement = 1 - (0.4714 / 2.5) = 0.8114.
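The agreement figure in Step 5 can be checked in a few lines (a sketch assuming population standard deviation, computed here with the standard library's statistics.pstdev):

```python
# Check Step 5: per-criterion population std devs, averaged, then
# compared against max_possible_std = max_score / 2 = 2.5.
from statistics import pstdev

judge_scores = [[4, 5, 3], [4, 4, 4], [5, 5, 4]]
columns = list(zip(*judge_scores))        # scores grouped per criterion
stds = [pstdev(col) for col in columns]   # roughly 0.4714 for each criterion
mean_std = sum(stds) / len(stds)
agreement = 1 - mean_std / 2.5            # roughly 0.8114
```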
Example 2:
judge_scores = [[5, 5], [5, 5]]
criteria_weights = [0.5, 0.5]
passing_threshold = 0.6
max_score = 5.0

Output: {'weighted_score': 5.0, 'normalized_score': 1.0, 'criterion_scores': [5.0, 5.0], 'pass_status': True, 'judge_agreement': 1.0}

Two evaluators give perfect scores (5) on both criteria.
Step 1 - Criterion Scores:
• Criterion 0: [5, 5] → Average = 5.0
• Criterion 1: [5, 5] → Average = 5.0
Step 2 - Weighted Score: (5.0 × 0.5) + (5.0 × 0.5) = 5.0
Step 3 - Normalized Score: 5.0 / 5.0 = 1.0 (perfect score)
Step 4 - Pass Status: 1.0 ≥ 0.6 → True
Step 5 - Agreement: Since all evaluators gave identical scores (zero standard deviation for all criteria), agreement = 1.0 (perfect consensus).
Example 3:
judge_scores = [[1, 1, 1], [2, 2, 2]]
criteria_weights = [0.4, 0.3, 0.3]
passing_threshold = 0.6
max_score = 5.0

Output: {'weighted_score': 1.5, 'normalized_score': 0.3, 'criterion_scores': [1.5, 1.5, 1.5], 'pass_status': False, 'judge_agreement': 0.8}

Two evaluators rate a response across 3 criteria, with one consistently scoring lower.
Step 1 - Criterion Scores: • All criteria receive scores [1, 2] → Average = 1.5 for each
Step 2 - Weighted Score: (1.5 × 0.4) + (1.5 × 0.3) + (1.5 × 0.3) = 0.6 + 0.45 + 0.45 = 1.5
Step 3 - Normalized Score: 1.5 / 5.0 = 0.3
Step 4 - Pass Status: 0.3 < 0.6 → False (fails the threshold)
Step 5 - Agreement: Standard deviation of [1, 2] = 0.5 for each criterion. Compared to maximum possible std of 2.5, agreement = 1 - (0.5/2.5) = 0.8.
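As a one-line sanity check of this step (again assuming population standard deviation):

```python
# pstdev([1, 2]) is exactly 0.5, so agreement = 1 - (0.5 / 2.5) = 0.8
from statistics import pstdev

agreement = 1 - pstdev([1, 2]) / 2.5
```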
Constraints