In modern AI evaluation pipelines, particularly when assessing the quality of generative model outputs, a common and robust approach is to aggregate ratings from multiple evaluators across weighted quality dimensions. This technique, often employed in LLM-as-a-judge frameworks, provides a more reliable assessment than single evaluator ratings by capturing consensus and measuring agreement.
Consider a scenario where you have N evaluators (which could be human annotators, different AI models, or specialized evaluation modules) each rating a response across K distinct quality criteria. Each criterion might represent dimensions such as accuracy, coherence, relevance, style, or safety. Since not all criteria are equally important for a given use case, each dimension is assigned a weight that reflects its relative significance.
The Evaluation Model:
Given:
• judge_scores[i][j]: the score (0 to max_score) given by evaluator i for criterion j
• criteria_weights: a list where criteria_weights[j] is the importance weight for criterion j (all weights sum to 1)
• passing_threshold: the minimum normalized score required for approval
• max_score: the maximum achievable score per criterion

Your function should compute:
Criterion Scores: For each criterion j, compute the average score across all evaluators:
$$\text{criterion_score}_j = \frac{1}{N} \sum_{i=1}^{N} \text{judge_scores}[i][j]$$
Weighted Score: Combine the criterion scores using their weights: $$\text{weighted_score} = \sum_{j=1}^{K} \text{criteria_weights}[j] \times \text{criterion_score}_j$$
Normalized Score: Scale the weighted score to a 0-1 range: $$\text{normalized_score} = \frac{\text{weighted_score}}{\text{max_score}}$$
Pass Status: Determine if the normalized score meets or exceeds the passing threshold.
Evaluator Agreement: Calculate a metric (0 to 1) indicating how much evaluators agree with each other. Use a standard deviation-based approach: compute the population standard deviation of the evaluators' scores for each criterion, average these values across criteria, and compare against the maximum possible standard deviation. Higher agreement (lower variance) yields values closer to 1.
$$\text{agreement} = 1 - \frac{\text{mean_std}}{\text{max_possible_std}}$$
Where max_possible_std is the standard deviation when scores are maximally dispersed (half at 0, half at max_score).
Your Task: Implement a function that takes the evaluator scores, criteria weights, passing threshold, and maximum score, then returns a comprehensive evaluation dictionary with all computed metrics.
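One way to assemble all five steps is sketched below. It assumes population standard deviation for the agreement metric and rounds results to four decimals, consistent with the worked examples that follow; the function and variable names are illustrative, not prescribed:

```python
# Sketch of the five-step aggregation, assuming population standard
# deviation (ddof = 0) and max_possible_std = max_score / 2.
from statistics import mean, pstdev

def aggregate_evaluation(judge_scores, criteria_weights, passing_threshold, max_score):
    n_criteria = len(criteria_weights)
    # Step 1: average each criterion's scores across all evaluators
    criterion_scores = [mean(row[j] for row in judge_scores) for j in range(n_criteria)]
    # Step 2: combine criterion averages using their weights
    weighted_score = sum(w * s for w, s in zip(criteria_weights, criterion_scores))
    # Step 3: scale the weighted score to the 0-1 range
    normalized_score = weighted_score / max_score
    # Step 4: compare against the passing threshold
    pass_status = normalized_score >= passing_threshold
    # Step 5: agreement from per-criterion population standard deviations
    mean_std = mean(pstdev(row[j] for row in judge_scores) for j in range(n_criteria))
    max_possible_std = max_score / 2  # half at 0, half at max_score
    judge_agreement = 1 - mean_std / max_possible_std
    return {
        'weighted_score': round(weighted_score, 4),
        'normalized_score': round(normalized_score, 4),
        'criterion_scores': [round(s, 4) for s in criterion_scores],
        'pass_status': pass_status,
        'judge_agreement': round(judge_agreement, 4),
    }
```

Applied to the first example below, this sketch reproduces the expected dictionary exactly.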
Example 1:
judge_scores = [[4, 5, 3], [4, 4, 4], [5, 5, 4]]
criteria_weights = [0.3, 0.5, 0.2]
passing_threshold = 0.6
max_score = 5.0

Output: {'weighted_score': 4.3667, 'normalized_score': 0.8733, 'criterion_scores': [4.3333, 4.6667, 3.6667], 'pass_status': True, 'judge_agreement': 0.8114}

Three evaluators assess a response across 3 quality criteria.
Step 1 - Criterion Scores:
• Criterion 0 (weight 0.3): Scores [4, 4, 5] → Average = 4.3333
• Criterion 1 (weight 0.5): Scores [5, 4, 5] → Average = 4.6667
• Criterion 2 (weight 0.2): Scores [3, 4, 4] → Average = 3.6667
Step 2 - Weighted Score: (4.3333 × 0.3) + (4.6667 × 0.5) + (3.6667 × 0.2) = 1.3 + 2.3333 + 0.7333 = 4.3667
Step 3 - Normalized Score: 4.3667 / 5.0 = 0.8733
Step 4 - Pass Status: 0.8733 ≥ 0.6 → True (passes the threshold)
Step 5 - Agreement: Standard deviations: [0.4714, 0.4714, 0.4714], so mean_std = 0.4714. With max_possible_std = 5.0 / 2 = 2.5, agreement = 1 - (0.4714 / 2.5) = 0.8114.
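The agreement figure in Step 5 can be checked in a few lines (a sketch assuming population standard deviation, computed here with the standard library's statistics.pstdev):

```python
# Check Step 5: per-criterion population std devs, averaged, then
# compared against max_possible_std = max_score / 2 = 2.5.
from statistics import pstdev

judge_scores = [[4, 5, 3], [4, 4, 4], [5, 5, 4]]
columns = list(zip(*judge_scores))        # scores grouped per criterion
stds = [pstdev(col) for col in columns]   # roughly 0.4714 for each criterion
mean_std = sum(stds) / len(stds)
agreement = 1 - mean_std / 2.5            # roughly 0.8114
```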
Example 2:
judge_scores = [[5, 5], [5, 5]]
criteria_weights = [0.5, 0.5]
passing_threshold = 0.6
max_score = 5.0

Output: {'weighted_score': 5.0, 'normalized_score': 1.0, 'criterion_scores': [5.0, 5.0], 'pass_status': True, 'judge_agreement': 1.0}

Two evaluators give perfect scores (5) on both criteria.
Step 1 - Criterion Scores:
• Criterion 0: [5, 5] → Average = 5.0
• Criterion 1: [5, 5] → Average = 5.0
Step 2 - Weighted Score: (5.0 × 0.5) + (5.0 × 0.5) = 5.0
Step 3 - Normalized Score: 5.0 / 5.0 = 1.0 (perfect score)
Step 4 - Pass Status: 1.0 ≥ 0.6 → True
Step 5 - Agreement: Since all evaluators gave identical scores (zero standard deviation for all criteria), agreement = 1.0 (perfect consensus).
Example 3:
judge_scores = [[1, 1, 1], [2, 2, 2]]
criteria_weights = [0.4, 0.3, 0.3]
passing_threshold = 0.6
max_score = 5.0

Output: {'weighted_score': 1.5, 'normalized_score': 0.3, 'criterion_scores': [1.5, 1.5, 1.5], 'pass_status': False, 'judge_agreement': 0.8}

Two evaluators rate a response across 3 criteria, with one consistently scoring lower.
Step 1 - Criterion Scores: • All criteria receive scores [1, 2] → Average = 1.5 for each
Step 2 - Weighted Score: (1.5 × 0.4) + (1.5 × 0.3) + (1.5 × 0.3) = 0.6 + 0.45 + 0.45 = 1.5
Step 3 - Normalized Score: 1.5 / 5.0 = 0.3
Step 4 - Pass Status: 0.3 < 0.6 → False (fails the threshold)
Step 5 - Agreement: Standard deviation of [1, 2] = 0.5 for each criterion. Compared to maximum possible std of 2.5, agreement = 1 - (0.5/2.5) = 0.8.
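As a one-line sanity check of this step (again assuming population standard deviation):

```python
# pstdev([1, 2]) is exactly 0.5, so agreement = 1 - (0.5 / 2.5) = 0.8
from statistics import pstdev

agreement = 1 - pstdev([1, 2]) / 2.5
```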
Constraints