In the evaluation of large language models (LLMs), one of the most widely adopted methodologies is log-probability-based scoring for multiple choice question answering. This approach assesses how well a model can identify correct answers by analyzing the probability distribution it assigns over the answer options.
When a language model is presented with a multiple choice question, it internally computes a probability distribution over all possible answer choices. Rather than working with raw probabilities (which can suffer from numerical underflow for low-probability events), we use log-probabilities—the natural logarithm of the probability values. This representation offers superior numerical stability and computational efficiency.
For each question, the model produces a log-probability score for every answer option. The predicted answer is the choice that receives the highest log-probability, indicating the model's most confident selection. This aligns with the principle of maximum likelihood estimation.
Your evaluation function must compute three essential metrics:
1. Predictions: For each question, determine the predicted answer as the index of the answer choice with the maximum log-probability.
2. Accuracy: The proportion of questions where the model's prediction matches the ground truth correct answer. This is calculated as:
$$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Questions}}$$
3. Average Correct Probability: To understand model calibration, we convert log-probabilities back to probabilities using the softmax function and compute the average probability mass assigned to the correct answer across all questions. For a question with log-probabilities $[l_0, l_1, ..., l_{n-1}]$, the probability for choice $i$ is:
$$P_i = \frac{e^{l_i}}{\sum_{j=0}^{n-1} e^{l_j}}$$
The average of $P_{\text{correct}}$ across all questions gives insight into how confident the model is about correct answers.
When converting log-probabilities to probabilities via softmax, direct computation of $e^{l_i}$ can cause numerical overflow or underflow. A numerically stable approach involves subtracting the maximum log-probability before exponentiation:
$$P_i = \frac{e^{l_i - l_{\max}}}{\sum_{j=0}^{n-1} e^{l_j - l_{\max}}}$$
where $l_{\max} = \max(l_0, l_1, ..., l_{n-1})$.
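The stable formulation above can be sketched in a few lines of Python. The function name `softmax_from_log_probs` is an illustrative choice, not part of any required API:

```python
import math

def softmax_from_log_probs(log_probs):
    """Convert log-probabilities to probabilities with a numerically
    stable softmax (max-subtraction trick)."""
    l_max = max(log_probs)                           # l_max in the formula above
    exps = [math.exp(l - l_max) for l in log_probs]  # e^(l_i - l_max)
    total = sum(exps)
    return [e / total for e in exps]

# The shift leaves the result unchanged: the common factor e^(-l_max)
# cancels between numerator and denominator.
probs = softmax_from_log_probs([-1.0, -2.0, -3.0, -4.0])
# probs[0] ≈ 0.6439, and the probabilities sum to 1
```

Note that without the shift, inputs like `[1000.0, 999.0]` would overflow `math.exp`; with it, the largest exponent is always `e^0 = 1`.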
Implement a function that takes log-probabilities for multiple choice questions and their correct answers, then returns a dictionary containing the accuracy, predicted answers, and average probability assigned to correct answers.
Example 1

Input:
log_probs = [[-1.0, -2.0, -3.0, -4.0], [-2.0, -1.0, -3.0, -4.0]]
correct_answers = [0, 1]

Output:
{'accuracy': 1.0, 'predictions': [0, 1], 'avg_correct_prob': 0.6439}

Explanation:

Question 1: Log-probabilities are [-1.0, -2.0, -3.0, -4.0]
• The highest log-probability is -1.0 at index 0
• Prediction: 0 (matches correct answer 0) ✓
• Using softmax: P(0) = e^(-1.0) / (e^(-1.0) + e^(-2.0) + e^(-3.0) + e^(-4.0)) ≈ 0.6439

Question 2: Log-probabilities are [-2.0, -1.0, -3.0, -4.0]
• The highest log-probability is -1.0 at index 1
• Prediction: 1 (matches correct answer 1) ✓
• Using softmax: P(1) = e^(-1.0) / (e^(-2.0) + e^(-1.0) + e^(-3.0) + e^(-4.0)) ≈ 0.6439

Results:
• Both predictions are correct → Accuracy = 2/2 = 1.0
• Average probability of correct answers = (0.6439 + 0.6439) / 2 ≈ 0.6439
Example 2

Input:
log_probs = [[-1.0, -2.0, -3.0, -4.0], [-2.0, -1.0, -3.0, -4.0], [-3.0, -4.0, -1.0, -2.0]]
correct_answers = [0, 0, 2]

Output:
{'accuracy': 0.6667, 'predictions': [0, 1, 2], 'avg_correct_prob': 0.5082}

Explanation:

Question 1: Log-probs = [-1.0, -2.0, -3.0, -4.0]
• Prediction: 0 (highest at index 0), Correct: 0 ✓
• P(correct) = softmax([-1.0, -2.0, -3.0, -4.0])[0] ≈ 0.6439

Question 2: Log-probs = [-2.0, -1.0, -3.0, -4.0]
• Prediction: 1 (highest at index 1), Correct: 0 ✗
• The model predicted 1, but the correct answer is 0
• P(correct) = softmax([-2.0, -1.0, -3.0, -4.0])[0] ≈ 0.2369

Question 3: Log-probs = [-3.0, -4.0, -1.0, -2.0]
• Prediction: 2 (highest at index 2), Correct: 2 ✓
• P(correct) = softmax([-3.0, -4.0, -1.0, -2.0])[2] ≈ 0.6439

Results:
• 2 out of 3 predictions correct → Accuracy = 2/3 ≈ 0.6667
• Average correct probability = (0.6439 + 0.2369 + 0.6439) / 3 ≈ 0.5082
Example 3

Input:
log_probs = [[-0.5, -1.5, -2.5, -3.5]]
correct_answers = [0]

Output:
{'accuracy': 1.0, 'predictions': [0], 'avg_correct_prob': 0.6439}

Explanation:

Single Question Evaluation:
• Log-probabilities: [-0.5, -1.5, -2.5, -3.5]
• The highest log-probability is -0.5 at index 0
• Prediction: 0, which matches the correct answer 0 ✓

Softmax Calculation (with max-log-prob subtraction for stability):
• Max log-prob = -0.5
• Shifted log-probs: [0, -1.0, -2.0, -3.0]
• Numerators: [e^0, e^(-1), e^(-2), e^(-3)] ≈ [1, 0.368, 0.135, 0.050]
• Denominator: 1 + 0.368 + 0.135 + 0.050 ≈ 1.553
• P(0) = 1 / 1.553 ≈ 0.6439

Results:
• Accuracy = 1/1 = 1.0
• Average correct probability = 0.6439
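Putting the pieces together, one possible implementation in Python follows. The dictionary keys match the example outputs; rounding the two floating-point metrics to 4 decimal places is an assumption made to reproduce the displayed values:

```python
import math

def evaluate_mcq(log_probs, correct_answers):
    """Score multiple choice answers from per-option log-probabilities."""
    predictions = []
    correct_probs = []
    for lp, answer in zip(log_probs, correct_answers):
        # Prediction: index of the maximum log-probability (argmax).
        pred = max(range(len(lp)), key=lambda i: lp[i])
        predictions.append(pred)
        # Stable softmax: subtract the max log-prob before exponentiating.
        l_max = max(lp)
        exps = [math.exp(l - l_max) for l in lp]
        correct_probs.append(exps[answer] / sum(exps))
    n = len(correct_answers)
    accuracy = sum(p == a for p, a in zip(predictions, correct_answers)) / n
    return {
        'accuracy': round(accuracy, 4),
        'predictions': predictions,
        'avg_correct_prob': round(sum(correct_probs) / n, 4),
    }

# Reproduces the second worked example above:
result = evaluate_mcq(
    [[-1.0, -2.0, -3.0, -4.0], [-2.0, -1.0, -3.0, -4.0], [-3.0, -4.0, -1.0, -2.0]],
    [0, 0, 2],
)
# result == {'accuracy': 0.6667, 'predictions': [0, 1, 2], 'avg_correct_prob': 0.5082}
```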
Constraints