In the evaluation of large language models (LLMs), one of the most widely adopted methodologies is log-probability-based scoring for multiple choice question answering. This approach assesses how well a model can identify correct answers by analyzing the probability distribution it assigns over the answer options.
When a language model is presented with a multiple choice question, it internally computes a probability distribution over all possible answer choices. Rather than working with raw probabilities (which can suffer from numerical underflow for low-probability events), we use log-probabilities—the natural logarithm of the probability values. This representation offers superior numerical stability and computational efficiency.
For each question, the model produces a log-probability score for every answer option. The predicted answer is the choice that receives the highest log-probability, indicating the model's most confident selection. This aligns with the principle of maximum likelihood estimation.
Your evaluation function must compute three essential metrics:
1. Predictions: For each question, determine the predicted answer as the index of the answer choice with the maximum log-probability.
2. Accuracy: The proportion of questions where the model's prediction matches the ground truth correct answer. This is calculated as:
$$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Questions}}$$
3. Average Correct Probability: To understand model calibration, we convert log-probabilities back to probabilities using the softmax function and compute the average probability mass assigned to the correct answer across all questions. For a question with log-probabilities $[l_0, l_1, ..., l_{n-1}]$, the probability for choice $i$ is:
$$P_i = \frac{e^{l_i}}{\sum_{j=0}^{n-1} e^{l_j}}$$
The average of $P_{\text{correct}}$ across all questions gives insight into how confident the model is about correct answers.
When converting log-probabilities to probabilities via softmax, direct computation of $e^{l_i}$ can cause numerical overflow or underflow. A numerically stable approach involves subtracting the maximum log-probability before exponentiation:
$$P_i = \frac{e^{l_i - l_{\max}}}{\sum_{j=0}^{n-1} e^{l_j - l_{\max}}}$$
where $l_{\max} = \max(l_0, l_1, ..., l_{n-1})$.
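The stable formulation above can be sketched in a few lines of Python. The function name `softmax_from_log_probs` is an illustrative choice, not part of any required API:

```python
import math

def softmax_from_log_probs(log_probs):
    """Convert log-probabilities to probabilities with a numerically
    stable softmax (max-subtraction trick)."""
    l_max = max(log_probs)                           # l_max in the formula above
    exps = [math.exp(l - l_max) for l in log_probs]  # e^(l_i - l_max)
    total = sum(exps)
    return [e / total for e in exps]

# The shift leaves the result unchanged: the common factor e^(-l_max)
# cancels between numerator and denominator.
probs = softmax_from_log_probs([-1.0, -2.0, -3.0, -4.0])
# probs[0] ≈ 0.6439, and the probabilities sum to 1
```

Note that without the shift, inputs like `[1000.0, 999.0]` would overflow `math.exp`; with it, the largest exponent is always `e^0 = 1`.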
Implement a function that takes log-probabilities for multiple choice questions and their correct answers, then returns a dictionary containing the accuracy, predicted answers, and average probability assigned to correct answers.
Example 1

Input:
log_probs = [[-1.0, -2.0, -3.0, -4.0], [-2.0, -1.0, -3.0, -4.0]]
correct_answers = [0, 1]

Output:
{'accuracy': 1.0, 'predictions': [0, 1], 'avg_correct_prob': 0.6439}

Explanation:

Question 1: Log-probabilities are [-1.0, -2.0, -3.0, -4.0]
• The highest log-probability is -1.0 at index 0
• Prediction: 0 (matches correct answer 0) ✓
• Using softmax: P(0) = e^(-1.0) / (e^(-1.0) + e^(-2.0) + e^(-3.0) + e^(-4.0)) ≈ 0.6439

Question 2: Log-probabilities are [-2.0, -1.0, -3.0, -4.0]
• The highest log-probability is -1.0 at index 1
• Prediction: 1 (matches correct answer 1) ✓
• Using softmax: P(1) = e^(-1.0) / (e^(-2.0) + e^(-1.0) + e^(-3.0) + e^(-4.0)) ≈ 0.6439

Results:
• Both predictions are correct → Accuracy = 2/2 = 1.0
• Average probability of correct answers = (0.6439 + 0.6439) / 2 ≈ 0.6439
Example 2

Input:
log_probs = [[-1.0, -2.0, -3.0, -4.0], [-2.0, -1.0, -3.0, -4.0], [-3.0, -4.0, -1.0, -2.0]]
correct_answers = [0, 0, 2]

Output:
{'accuracy': 0.6667, 'predictions': [0, 1, 2], 'avg_correct_prob': 0.5082}

Explanation:

Question 1: Log-probs = [-1.0, -2.0, -3.0, -4.0]
• Prediction: 0 (highest at index 0), Correct: 0 ✓
• P(correct) = softmax([-1.0, -2.0, -3.0, -4.0])[0] ≈ 0.6439

Question 2: Log-probs = [-2.0, -1.0, -3.0, -4.0]
• Prediction: 1 (highest at index 1), Correct: 0 ✗
• The model predicted 1, but the correct answer is 0
• P(correct) = softmax([-2.0, -1.0, -3.0, -4.0])[0] ≈ 0.2369

Question 3: Log-probs = [-3.0, -4.0, -1.0, -2.0]
• Prediction: 2 (highest at index 2), Correct: 2 ✓
• P(correct) = softmax([-3.0, -4.0, -1.0, -2.0])[2] ≈ 0.6439

Results:
• 2 out of 3 predictions correct → Accuracy = 2/3 ≈ 0.6667
• Average correct probability = (0.6439 + 0.2369 + 0.6439) / 3 ≈ 0.5082
Example 3

Input:
log_probs = [[-0.5, -1.5, -2.5, -3.5]]
correct_answers = [0]

Output:
{'accuracy': 1.0, 'predictions': [0], 'avg_correct_prob': 0.6439}

Explanation:

Single Question Evaluation:
• Log-probabilities: [-0.5, -1.5, -2.5, -3.5]
• The highest log-probability is -0.5 at index 0
• Prediction: 0, which matches the correct answer 0 ✓

Softmax Calculation (with max-log-prob subtraction for stability):
• Max log-prob = -0.5
• Shifted log-probs: [0, -1.0, -2.0, -3.0]
• Numerators: [e^0, e^(-1), e^(-2), e^(-3)] ≈ [1, 0.368, 0.135, 0.050]
• Denominator: 1 + 0.368 + 0.135 + 0.050 ≈ 1.553
• P(0) = 1 / 1.553 ≈ 0.6439

Results:
• Accuracy = 1/1 = 1.0
• Average correct probability = 0.6439
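Putting the pieces together, one possible implementation in Python follows. The dictionary keys match the example outputs; rounding the two floating-point metrics to 4 decimal places is an assumption made to reproduce the displayed values:

```python
import math

def evaluate_mcq(log_probs, correct_answers):
    """Score multiple choice answers from per-option log-probabilities."""
    predictions = []
    correct_probs = []
    for lp, answer in zip(log_probs, correct_answers):
        # Prediction: index of the maximum log-probability (argmax).
        pred = max(range(len(lp)), key=lambda i: lp[i])
        predictions.append(pred)
        # Stable softmax: subtract the max log-prob before exponentiating.
        l_max = max(lp)
        exps = [math.exp(l - l_max) for l in lp]
        correct_probs.append(exps[answer] / sum(exps))
    n = len(correct_answers)
    accuracy = sum(p == a for p, a in zip(predictions, correct_answers)) / n
    return {
        'accuracy': round(accuracy, 4),
        'predictions': predictions,
        'avg_correct_prob': round(sum(correct_probs) / n, 4),
    }

# Reproduces the second worked example above:
result = evaluate_mcq(
    [[-1.0, -2.0, -3.0, -4.0], [-2.0, -1.0, -3.0, -4.0], [-3.0, -4.0, -1.0, -2.0]],
    [0, 0, 2],
)
# result == {'accuracy': 0.6667, 'predictions': [0, 1, 2], 'avg_correct_prob': 0.5082}
```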
Constraints