In the realm of generative AI and reasoning models, evaluating model performance requires specialized metrics that go beyond simple accuracy measures. When a model generates multiple candidate solutions for a single problem—a common practice in chain-of-thought reasoning and test-time compute scaling—we need principled ways to aggregate these outputs into meaningful performance indicators.
This problem introduces two fundamental evaluation metrics used extensively in modern AI systems:
Sample Accuracy, also known as Pass@1, measures the expected probability that a single randomly sampled response from the model is correct. Given a set of responses where each response is either correct or incorrect, the sample accuracy is simply the proportion of correct responses:
$$\text{Sample Accuracy} = \frac{\text{Number of Correct Responses}}{\text{Total Number of Responses}}$$
This metric answers the question: "If I sample exactly one response from this model, what's the probability it will be correct?"
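A minimal sketch of the first required function, assuming the interface described below (a list of booleans in, a float out; returning 0.0 for an empty list is an assumption, since the problem does not specify that case):

```python
def compute_sample_accuracy(responses_correct: list[bool]) -> float:
    """Sample accuracy (pass@1): fraction of responses that are correct."""
    if not responses_correct:
        return 0.0  # assumption: define the empty case as 0.0
    # True counts as 1 and False as 0, so sum() gives the number of correct responses
    return sum(responses_correct) / len(responses_correct)
```

For example, `compute_sample_accuracy([True, False, True, False])` returns `0.5`.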
Consensus Voting, also called Majority Voting or Self-Consistency Decoding, is a powerful technique that aggregates multiple model outputs by selecting the most frequently occurring answer. The insight is that while individual samples may contain errors, the wisdom of the crowd effect can filter out inconsistent mistakes—correct answers tend to be more consistent across samples while errors are often random and diverse.
The consensus answer is determined by counting the occurrences of each distinct answer and selecting the one that appears most frequently.
This technique is particularly effective for tasks with a short, verifiable final answer, such as math problems and multiple-choice questions, where incorrect samples tend to disagree with one another while correct samples converge on the same answer.
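The voting step above can be sketched with `collections.Counter`; this assumes responses are plain strings and that ties may be broken by first appearance in the input (the problem statement does not specify tie-breaking):

```python
from collections import Counter

def consensus_vote(responses: list[str]) -> str:
    """Return the most frequent response (majority / self-consistency vote)."""
    counts = Counter(responses)
    # most_common(1) yields [(answer, count)] for the top answer; for equal
    # counts, Counter preserves first-insertion order, so ties go to the
    # answer seen first (an assumed tie-breaking rule)
    return counts.most_common(1)[0][0]
```

For example, `consensus_vote(["A", "B", "A", "A", "B"])` returns `"A"`, since "A" occurs three times versus two for "B".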
Implement two functions:
- `compute_sample_accuracy`: calculate the sample accuracy (pass@1) given a list of boolean correctness indicators
- `consensus_vote`: determine the majority answer given a list of string responses

These metrics form the foundation of evaluation frameworks used in state-of-the-art language models and reasoning systems.
Input: `responses_correct = [True, False, True, False]`
Output: `0.5`

Out of 4 responses, 2 are correct (True) and 2 are incorrect (False).
Sample Accuracy = 2 ÷ 4 = 0.5
This means if you randomly sample one response from this model, there's a 50% chance it will be correct. This is the pass@1 metric commonly reported in reasoning model evaluations.
Input: `responses = ["A", "B", "A", "A", "B"]`
Output: `"A"`

Counting the frequency of each response:
- "A" appears 3 times
- "B" appears 2 times
Since "A" has the highest frequency (3 > 2), the consensus vote selects "A" as the majority answer. This demonstrates how majority voting can extract the correct answer even when some individual samples are wrong.
Input: `responses_correct = [True, True, True, True]`
Output: `1.0`

All 4 responses are correct.
Sample Accuracy = 4 ÷ 4 = 1.0
A perfect score of 1.0 indicates that every sampled response was correct. This represents optimal model performance where the model consistently produces correct answers.
Constraints