In the realm of generative AI and reasoning models, evaluating model performance requires specialized metrics that go beyond simple accuracy measures. When a model generates multiple candidate solutions for a single problem—a common practice in chain-of-thought reasoning and test-time compute scaling—we need principled ways to aggregate these outputs into meaningful performance indicators.
This problem introduces two fundamental evaluation metrics used extensively in modern AI systems:
Sample Accuracy, also known as Pass@1, measures the expected probability that a single randomly sampled response from the model is correct. Given a set of responses where each response is either correct or incorrect, the sample accuracy is simply the proportion of correct responses:
$$\text{Sample Accuracy} = \frac{\text{Number of Correct Responses}}{\text{Total Number of Responses}}$$
This metric answers the question: "If I sample exactly one response from this model, what's the probability it will be correct?"
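A minimal sketch of the first required function, assuming the interface described below (a list of booleans in, a float out; returning 0.0 for an empty list is an assumption, since the problem does not specify that case):

```python
def compute_sample_accuracy(responses_correct: list[bool]) -> float:
    """Sample accuracy (pass@1): fraction of responses that are correct."""
    if not responses_correct:
        return 0.0  # assumption: define the empty case as 0.0
    # True counts as 1 and False as 0, so sum() gives the number of correct responses
    return sum(responses_correct) / len(responses_correct)
```

For example, `compute_sample_accuracy([True, False, True, False])` returns `0.5`.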
Consensus Voting, also called Majority Voting or Self-Consistency Decoding, is a powerful technique that aggregates multiple model outputs by selecting the most frequently occurring answer. The insight is that while individual samples may contain errors, the wisdom of the crowd effect can filter out inconsistent mistakes—correct answers tend to be more consistent across samples while errors are often random and diverse.
The consensus answer is determined by counting the occurrences of each distinct answer and selecting the one that appears most frequently.
This technique is particularly effective for tasks with a short, verifiable final answer, such as math problems and multiple-choice questions, where incorrect samples tend to disagree with one another while correct samples converge on the same answer.
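The voting step above can be sketched with `collections.Counter`; this assumes responses are plain strings and that ties may be broken by first appearance in the input (the problem statement does not specify tie-breaking):

```python
from collections import Counter

def consensus_vote(responses: list[str]) -> str:
    """Return the most frequent response (majority / self-consistency vote)."""
    counts = Counter(responses)
    # most_common(1) yields [(answer, count)] for the top answer; for equal
    # counts, Counter preserves first-insertion order, so ties go to the
    # answer seen first (an assumed tie-breaking rule)
    return counts.most_common(1)[0][0]
```

For example, `consensus_vote(["A", "B", "A", "A", "B"])` returns `"A"`, since "A" occurs three times versus two for "B".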
Implement two functions:
- `compute_sample_accuracy`: calculate the sample accuracy (pass@1) given a list of boolean correctness indicators
- `consensus_vote`: determine the majority answer given a list of string responses

These metrics form the foundation of evaluation frameworks used in state-of-the-art language models and reasoning systems.
Input: `responses_correct = [True, False, True, False]`
Output: `0.5`

Out of 4 responses, 2 are correct (True) and 2 are incorrect (False).
Sample Accuracy = 2 ÷ 4 = 0.5
This means if you randomly sample one response from this model, there's a 50% chance it will be correct. This is the pass@1 metric commonly reported in reasoning model evaluations.
Input: `responses = ["A", "B", "A", "A", "B"]`
Output: `"A"`

Counting the frequency of each response:
- "A" appears 3 times
- "B" appears 2 times
Since "A" has the highest frequency (3 > 2), the consensus vote selects "A" as the majority answer. This demonstrates how majority voting can extract the correct answer even when some individual samples are wrong.
Input: `responses_correct = [True, True, True, True]`
Output: `1.0`

All 4 responses are correct.
Sample Accuracy = 4 ÷ 4 = 1.0
A perfect score of 1.0 indicates that every sampled response was correct. This represents optimal model performance where the model consistently produces correct answers.
Constraints