In the era of AI-powered code generation, validating machine-generated code has become a critical component of evaluation pipelines. When assessing Large Language Models (LLMs) on programming tasks, we need robust systems that can automatically determine whether generated code produces correct outputs across a diverse set of test cases.
The Challenge: You are building an Automated Code Assessment System that evaluates the correctness of program outputs. Given a collection of test execution results, your validator must compare expected and actual outputs to determine whether each test case passed, failed, or encountered an execution error.
System Requirements: Your validator receives a list of test case results, where each result is a dictionary containing:
• 'expected': the expected output string
• 'actual': the actual output produced by the code (None if execution failed)
• 'status': the execution status:
  • 'success': Code executed without errors
  • 'error': Code threw an exception or runtime error
  • 'timeout': Code exceeded the time limit

Validation Logic:
• Any test case whose status indicates an execution failure ('error' or 'timeout') should immediately be marked as 'error' in the verdict, regardless of its expected or actual output
• For successful executions, if both the expected and actual outputs can be parsed as numbers, compare them numerically with an absolute tolerance of 1e-4: a difference within tolerance is a 'pass', otherwise a 'fail'
• If either output cannot be parsed as a number, fall back to exact string comparison: equal strings yield 'pass', unequal strings yield 'fail'

Output Format: Your function should return a dictionary with:
• pass_rate: The proportion of tests that passed, rounded to 4 decimal places
• error_rate: The proportion of tests with execution errors, rounded to 4 decimal places
• passed_count: Integer count of passed tests
• total_count: Total number of test cases
• verdicts: A list of verdict strings ('pass', 'fail', or 'error') corresponding to each input test case

Why This Matters: This type of validator is essential for benchmarks like HumanEval, MBPP, and APPS that evaluate code generation capabilities. It enables researchers and practitioners to systematically measure model performance, compare different approaches, and track improvements over time. The inclusion of numeric tolerance ensures fair evaluation when dealing with floating-point arithmetic, while the distinct error handling separates code stability from correctness.
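The rules above can be sketched as a single function. This is a minimal illustration, not a prescribed implementation: the name validate_outputs, the tol parameter, and the try/except numeric fallback are assumptions for the sketch.

```python
def validate_outputs(test_cases, tol=1e-4):
    """Score a list of test results and return the summary dictionary.

    Sketch under the stated rules: 'error'/'timeout' short-circuit to an
    'error' verdict; numeric outputs compare within an absolute tolerance
    (1e-4 by default); everything else falls back to exact string equality.
    """
    verdicts = []
    for case in test_cases:
        if case['status'] in ('error', 'timeout'):
            # Execution failures are marked 'error' regardless of outputs.
            verdicts.append('error')
            continue
        expected, actual = case['expected'], case['actual']
        try:
            # Numeric comparison with absolute tolerance when both parse.
            matched = abs(float(expected) - float(actual)) <= tol
        except (TypeError, ValueError):
            # Non-numeric outputs: exact string comparison.
            matched = expected == actual
        verdicts.append('pass' if matched else 'fail')

    total = len(test_cases)
    passed = verdicts.count('pass')
    errors = verdicts.count('error')
    return {
        'pass_rate': round(passed / total, 4) if total else 0.0,
        'error_rate': round(errors / total, 4) if total else 0.0,
        'passed_count': passed,
        'total_count': total,
        'verdicts': verdicts,
    }


# Mirrors the first worked example below.
print(validate_outputs([
    {'expected': '5', 'actual': '5', 'status': 'success'},
    {'expected': 'foo', 'actual': 'bar', 'status': 'success'},
    {'expected': '10', 'actual': None, 'status': 'error'},
])['verdicts'])  # ['pass', 'fail', 'error']
```

Catching TypeError alongside ValueError matters because actual is None for failed executions, and float(None) raises TypeError rather than ValueError.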
Example 1:

test_cases = [
    {'expected': '5', 'actual': '5', 'status': 'success'},
    {'expected': 'foo', 'actual': 'bar', 'status': 'success'},
    {'expected': '10', 'actual': None, 'status': 'error'}
]

Expected Output:

{'pass_rate': 0.3333, 'error_rate': 0.3333, 'passed_count': 1, 'total_count': 3, 'verdicts': ['pass', 'fail', 'error']}

Analysis of each test case:
• Test Case 1: Status is 'success', and both expected ('5') and actual ('5') outputs match exactly. Verdict: pass ✓
• Test Case 2: Status is 'success', but expected ('foo') does not equal actual ('bar'). Neither can be parsed as numbers, so exact string comparison is used. Verdict: fail ✗
• Test Case 3: Status is 'error', indicating the code crashed or threw an exception. Regardless of expected/actual values, this is marked as an execution failure. Verdict: error ⚠️
Final Metrics: 1 of 3 tests passed and 1 errored, giving a pass rate of 0.3333 (1/3 rounded to 4 decimal places) and an error rate of 0.3333; the remaining test counts as a plain failure.
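The rounding of these rates can be checked directly: with 1 pass and 1 error out of 3 test cases, each rate is 1/3 rounded to 4 decimal places.

```python
# 1 pass and 1 error out of 3 test cases, each rounded to 4 decimal places.
pass_rate = round(1 / 3, 4)
error_rate = round(1 / 3, 4)
print(pass_rate, error_rate)  # 0.3333 0.3333
```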
Example 2:

test_cases = [
    {'expected': 'hello', 'actual': 'hello', 'status': 'success'},
    {'expected': '42', 'actual': '42', 'status': 'success'},
    {'expected': '3.14', 'actual': '3.14', 'status': 'success'}
]

Expected Output:

{'pass_rate': 1.0, 'error_rate': 0.0, 'passed_count': 3, 'total_count': 3, 'verdicts': ['pass', 'pass', 'pass']}

Perfect execution scenario:
• Test Case 1: String 'hello' matches exactly. Verdict: pass ✓
• Test Case 2: Numeric value '42' matches exactly. Verdict: pass ✓
• Test Case 3: Floating-point '3.14' matches exactly. Verdict: pass ✓
Final Metrics: All 3 tests passed with no errors, resulting in a 100% pass rate (1.0) and 0% error rate (0.0). This represents an ideal code submission that correctly handles all test cases.
Example 3:

test_cases = [
    {'expected': '3.14159', 'actual': '3.141590001', 'status': 'success'},
    {'expected': 'test', 'actual': 'test', 'status': 'success'},
    {'expected': '100', 'actual': None, 'status': 'timeout'}
]

Expected Output:

{'pass_rate': 0.6667, 'error_rate': 0.3333, 'passed_count': 2, 'total_count': 3, 'verdicts': ['pass', 'pass', 'error']}

Demonstrating numeric tolerance and timeout handling:
• Test Case 1: Both values parse as floats (3.14159 and 3.141590001). The difference is approximately 0.000000001, which is well within the tolerance of 1e-4. Verdict: pass ✓
• Test Case 2: Exact string match for 'test'. Verdict: pass ✓
• Test Case 3: Status is 'timeout', meaning the code ran too long. This is treated as an execution error regardless of expected output. Verdict: error ⚠️
Final Metrics: 2 of 3 tests passed and 1 timed out, giving a pass rate of 0.6667 (2/3 rounded to 4 decimal places) and an error rate of 0.3333.
This example showcases the importance of numeric tolerance in fair evaluation—small floating-point differences due to implementation details or hardware should not penalize correct solutions.
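The tolerance behavior in Test Case 1 can be isolated in a small comparison helper. The 1e-4 absolute tolerance comes from the problem statement; the helper name outputs_match and the try/except fallback are illustrative assumptions.

```python
def outputs_match(expected, actual, tol=1e-4):
    """Compare two outputs numerically when both parse as floats,
    otherwise fall back to exact string equality."""
    try:
        return abs(float(expected) - float(actual)) <= tol
    except (TypeError, ValueError):
        return expected == actual


print(outputs_match('3.14159', '3.141590001'))  # True: diff ~1e-9 is within 1e-4
print(outputs_match('foo', 'bar'))              # False: non-numeric, strings differ
```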
Constraints