In the era of AI-powered code generation, validating machine-generated code has become a critical component of evaluation pipelines. When assessing Large Language Models (LLMs) on programming tasks, we need robust systems that can automatically determine whether generated code produces correct outputs across a diverse set of test cases.
The Challenge: You are building an Automated Code Assessment System that evaluates the correctness of program outputs. Given a collection of test execution results, your validator must compare expected and actual outputs to determine whether each test case passed, failed, or encountered an execution error.
System Requirements: Your validator receives a list of test case results, where each result is a dictionary containing:
• 'expected': the expected output string
• 'actual': the actual output produced by the code (None if execution failed)
• 'status': the execution status:
  • 'success': Code executed without errors
  • 'error': Code threw an exception or runtime error
  • 'timeout': Code exceeded the time limit

Validation Logic:
• Any test case whose status indicates an execution failure ('error' or 'timeout') should immediately be marked as 'error' in the verdict, regardless of its expected or actual output
• For successful executions, if both the expected and actual outputs can be parsed as numbers, compare them numerically with an absolute tolerance of 1e-4: a difference within tolerance is a 'pass', otherwise a 'fail'
• If either output cannot be parsed as a number, fall back to exact string comparison: equal strings yield 'pass', unequal strings yield 'fail'

Output Format: Your function should return a dictionary with:
• pass_rate: The proportion of tests that passed, rounded to 4 decimal places
• error_rate: The proportion of tests with execution errors, rounded to 4 decimal places
• passed_count: Integer count of passed tests
• total_count: Total number of test cases
• verdicts: A list of verdict strings ('pass', 'fail', or 'error') corresponding to each input test case

Why This Matters: This type of validator is essential for benchmarks like HumanEval, MBPP, and APPS that evaluate code generation capabilities. It enables researchers and practitioners to systematically measure model performance, compare different approaches, and track improvements over time. The inclusion of numeric tolerance ensures fair evaluation when dealing with floating-point arithmetic, while the distinct error handling separates code stability from correctness.
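The rules above can be sketched as a single function. This is a minimal illustration, not a prescribed implementation: the name validate_outputs, the tol parameter, and the try/except numeric fallback are assumptions for the sketch.

```python
def validate_outputs(test_cases, tol=1e-4):
    """Score a list of test results and return the summary dictionary.

    Sketch under the stated rules: 'error'/'timeout' short-circuit to an
    'error' verdict; numeric outputs compare within an absolute tolerance
    (1e-4 by default); everything else falls back to exact string equality.
    """
    verdicts = []
    for case in test_cases:
        if case['status'] in ('error', 'timeout'):
            # Execution failures are marked 'error' regardless of outputs.
            verdicts.append('error')
            continue
        expected, actual = case['expected'], case['actual']
        try:
            # Numeric comparison with absolute tolerance when both parse.
            matched = abs(float(expected) - float(actual)) <= tol
        except (TypeError, ValueError):
            # Non-numeric outputs: exact string comparison.
            matched = expected == actual
        verdicts.append('pass' if matched else 'fail')

    total = len(test_cases)
    passed = verdicts.count('pass')
    errors = verdicts.count('error')
    return {
        'pass_rate': round(passed / total, 4) if total else 0.0,
        'error_rate': round(errors / total, 4) if total else 0.0,
        'passed_count': passed,
        'total_count': total,
        'verdicts': verdicts,
    }


# Mirrors the first worked example below.
print(validate_outputs([
    {'expected': '5', 'actual': '5', 'status': 'success'},
    {'expected': 'foo', 'actual': 'bar', 'status': 'success'},
    {'expected': '10', 'actual': None, 'status': 'error'},
])['verdicts'])  # ['pass', 'fail', 'error']
```

Catching TypeError alongside ValueError matters because actual is None for failed executions, and float(None) raises TypeError rather than ValueError.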
Example 1:

test_cases = [
    {'expected': '5', 'actual': '5', 'status': 'success'},
    {'expected': 'foo', 'actual': 'bar', 'status': 'success'},
    {'expected': '10', 'actual': None, 'status': 'error'}
]

Expected Output:

{'pass_rate': 0.3333, 'error_rate': 0.3333, 'passed_count': 1, 'total_count': 3, 'verdicts': ['pass', 'fail', 'error']}

Analysis of each test case:
• Test Case 1: Status is 'success', and both expected ('5') and actual ('5') outputs match exactly. Verdict: pass ✓
• Test Case 2: Status is 'success', but expected ('foo') does not equal actual ('bar'). Neither can be parsed as numbers, so exact string comparison is used. Verdict: fail ✗
• Test Case 3: Status is 'error', indicating the code crashed or threw an exception. Regardless of expected/actual values, this is marked as an execution failure. Verdict: error ⚠️
Final Metrics: 1 of 3 tests passed and 1 errored, giving a pass rate of 0.3333 (1/3 rounded to 4 decimal places) and an error rate of 0.3333; the remaining test counts as a plain failure.
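The rounding of these rates can be checked directly: with 1 pass and 1 error out of 3 test cases, each rate is 1/3 rounded to 4 decimal places.

```python
# 1 pass and 1 error out of 3 test cases, each rounded to 4 decimal places.
pass_rate = round(1 / 3, 4)
error_rate = round(1 / 3, 4)
print(pass_rate, error_rate)  # 0.3333 0.3333
```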
Example 2:

test_cases = [
    {'expected': 'hello', 'actual': 'hello', 'status': 'success'},
    {'expected': '42', 'actual': '42', 'status': 'success'},
    {'expected': '3.14', 'actual': '3.14', 'status': 'success'}
]

Expected Output:

{'pass_rate': 1.0, 'error_rate': 0.0, 'passed_count': 3, 'total_count': 3, 'verdicts': ['pass', 'pass', 'pass']}

Perfect execution scenario:
• Test Case 1: String 'hello' matches exactly. Verdict: pass ✓
• Test Case 2: Numeric value '42' matches exactly. Verdict: pass ✓
• Test Case 3: Floating-point '3.14' matches exactly. Verdict: pass ✓
Final Metrics: All 3 tests passed with no errors, resulting in a 100% pass rate (1.0) and 0% error rate (0.0). This represents an ideal code submission that correctly handles all test cases.
Example 3:

test_cases = [
    {'expected': '3.14159', 'actual': '3.141590001', 'status': 'success'},
    {'expected': 'test', 'actual': 'test', 'status': 'success'},
    {'expected': '100', 'actual': None, 'status': 'timeout'}
]

Expected Output:

{'pass_rate': 0.6667, 'error_rate': 0.3333, 'passed_count': 2, 'total_count': 3, 'verdicts': ['pass', 'pass', 'error']}

Demonstrating numeric tolerance and timeout handling:
• Test Case 1: Both values parse as floats (3.14159 and 3.141590001). The difference is approximately 0.000000001, which is well within the tolerance of 1e-4. Verdict: pass ✓
• Test Case 2: Exact string match for 'test'. Verdict: pass ✓
• Test Case 3: Status is 'timeout', meaning the code ran too long. This is treated as an execution error regardless of expected output. Verdict: error ⚠️
Final Metrics: 2 of 3 tests passed and 1 timed out, giving a pass rate of 0.6667 (2/3 rounded to 4 decimal places) and an error rate of 0.3333.
This example showcases the importance of numeric tolerance in fair evaluation—small floating-point differences due to implementation details or hardware should not penalize correct solutions.
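The tolerance behavior in Test Case 1 can be isolated in a small comparison helper. The 1e-4 absolute tolerance comes from the problem statement; the helper name outputs_match and the try/except fallback are illustrative assumptions.

```python
def outputs_match(expected, actual, tol=1e-4):
    """Compare two outputs numerically when both parse as floats,
    otherwise fall back to exact string equality."""
    try:
        return abs(float(expected) - float(actual)) <= tol
    except (TypeError, ValueError):
        return expected == actual


print(outputs_match('3.14159', '3.141590001'))  # True: diff ~1e-9 is within 1e-4
print(outputs_match('foo', 'bar'))              # False: non-numeric, strings differ
```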
Constraints