In the era of Large Language Models (LLMs), evaluating the correctness of mathematical reasoning has become a critical challenge. When an LLM generates a mathematical answer, it may express the solution in a form that differs from the expected ground truth—yet both answers could be mathematically equivalent. For example, 0.5, 1/2, and 2/4 all represent the same value, and a robust evaluation system must recognize this equivalence.
Your Task: Implement a function that determines whether two mathematical answer strings are semantically equivalent, accounting for various representational differences while maintaining numerical precision.
The Core Challenge: Building a reliable math answer validator requires handling multiple expression formats and parsing them into comparable numerical values. Your function must be both precise (correctly identifying equivalences) and safe (gracefully handling unparseable or invalid expressions).
Expression Types to Handle:
Direct String Equality: If both strings are identical, they are trivially equivalent.
Numeric Values: Parse and compare integers (e.g., 42) and floating-point numbers (e.g., 3.14159) with tolerance-based comparison.
Fraction Expressions: Evaluate fractions like 1/2, -3/4, or 22/7 by performing the division and comparing the result.
Square Root Expressions: Parse expressions containing sqrt(n) and evaluate them mathematically. For example, sqrt(4) should equal 2.
Pi Expressions: Recognize pi as the mathematical constant π (approximately 3.14159265...) and handle expressions like 2*pi or pi/2.
Arithmetic Combinations: Handle basic arithmetic operations combining the above, such as sqrt(2)/2 or 3*pi/4.
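One way to cover the expression types listed above without calling `eval` on untrusted model output is to walk Python's own syntax tree and accept only a small whitelist of nodes. The following is a minimal sketch, not a reference solution. The names `parse_math_value` and `_eval_node` are hypothetical, and the choice to return `None` for anything unparseable is an assumption made for this example.

```python
import ast
import math
import operator

# Whitelisted operators (an assumption about what "basic arithmetic" covers).
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _eval_node(node):
    """Recursively evaluate a whitelisted AST node to a float."""
    # Plain numbers: ints and floats, but not booleans or complex literals.
    if isinstance(node, ast.Constant):
        if isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
            return float(node.value)
        raise ValueError("unsupported constant")
    # The bare name `pi` maps to math.pi; every other name is rejected.
    if isinstance(node, ast.Name) and node.id == "pi":
        return math.pi
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        left = _eval_node(node.left)
        right = _eval_node(node.right)
        return _BINARY_OPS[type(node.op)](left, right)
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval_node(node.operand))
    # Only sqrt(<one argument>) is accepted as a function call.
    if (
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == "sqrt"
        and len(node.args) == 1
        and not node.keywords
    ):
        return math.sqrt(_eval_node(node.args[0]))
    raise ValueError("unsupported expression")


def parse_math_value(text):
    """Return the float value of `text`, or None if it cannot be parsed."""
    try:
        tree = ast.parse(text.strip(), mode="eval")
        value = _eval_node(tree.body)
    except (SyntaxError, ValueError, ZeroDivisionError, OverflowError, RecursionError):
        return None
    # Treat inf and nan as unparseable rather than comparable values.
    return value if math.isfinite(value) else None
```

Because only the listed node types are evaluated, an input such as `__import__('os')` is rejected instead of executed, and `1/0` or `sqrt(-1)` yields `None` rather than raising.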
Comparison Logic: Two parsed numerical values should be considered equivalent if their absolute difference is less than or equal to the specified tolerance (default: 1e-6).
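The comparison flow can be sketched as follows. This is an illustration under stated assumptions, not the expected solution. The name `is_equivalent`, the `parse` parameter, and the helper `_to_float` are hypothetical. The default parser here handles only plain numbers and `a/b` fractions (via the standard library's `fractions.Fraction`), so expressions with `sqrt` or `pi` would need a fuller parser passed in as `parse`.

```python
from fractions import Fraction


def _to_float(text):
    """Parse a plain number or an a/b fraction; return None on failure.

    Hypothetical helper: it does not understand sqrt(...) or pi.
    """
    try:
        return float(Fraction(text.strip()))
    except (ValueError, ZeroDivisionError, OverflowError):
        return None


def is_equivalent(predicted, ground_truth, tolerance=1e-6, parse=_to_float):
    """Return True if the two answer strings denote the same value."""
    # Identical strings are equivalent without any parsing.
    if predicted == ground_truth:
        return True
    a = parse(predicted)
    b = parse(ground_truth)
    # If either side cannot be parsed, report non-equivalence instead of raising.
    if a is None or b is None:
        return False
    # Inclusive bound, matching "less than or equal to the specified tolerance".
    return abs(a - b) <= tolerance
```

With these assumptions, `is_equivalent('0.3333333', '1/3')` is accepted because the difference is about 3.3e-8, while `is_equivalent('0.333', '1/3')` is rejected because the difference is about 3.3e-4.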
Edge Case Handling:
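Identical strings return True immediately, with no parsing. If either expression cannot be parsed, the function returns False.
predicted = '1/2'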
ground_truth = '0.5'
Output: True
Explanation: The function first checks for string equality (they differ). It then parses '1/2' as a fraction, computing 1 ÷ 2 = 0.5. The ground truth '0.5' is parsed directly as a float. Since |0.5 - 0.5| = 0 ≤ 1e-6 (the default tolerance), the function returns True, confirming mathematical equivalence.
predicted = '42'
ground_truth = '42'
Output: True
Explanation: Both strings are identical, so the function immediately returns True via the direct string equality check. No numerical parsing is required for this trivial case.
predicted = 'sqrt(4)'
ground_truth = '2'
Output: True
Explanation: The function parses 'sqrt(4)' by identifying the sqrt() function and computing √4 = 2.0. The ground truth '2' is parsed as the integer 2. Since |2.0 - 2| = 0 ≤ 1e-6, the answers are considered equivalent.
Constraints