Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity

TL;DR

Proposes an LLM-as-a-judge framework for verifying final answers in mathematical reasoning, improving evaluation accuracy where rule-based symbolic comparison breaks down.

cs.AI Β· Advanced Β· 2026-04-24
Erez Yosef Oron Anschel Shunit Haviv Hakimi Asaf Gendler Adam Botach Nimrod Berman Igor Kviatkovsky
large language models, mathematical reasoning, evaluation framework, symbolic comparison, machine learning

Key Findings

Methodology

This paper introduces an LLM-based evaluation framework for verifying final answers to mathematical reasoning problems. Rather than relying on traditional symbolic math comparison, it leverages the generalization capabilities and prior knowledge of LLMs. The framework first validates the dataset's ground-truth answer in two stages, independent question answering and dataset answer validation, and then uses an LLM as a judge to score model predictions, with multiple assessments and majority voting for robustness.
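As a rough illustration of the judging step, here is a minimal Python sketch; call_llm is a hypothetical stand-in for any chat-completion client, and the prompt is paraphrased rather than the paper's actual template:

    import random
    from collections import Counter

    def call_llm(prompt: str) -> str:
        raise NotImplementedError("plug in an LLM client here")

    def judge_equivalence(question: str, reference: str, prediction: str,
                          n_votes: int = 5) -> bool:
        """Query the judge several times and take a majority vote."""
        votes = []
        for _ in range(n_votes):
            # Shuffle which answer appears first to reduce position bias.
            pair = [("Answer A", reference), ("Answer B", prediction)]
            random.shuffle(pair)
            shown = "\n".join(f"{label}: {text}" for label, text in pair)
            verdict = call_llm(
                f"Question: {question}\n{shown}\n"
                "Are the two answers mathematically equivalent? Reply YES or NO.")
            votes.append("YES" in verdict.upper())
        return Counter(votes).most_common(1)[0][0]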

Key Results

  • Result 1: On the Qwen2.5-7B model, the LLM-as-a-judge evaluation method improved accuracy by approximately 2.7% over traditional symbolic evaluation methods, particularly on GSM8K and Minerva datasets.
  • Result 2: Comparing SimpleRL and Lighteval frameworks, the LLM-as-a-judge method showed consistent evaluation results across different frameworks, whereas symbolic evaluation methods exhibited significant discrepancies.
  • Result 3: On a meta-evaluation dataset, the LLM-as-a-judge method achieved an F1 score of 0.969, significantly outperforming the symbolic evaluation method's 0.741.

Significance

This research addresses the limitations of traditional symbolic math evaluation methods in handling diverse mathematical representations and answer formats by introducing the LLM-as-a-judge evaluation framework. For academia, this method provides a more reliable means of assessing mathematical reasoning, enabling more accurate performance monitoring and advancing intelligent systems. For industry, it helps improve the accuracy of solving mathematical problems, especially in applications requiring complex mathematical expressions.

Technical Contribution

Technical contributions include: 1) proposing a method for mathematical answer verification that does not rely on symbolic matching, instead using LLMs' semantic understanding capabilities; 2) reducing evaluation bias and increasing robustness through a multi-stage evaluation process; 3) introducing the pass@k metric to assess the diversity and reliability of model outputs, offering a new evaluation perspective.
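For reference, pass@k is commonly computed with the unbiased estimator from Chen et al. (2021), listed in the references below; whether the paper uses exactly this form is an assumption, but it is the standard numerically stable implementation:

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k: n samples per problem, c of them correct."""
        if n - c < k:
            return 1.0  # every size-k subset contains a correct sample
        return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

    print(pass_at_k(n=20, c=5, k=1))  # 0.25: chance a single draw is correct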

Novelty

This study is the first to apply LLMs for final-answer verification in mathematical reasoning, overcoming the limitations of symbolic verification. Compared to existing methods, it can handle diverse mathematical representations and approximation differences, significantly improving evaluation accuracy and robustness.

Limitations

  • Limitation 1: The LLM-as-a-judge method may be limited by the LLM's capabilities when handling certain complex mathematical problems, leading to less accurate evaluation results.
  • Limitation 2: Due to the generative nature of LLMs, evaluation results may be influenced by input response position bias, although random sampling and shuffling can partially mitigate this issue.
  • Limitation 3: On some datasets, errors or inconsistencies within the dataset itself may affect evaluation accuracy, although filtering out inapplicable samples can improve reliability.

Future Work

Future directions include: 1) further optimizing the LLM-as-a-judge evaluation framework to improve accuracy on complex mathematical problems; 2) exploring applications in other domains, such as scientific computing and engineering design; 3) developing more efficient judge LLMs to reduce evaluation cost and latency.

AI Executive Summary

In recent years, large language models (LLMs) have made significant advances in natural language processing and reasoning tasks. However, in mathematical reasoning evaluation, traditional symbolic math comparison methods have limitations, struggling to handle diverse mathematical representations and answer formats. This can lead to inaccurate evaluation results, especially when the answer format differs from expectations.

To address this issue, the paper proposes an LLM-based evaluation framework, LLM-as-a-judge, which leverages the generalization capabilities and prior knowledge of LLMs to evaluate model-generated answers without relying on predefined symbolic verification. The framework first validates the dataset's ground-truth answer in two stages, independent question answering and dataset answer validation, and then uses an LLM as a judge to score model predictions, with multiple assessments and majority voting for robustness.

In experiments, the researchers compared two popular evaluation frameworks, Lighteval and SimpleRL, and demonstrated clear advantages of the LLM-as-a-judge method in handling diverse mathematical representations and answer formats. On the Qwen2.5-7B model, the method improved measured accuracy by approximately 2.7% over traditional symbolic evaluation, particularly on the GSM8K and Minerva datasets.

The significance of this research lies in providing academia and industry with a more reliable means of assessing mathematical reasoning, enabling more accurate performance monitoring and advancing intelligent systems. In industry specifically, it helps improve the accuracy of mathematical problem-solving in applications involving complex mathematical expressions.

However, the method has its own constraints: it can be bounded by the judge LLM's capabilities on certain complex mathematical problems, and its verdicts may be influenced by input response position bias, although random sampling and shuffling partially mitigate this. Future research directions include further optimizing the evaluation framework for complex mathematical problems and exploring applications in other domains.

Deep Analysis

Background

In recent years, large language models (LLMs) have made significant advances in natural language processing and reasoning, and mathematical reasoning has become a standard probe of a model's logical reasoning and problem-solving abilities. Traditional evaluation of mathematical reasoning benchmarks relies on symbolic mathematics tools such as SymPy, which struggle with diverse mathematical representations and answer formats. When a model's answer differs in format from the expected one, symbolic comparison can mark mathematically correct answers as wrong. This has motivated LLM-based evaluation methods aimed at improving the accuracy and robustness of mathematical reasoning evaluation.
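To make this brittleness concrete, here is a minimal sympify-based equivalence check together with the kinds of format variants that defeat it (the answer pairs are illustrative inventions, not examples from the paper):

    from sympy import simplify, sympify, SympifyError

    def symbolically_equal(pred: str, gold: str) -> bool:
        try:
            return simplify(sympify(pred) - sympify(gold)) == 0
        except (SympifyError, TypeError, SyntaxError):
            return False  # unparsable answers get scored as wrong

    print(symbolically_equal("1/2", "0.5"))              # True: clean numeric case
    print(symbolically_equal("sqrt(2)/2", "1/sqrt(2)"))  # True: clean algebraic case
    print(symbolically_equal("5 meters", "5"))           # False: units break parsing
    print(symbolically_equal("x = 5", "5"))              # False: assignment syntax fails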

Core Problem

The core problem in mathematical reasoning evaluation is accurately verifying model-generated answers. Traditional symbolic comparison struggles to generalize across different mathematical expressions and solution formats, so a model can be underestimated simply because its answer is formatted differently, even when it is mathematically correct. Moreover, symbolic verification systems assume the ground-truth answer follows specific notations and formatting styles, adding further uncertainty to the evaluation.

Innovation

The core innovation of this paper is an LLM-based evaluation framework, LLM-as-a-judge:

  β€’ The framework does not rely on traditional symbolic math comparison; it leverages the generalization capabilities and prior knowledge of LLMs to evaluate model-generated answers.
  β€’ It ensures accuracy and consistency through two stages: independent question answering and dataset answer validation.
  β€’ An LLM then acts as a judge over model predictions, with multiple assessments and majority voting for robustness.
  β€’ The pass@k metric is introduced to assess the diversity and reliability of model outputs, offering a new evaluation perspective.

Methodology

The paper proposes an LLM-based evaluation framework called LLM-as-a-judge, organized as a three-step pipeline (a sketch of the first two stages follows this list):

  β€’ Independent question answering: the LLM generates candidate answers for each question without seeing the dataset's ground-truth answer, reducing bias towards the dataset answer.
  β€’ Dataset answer validation: the LLM checks the generated answers against the dataset's ground-truth answer and synthesizes a final validated answer.
  β€’ Judging: an LLM evaluates model predictions against the validated answer, with multiple assessments and majority voting for robustness; responses are randomly sampled and shuffled to reduce input position bias.
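Assuming each stage is a plain prompt to a chat model (the paper's exact prompt templates are not reproduced here), the first two stages might look like the following sketch; call_llm is the same hypothetical stand-in used in the earlier judging sketch:

    def call_llm(prompt: str) -> str:
        raise NotImplementedError("plug in an LLM client here")

    def validate_gold_answer(question: str, dataset_answer: str,
                             n_candidates: int = 3) -> str:
        # Stage 1: answer independently, without seeing the dataset answer,
        # to avoid biasing the model towards it.
        candidates = [
            call_llm(f"Solve the problem and give only the final answer.\n{question}")
            for _ in range(n_candidates)
        ]
        # Stage 2: reconcile the independent answers with the dataset answer
        # and synthesize a single validated ground-truth answer.
        prompt = (
            f"Question: {question}\n"
            f"Dataset answer: {dataset_answer}\n"
            f"Independent answers: {candidates}\n"
            "Return the single final answer best supported by the above."
        )
        return call_llm(prompt)

The validated answer, rather than the raw dataset answer, then serves as the reference in the judging step.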

Experiments

The experimental design evaluates the LLM-as-a-judge method on multiple datasets, including GSM8K and Minerva:

  β€’ Experiments use the Qwen2.5 model series (7B, 14B, and 32B parameters) and compare two popular evaluation frameworks, Lighteval and SimpleRL.
  β€’ The pass@k metric is used to assess the diversity and reliability of model outputs.
  β€’ A meta-evaluation verifies the accuracy of the evaluation method itself: the correctness of model responses is manually annotated, enabling numerical scoring of each verifier and quantification of its contribution (a small sketch of this comparison follows the list).
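As a sketch of the meta-evaluation arithmetic, each verifier's verdicts are scored against the human annotations with the F1 score; the arrays below are invented placeholders, not the paper's data:

    from sklearn.metrics import f1_score

    human_labels      = [1, 1, 0, 1, 0, 0, 1, 1]  # manual correctness labels
    symbolic_verdicts = [1, 0, 0, 1, 0, 0, 0, 1]  # rule-based verifier output
    judge_verdicts    = [1, 1, 0, 1, 0, 0, 1, 1]  # LLM-as-a-judge output

    print("symbolic F1:", f1_score(human_labels, symbolic_verdicts))
    print("judge F1:   ", f1_score(human_labels, judge_verdicts))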

Results

Experimental results show clear gains for the LLM-as-a-judge method:

  β€’ On the Qwen2.5-7B model, it improved accuracy by approximately 2.7% over traditional symbolic evaluation, particularly on the GSM8K and Minerva datasets.
  β€’ Across the SimpleRL and Lighteval frameworks, it produced consistent evaluation results, whereas symbolic evaluation exhibited significant discrepancies.
  β€’ On the meta-evaluation dataset, it achieved an F1 score of 0.969, significantly outperforming the symbolic method's 0.741.

Applications

This method has broad application prospects in academia and industry:

  β€’ In academia, it provides a more reliable means of assessing mathematical reasoning, enabling more accurate performance monitoring and advancing intelligent systems.
  β€’ In industry, it helps improve the accuracy of mathematical problem-solving, especially in applications involving complex mathematical expressions, such as scientific computing and engineering design.

Limitations & Outlook

Despite its strong evaluation accuracy and robustness, the LLM-as-a-judge method has limitations:

  β€’ It may be constrained by the judge LLM's own capabilities on certain complex mathematical problems, yielding less accurate verdicts.
  β€’ Because LLMs are generative, verdicts can be influenced by the position of responses in the input, although random sampling and shuffling partially mitigate this.
  β€’ Errors or inconsistencies within the dataset itself can affect evaluation accuracy, although filtering out inapplicable samples improves reliability.

Future research directions include further optimizing the evaluation framework for complex mathematical problems and exploring applications in other domains.

Plain Language (accessible to non-experts)

Imagine you're in a kitchen cooking. Traditional math evaluation methods are like a recipe book that requires you to follow each step precisely. If you change the order, like adding salt before pepper, the recipe considers it wrong, even if the final dish tastes the same. The LLM-as-a-judge method is like an experienced chef who doesn't care if you add salt or pepper first, as long as the final dish tastes good. This method uses the flexibility and understanding of LLMs to recognize different mathematical expressions and answer formats, as long as the answer is mathematically correct. In this way, the LLM-as-a-judge method can more accurately evaluate model performance without being troubled by subtle differences in format and representation. It's like a smart chef who can adjust cooking steps flexibly based on different ingredients and conditions to make delicious dishes.

ELI14 (explained like you're 14)

Hey there! Have you ever wondered why sometimes in math exams, even if your answer is correct, you don't get full marks because of the format? It's like in a game where you complete a mission, but because you didn't follow the game's specific order, you don't get the reward. Traditional math evaluation methods are like this game system, only looking at whether you followed a specific format. The LLM-as-a-judge method is like a smarter game system that doesn't care how you completed the mission as long as the result is right! This method uses the intelligence of large language models to recognize different mathematical expressions and answer formats, as long as the answer is mathematically correct. This way, we can more accurately evaluate model performance without being troubled by subtle differences in format and representation. Isn't that cool?

Glossary

Large Language Model

A model based on deep learning that can understand and generate natural language text.

Used to evaluate answers to mathematical reasoning problems.

Mathematical Reasoning

A task for evaluating models' logical reasoning and problem-solving abilities.

Used to test the intelligence level of LLMs.

Symbolic Mathematics

A method of performing mathematical calculations and verification using symbols and formulas.

Traditional method for math evaluation.

Evaluation Framework

A system and method for evaluating model performance.

Core to the LLM-as-a-judge method.

Generalization Capability

The ability of a model to perform well on unseen data.

An important feature of LLMs.

Prior Knowledge

Background knowledge acquired by the model during training.

Used to improve evaluation accuracy.

Independent Question Answering

The process where an LLM generates answers without providing the standard answer.

A stage in the evaluation framework.

Dataset Answer Validation

The LLM evaluates the correctness of generated answers against the dataset's standard answers.

A stage in the evaluation framework.

Multiple Assessments

Querying the judge multiple times and aggregating its verdicts by majority vote.

Used to reduce evaluation bias.

Pass@k Metric

The probability that at least one of k sampled responses is correct, used here to assess the diversity and reliability of model outputs.

Used to evaluate model performance.

Open Questions (unanswered questions from this research)

  • 1 How can the LLM-as-a-judge method's accuracy be further improved when handling complex mathematical problems? Current methods may be limited by the LLM's capabilities in certain complex problems, requiring more powerful models or new evaluation strategies.
  • 2 What is the potential for applying the LLM-as-a-judge method in other domains? For example, can this method be applied in scientific computing and engineering design?
  • 3 How can input response position bias in LLM evaluation be further reduced? Although random sampling and shuffling can partially mitigate this issue, further research is needed.
  • 4 How can evaluation accuracy be improved when there are errors or inconsistencies within the dataset itself? Can more intelligent filtering mechanisms be developed to identify and exclude these samples?
  • 5 How can the computational cost of the LLM-as-a-judge method be reduced? Current methods may require significant computational resources, especially when evaluating on large-scale datasets.

Applications

Immediate Applications

Mathematics Education

This method can be used in mathematics education to help teachers more accurately assess students' mathematical reasoning abilities, especially when dealing with complex mathematical expressions.

Scientific Computing

In scientific computing, this method can be used to verify the correctness of complex computational results, improving the reliability and accuracy of computations.

Engineering Design

In engineering design, this method can be used to evaluate the rationality of design schemes, helping engineers better optimize designs.

Long-term Vision

Development of Intelligent Systems

This method helps advance intelligent systems, especially in fields requiring complex mathematical problem-solving, such as autonomous driving and robotics.

Cross-Domain Applications

The method has great potential for application in other fields, such as financial analysis and medical diagnosis, potentially bringing new transformations and opportunities.

Abstract

Recent advancements in large language models have led to significant improvements across various tasks, including mathematical reasoning, which is used to assess models' intelligence in logical reasoning and problem-solving. Models are evaluated on mathematical reasoning benchmarks by verifying the correctness of the final answer against a ground truth answer. A common approach for this verification is based on symbolic mathematics comparison, which fails to generalize across diverse mathematical representations and solution formats. In this work, we offer a robust and flexible alternative to rule-based symbolic mathematics comparison. We propose an LLM-based evaluation framework for evaluating model-generated answers, enabling accurate evaluation across diverse mathematical representations and answer formats. We present failure cases of symbolic evaluation in two popular frameworks, Lighteval and SimpleRL, and compare them to our approach, demonstrating clear improvements over commonly used methods. Our framework enables more reliable evaluation and benchmarking, leading to more accurate performance monitoring, which is important for advancing mathematical problem-solving and intelligent systems.


References (20)

1. Evaluating Large Language Models Trained on Code. Mark Chen, Jerry Tworek, Heewoo Jun et al., 2021.
2. Large Language Models for Data Annotation and Synthesis: A Survey. Zhen Tan, Dawei Li, Song Wang et al., 2024.
3. A Survey on LLM-as-a-Judge. Jiawei Gu, Xuhui Jiang, Zhichao Shi et al., 2024.
4. Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge. Tianhao Wu, Weizhe Yuan, Olga Golovneva et al., 2024.
5. ProcessBench: Identifying Process Errors in Mathematical Reasoning. Chujie Zheng, Zhenru Zhang, Beichen Zhang et al., 2024.
6. Training Verifiers to Solve Math Word Problems. K. Cobbe, Vineet Kosaraju, Mo Bavarian et al., 2021.
7. Do Large Language Model Benchmarks Test Reliability? Joshua Vendrow, Edward Vendrow, Sara Beery et al., 2025.
8. Let's Verify Step by Step. H. Lightman, Vineet Kosaraju, Yura Burda et al., 2023.
9. OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems. Chaoqun He, Renjie Luo, Yuzhuo Bai et al., 2024.
10. Evaluating Mathematical Reasoning of Large Language Models: A Focus on Error Identification and Correction. Xiaoyuan Li, Wenjie Wang, Moxin Li et al., 2024.
11. MathEval: A Comprehensive Benchmark for Evaluating Large Language Models on Mathematical Reasoning Capabilities. Tianqiao Liu, Zui Chen, Zhen Fang et al., 2025.
12. From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline. Tianle Li, Wei-Lin Chiang, Evan Frick et al., 2024.
13. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Lianmin Zheng, Wei-Lin Chiang, Ying Sheng et al., 2023.
14. Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement. An Yang, Beichen Zhang, Binyuan Hui et al., 2024.
15. From Calculation to Adjudication: Examining LLM judges on Mathematical Reasoning Tasks. Andreas Stephan, Dawei Zhu, Matthias Aßenmacher et al., 2024.
16. LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods. Haitao Li, Qian Dong, Junjie Chen et al., 2024.
17. Measuring Mathematical Problem Solving With the MATH Dataset. Dan Hendrycks, Collin Burns, Saurav Kadavath et al., 2021.
18. Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations. Peiyi Wang, Lei Li, Zhihong Shao et al., 2023.
19. SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild. Weihao Zeng, Yuzhen Huang, Qian Liu et al., 2025.
20. Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? Yang Yue, Zhiqin Chen, Rui Lu et al., 2025.