Finding Answers in Thought Matters: Revisiting Evaluation on Large Language Models with Reasoning
Jo, Lee, Lee et al.
Evaluating generative models, such as large language models (LLMs), commonly involves question-answering tasks where the final answer is selected based on probability of answer choices. On the other hand, for models requiring reasoning, the method of answer extraction plays a critical role. Our research reveals that the performance of reasoning models and their final answer distributions are highly sensitive to the answer extraction algorithm employed. In order to mitigate this, we propose a basic framework: Answer Regeneration. The method uses an additional model inference, providing the prior input and output prefaced by the prompt "Answer:". The final answer is then selected or extracted from the regenerated output. We show that this extraction-rule-agnostic approach exhibits improved performance and enhanced robustness. Furthermore, we have applied this framework to general math problems and open-ended question answering tasks. Our analysis and this framework could offer more reliable results for model evaluation.
This paper investigates a critical issue in evaluating the reasoning capabilities of large language models (LLMs): the strong effect of the answer extraction method on measured performance. The research shows that both the performance of reasoning models and the distribution of their final answers depend heavily on the answer extraction algorithm used. To address this problem, the authors propose an "Answer Regeneration" framework: an additional model inference step receives the prior input and output together with the prompt "Answer:", and the final answer is selected or extracted from the regenerated output, which makes the evaluation robust to the choice of extraction rule.
Traditional LLM evaluation typically selects the final answer based on the probabilities of the answer choices, but for models that reason before answering, the answer must be recovered from generated text, so the extraction method becomes critical. Existing rule-based extraction methods suffer from the following issues:
Format Diversity: Reasoning model outputs vary widely in format, and a single extraction rule cannot cover all cases
Inter-model Differences: Different models use different answer formats, requiring customized extraction rules for each model
Evaluation Inconsistency: The same model output may receive completely different evaluation results depending on the extraction rule used
Reproducibility Issues: Discrepancies between publicly reported performance and reproduction results may stem from undisclosed answer extraction methods
Evaluation Fairness: Rule-based methods may introduce bias toward certain models
Special Characteristics of Reasoning Models: The complexity of Chain-of-Thought (CoT) reasoning outputs renders traditional evaluation methods inadequate
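The evaluation inconsistency listed above can be illustrated with a small sketch. The two extraction rules and the sample output below are illustrative assumptions, not rules taken from the paper; they only show how one reasoning output can yield two different "final answers" depending on the rule applied.

```python
import re
from typing import Optional


def extract_first_letter(output: str) -> Optional[str]:
    """Hypothetical rule 1: take the first standalone option letter (A-D)."""
    match = re.search(r"\b([A-D])\b", output)
    return match.group(1) if match else None


def extract_after_answer_phrase(output: str) -> Optional[str]:
    """Hypothetical rule 2: take the letter after the last 'answer is' phrase."""
    matches = re.findall(r"answer is\s*\(?([A-D])\b", output)
    return matches[-1] if matches else None


# A made-up chain-of-thought output that mentions a rejected option first.
output = (
    "Option A looks plausible at first, but the units do not match. "
    "Checking the remaining choices, the answer is (C)."
)

rule_1 = extract_first_letter(output)
rule_2 = extract_after_answer_phrase(output)
print(rule_1, rule_2)
```

Here the first rule picks up the option the model rejected, while the second picks up the option it settled on, so the same output would be scored differently under each rule.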
First systematic study of how sensitive reasoning model evaluation is to the answer extraction method, revealing this overlooked but critical problem
Proposes the Answer Regeneration framework, achieving robust evaluation independent of extraction rules
Demonstrates the generalizability of the method, achieving improvements across multiple task types including multiple-choice questions, mathematical problems, and open-ended question answering
Provides more reliable model rankings that better match expectations (e.g., larger models outperforming smaller ones)
Given the output of a reasoning model (containing the complete reasoning process), the task is to accurately extract its final answer for evaluation. Traditional methods rely on manually crafted regular expression rules, while this paper proposes a generative solution.
Unlike traditional pattern matching, this approach leverages the model's own generative capabilities to "rephrase" the final answer, avoiding the complexity of format parsing.
Separates the reasoning process from answer generation, with the reasoning phase focusing on the thought process and the generation phase focusing on answer output.
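A minimal sketch of the regeneration step, under stated assumptions: the newline-separated prompt layout, the `generate` callable, and the single-letter extraction are hypothetical choices for illustration, since the source only says that the prior input and output are provided with the prompt "Answer:" and that the final answer is selected or extracted from the regenerated output. The stub model stands in for a real inference call.

```python
import re
from typing import Callable, Optional


def build_regeneration_prompt(question: str, reasoning_output: str) -> str:
    """Join the prior input and prior output, then append the 'Answer:' cue."""
    # Assumed layout: question, reasoning, and cue separated by newlines.
    return f"{question}\n{reasoning_output}\nAnswer:"


def regenerate_answer(
    question: str,
    reasoning_output: str,
    generate: Callable[[str], str],
    choices: str = "ABCD",
) -> Optional[str]:
    """Run one extra inference and read the option letter from its short output."""
    prompt = build_regeneration_prompt(question, reasoning_output)
    regenerated = generate(prompt)
    # The regenerated text is short, so a single simple pattern is enough here.
    match = re.search(rf"\b([{re.escape(choices)}])\b", regenerated)
    return match.group(1) if match else None


def stub_generate(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; always continues with ' C'."""
    return " C"


question = "Which unit measures force?\nA. joule\nB. watt\nC. newton\nD. pascal"
reasoning = "Joules measure energy and watts measure power. Force is measured in newtons."
answer = regenerate_answer(question, reasoning, stub_generate)
print(answer)
```

Because the second pass only has to state the answer, the text to parse is short and uniform, which is why the extraction step no longer needs model-specific rules in this sketch.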
Prior research has focused on how prompt variations at the input level affect performance, but has not systematically studied answer extraction at the output level.
This work is the first to show systematically that answer extraction is an overlooked yet critical problem in reasoning model evaluation.
Summary: Although the technical contribution is relatively simple, the paper identifies and addresses an important problem in reasoning model evaluation. The proposed Answer Regeneration framework offers a practical way to evaluate reasoning models fairly and robustly, and it supports standardization and reproducibility in the field. Limitations include the computational overhead of the additional model inference.