Finding Answers in Thought Matters: Revisiting Evaluation on Large Language Models with Reasoning

Jo, Lee, Lee et al.
Evaluating generative models, such as large language models (LLMs), commonly involves question-answering tasks where the final answer is selected based on the probability of the answer choices. On the other hand, for models requiring reasoning, the method of answer extraction plays a critical role. Our research reveals that the performance of reasoning models and their final answer distributions are highly sensitive to the answer extraction algorithm employed. In order to mitigate this, we propose a basic framework: Answer Regeneration. The method uses an additional model inference, providing the prior input and output prefaced by the prompt "Answer:". The final answer is then selected or extracted from the regenerated output. We show that this extraction-rule-agnostic approach exhibits improved performance and enhanced robustness. Furthermore, we have applied this framework to general math problems and open-ended question answering tasks. Our analysis and this framework could offer more reliable results for model evaluation.

Basic Information

  • Paper ID: 2510.14773
  • Title: Finding Answers in Thought Matters: Revisiting Evaluation on Large Language Models with Reasoning
  • Authors: Hwiyeol Jo, Joosung Lee, Jaehong Lee, Sang-Woo Lee, Joonsuk Park, Kang Min Yoo
  • Classification: cs.CL cs.AI
  • Publication Date: October 16, 2025
  • Paper Link: https://arxiv.org/abs/2510.14773

Abstract

This paper investigates a critical issue in evaluating the reasoning capabilities of large language models (LLMs): the significant impact of answer extraction methods on model performance assessment. The research reveals that the performance of reasoning models and the distribution of their final answers are highly sensitive to the answer extraction algorithm used. To address this problem, the authors propose an "Answer Regeneration" framework: one additional model inference step, prompted with the prior input and output followed by "Answer:", regenerates the final answer, which makes evaluation robust to the choice of extraction rule.

Research Background and Motivation

Core Problem

Traditional LLM evaluation typically selects the final answer by comparing the probabilities the model assigns to the answer choices. For models that reason before answering, the answer must instead be extracted from generated text, so the extraction method becomes critical. Existing rule-based extraction methods suffer from the following issues:

  1. Format Diversity: Reasoning model outputs vary widely in format, and a single extraction rule cannot cover all cases
  2. Inter-model Differences: Different models use different answer formats, requiring customized extraction rules for each model
  3. Evaluation Inconsistency: The same model output may receive completely different evaluation results depending on the extraction rule used

Research Motivation

  • Reproducibility Issues: Discrepancies between publicly reported performance and reproduction results may stem from undisclosed answer extraction methods
  • Evaluation Fairness: Rule-based methods may introduce bias toward certain models
  • Special Characteristics of Reasoning Models: The complexity of Chain-of-Thought (CoT) reasoning outputs renders traditional evaluation methods inadequate

Core Contributions

  1. First systematic study of how sensitive reasoning model evaluation is to the answer extraction method, revealing this overlooked but critical problem
  2. Proposes the Answer Regeneration framework, achieving robust evaluation independent of extraction rules
  3. Demonstrates the generalizability of the method, achieving improvements across multiple task types including multiple-choice questions, mathematical problems, and open-ended question answering
  4. Provides more reliable model ranking, making evaluation results more intuitive (e.g., larger models outperforming smaller ones)

Methodology Details

Task Definition

Given the output of a reasoning model (containing the complete reasoning process), the task is to accurately extract its final answer for evaluation. Traditional methods rely on manually crafted regular expression rules, while this paper proposes a generative solution.

Answer Regeneration Framework

Overall Architecture

Original Input + Reasoning Output + "Answer:" → Additional Model Inference → Simplified Final Answer

Core Steps

  1. Input Preparation: Combine the original question, model's reasoning process, and "Answer:" prompt
  2. Regeneration: Run the model (in non-reasoning mode) for one additional inference step
  3. Answer Extraction: Extract the final answer from the simplified output
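
The three steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: `generate` is a hypothetical callable standing in for a single non-reasoning model call, and both the prompt layout and the extraction pattern are assumptions based on the description above.

```python
import re
from typing import Callable, Optional


def build_regeneration_prompt(question: str, reasoning_output: str) -> str:
    """Step 1: join the original input, the reasoning output, and the "Answer:" cue."""
    # Assumed layout: the summary does not specify separators or chat templates.
    return f"{question}\n{reasoning_output}\nAnswer:"


def regenerate_answer(
    question: str,
    reasoning_output: str,
    generate: Callable[[str], str],  # hypothetical: prompt -> short completion
) -> Optional[str]:
    """Steps 2 and 3: one extra inference call, then a simple extraction."""
    prompt = build_regeneration_prompt(question, reasoning_output)
    short_output = generate(prompt)
    # The regenerated text is short, so a loose pattern is enough here.
    # Assumed pattern: a standalone option letter A-J.
    match = re.search(r"\b([A-J])\b", short_output)
    return match.group(1) if match else None


def fake_generate(prompt: str) -> str:
    """Stand-in for a real model call, used only to make the sketch runnable."""
    return " B"


answer = regenerate_answer("Which option is correct?", "<long reasoning>", fake_generate)
```

In a real harness, `fake_generate` would be replaced by a call to the evaluated model with reasoning disabled; the rest of the flow stays the same.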

Technical Advantages

  • Probability-based Foundation: For multiple-choice questions, probability-based answer selection can be used
  • Output Simplification: Generated answers have a more concise format, facilitating extraction
  • Rule-independent: Does not rely on complex manual rules
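
For the multiple-choice case, probability-based selection after the "Answer:" cue might look like the sketch below. The log-probability values are made up for illustration, and the function names are hypothetical; how the per-choice scores are obtained depends on the evaluation toolkit.

```python
import math
from typing import Dict


def select_by_probability(choice_logprobs: Dict[str, float]) -> str:
    """Return the answer choice with the highest log-probability."""
    return max(choice_logprobs, key=choice_logprobs.get)


def normalize(choice_logprobs: Dict[str, float]) -> Dict[str, float]:
    """Softmax over the choices, useful for inspecting the answer distribution."""
    peak = max(choice_logprobs.values())
    exps = {c: math.exp(lp - peak) for c, lp in choice_logprobs.items()}
    total = sum(exps.values())
    return {c: v / total for c, v in exps.items()}


# Hypothetical log-probabilities of each option letter following "Answer:".
example = {"A": -3.2, "B": -0.4, "C": -2.7, "D": -4.1}
selected = select_by_probability(example)
distribution = normalize(example)
```

Because the selection compares scores over a fixed set of options, it never fails to produce an answer, which is one reason it needs no extraction rule.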

Technical Innovations

1. Generative Answer Extraction

Unlike traditional pattern matching, this approach leverages the model's own generative capabilities to "rephrase" the final answer, avoiding the complexity of format parsing.

2. Reasoning-Generation Separation

Separates the reasoning process from answer generation, with the reasoning phase focusing on the thought process and the generation phase focusing on answer output.

3. Adaptability

The framework automatically adapts to different task types and answer formats without requiring tuning for specific models or tasks.

Experimental Setup

Datasets

  • MMLU: Multi-domain multiple-choice knowledge test, serving as the primary evaluation benchmark
  • MMLU-Pro: More complex multiple-choice benchmark with dynamic option counts
  • GSM8K: Mathematical reasoning problems with short answer format
  • TriviaQA: Open-ended question answering task

Evaluation Models

  • Qwen3 Series: Qwen3-32B, Qwen3-14B, Qwen3-8B
  • DeepSeek-R1 Series: R1-Distill-Llama-8B, R1-Qwen3-8B

Comparison Methods

  1. strict-match: Exact string matching ("answer is X")
  2. flexible-extract: Flexible option extraction (searching for (A), (B), etc.)
  3. instructed-format: Guided format output
  4. answer-is-correct: Optimized strict matching
  5. last-extract: Extract the last uppercase letter
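
To make the differences between these rules concrete, the sketch below implements simplified versions of three of them. The regular expressions are assumptions written for illustration; they are not the patterns used in the paper or in lm-evaluation-harness.

```python
import re
from typing import Optional


def strict_match(text: str) -> Optional[str]:
    """First occurrence of the literal phrase "answer is X" (assumed pattern)."""
    m = re.search(r"answer is \(?([A-D])\)?", text)
    return m.group(1) if m else None


def flexible_extract(text: str) -> Optional[str]:
    """First parenthesized option letter such as (A) (assumed pattern)."""
    m = re.search(r"\(([A-D])\)", text)
    return m.group(1) if m else None


def last_extract(text: str) -> Optional[str]:
    """Last standalone uppercase option letter (assumed pattern)."""
    letters = re.findall(r"\b([A-D])\b", text)
    return letters[-1] if letters else None


# A made-up reasoning output that mentions several options.
output = (
    "Option (A) looks plausible at first, but the units do not match. "
    "Checking again, the answer is C. Options B and D can be ruled out."
)

extracted = {
    "strict-match": strict_match(output),
    "flexible-extract": flexible_extract(output),
    "last-extract": last_extract(output),
}
```

On this single output the three rules return three different letters (C, A, and D), which is the kind of inconsistency the paper attributes to rule-based extraction.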

Implementation Details

  • Uses lm-evaluation-harness toolkit
  • Temperature set to 0.6, top-p to 0.95, top-k to 20
  • Maximum generation length limited to 4096 tokens

Experimental Results

Main Results

Significant Performance Fluctuations

Different extraction methods lead to substantial performance variations:

  • Qwen3-32B accuracy range across different methods: 75.8% - 87.1%
  • Model ranking can be completely altered by extraction method

Clear Advantages of Answer Regeneration

Answer Regeneration achieves the best performance across all tested models:

Model          Best Rule-based Method   Answer Regeneration   Improvement (percentage points)
Qwen3-32B      82.1%                    87.1%                 +5.0
Qwen3-14B      83.8%                    85.0%                 +1.2
Qwen3-8B       82.1%                    83.3%                 +1.2
R1-Llama-8B    64.8%                    68.8%                 +4.0
R1-Qwen3-8B    77.6%                    80.7%                 +3.1
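
As a quick arithmetic check, the improvement figures can be recomputed from the two accuracy columns. The snippet below uses only the numbers reported in this summary, plus the 75.8% to 87.1% range quoted for Qwen3-32B; the variable names are illustrative.

```python
# (best rule-based accuracy, Answer Regeneration accuracy), in percent
results = {
    "Qwen3-32B": (82.1, 87.1),
    "Qwen3-14B": (83.8, 85.0),
    "Qwen3-8B": (82.1, 83.3),
    "R1-Llama-8B": (64.8, 68.8),
    "R1-Qwen3-8B": (77.6, 80.7),
}

# Improvement in percentage points, rounded to one decimal like the table.
improvements = {
    model: round(regen - rule, 1) for model, (rule, regen) in results.items()
}

# Spread across extraction methods for Qwen3-32B, from the reported range.
spread_qwen3_32b = round(87.1 - 75.8, 1)
```

The recomputed differences match the table, and the Qwen3-32B spread comes to 11.3 percentage points.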

Ablation Studies

Answer Inconsistency Analysis

The same model output may be parsed as different answers by different extraction methods:

  • Some methods extract answers from the reasoning process
  • Some methods extract formatted final answers
  • Some methods fail due to formatting issues

Incomplete Reasoning Handling

Answer Regeneration performs better when handling incomplete reasoning outputs:

  • Traditional methods tend to fail when reasoning is truncated
  • Regeneration method can provide answers based on available information

Human Evaluation Verification

In human evaluation of 300 samples:

  • Answer Regeneration consistency with human annotation: 84.2%
  • Best rule-based method consistency with human annotation: 61.7%
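
One plausible way to compute such a consistency figure is the fraction of samples where the extracted answer equals the human-annotated one. The sketch below assumes that definition; the summary does not state the exact protocol, and the sample data is invented.

```python
from typing import Optional, Sequence


def agreement_rate(extracted: Sequence[Optional[str]], human: Sequence[str]) -> float:
    """Fraction of samples where the extracted answer equals the human label.

    A failed extraction (None) counts as a disagreement.
    """
    if len(extracted) != len(human):
        raise ValueError("both sequences must have the same length")
    if not human:
        raise ValueError("at least one sample is required")
    matches = sum(e == h for e, h in zip(extracted, human))
    return matches / len(human)


# Invented example: four samples, one failed extraction.
human_labels = ["A", "C", "B", "D"]
rule_based = ["A", "C", None, "D"]
rate = agreement_rate(rule_based, human_labels)
```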

Cross-task Generalization

MMLU-Pro Results

Answer Regeneration maintains its advantage on more complex benchmarks and approaches official reported performance.

GSM8K Mathematical Reasoning

Answer Regeneration also performs best on mathematical tasks:

  • More robust handling of LaTeX format (\boxed{})
  • Human evaluation shows 16.3% vs 6.1% correct rate difference
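
To illustrate why `\boxed{}` is awkward for rule-based extraction, the sketch below pulls out the content of the last `\boxed{...}` by counting braces, since a simple regular expression breaks on nested braces such as `\frac{3}{4}`. This is an illustrative helper, not the extraction rule used in the paper.

```python
from typing import Optional


def extract_last_boxed(text: str) -> Optional[str]:
    """Return the content of the last \\boxed{...}, handling nested braces."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None  # no boxed answer at all
    depth = 1
    content = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(content)
        content.append(ch)
    return None  # unbalanced braces, e.g. the output was truncated


nested = extract_last_boxed(r"So the result is \boxed{\frac{3}{4}}.")
truncated = extract_last_boxed(r"Therefore \boxed{12")
```

The truncated case returns `None`, which mirrors how a rule-based extractor can fail outright on incomplete output.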

TriviaQA Open-ended QA

Avoids model bias issues associated with LLM-as-a-judge approaches in open-ended tasks.

Related Work

LLM Evaluation Frameworks

Existing evaluation tools such as lm-evaluation-harness, HELM, and OpenCompass primarily rely on:

  1. Probability-based evaluation for multiple-choice questions
  2. Simple heuristic post-processing for generation tasks

Prompt Sensitivity Research

Prior research has focused on prompt variations at the input level affecting performance, but lacks systematic study of answer extraction at the output level.

Reasoning Model Evaluation

The emergence of reasoning methods like Chain-of-Thought has created new challenges for traditional evaluation approaches.

Conclusions and Discussion

Main Conclusions

  1. Answer extraction methods have a decisive impact on reasoning model evaluation, with accuracy differences exceeding 10 percentage points (e.g., 75.8% to 87.1% for Qwen3-32B)
  2. Answer Regeneration provides a more robust evaluation scheme, outperforming manual rules across multiple tasks
  3. Evaluation fairness is improved, with model ranking more aligned with intuitive expectations

Limitations

  1. Computational Cost: Requires an additional inference step per sample, increasing evaluation overhead
  2. Limited Technical Innovation: The method itself is relatively simple, lacking technical depth
  3. Model Scope: Primarily tests open-source models; performance on commercial models remains to be verified

Future Directions

  1. Self-consistency Integration: Further improvements by combining techniques like self-consistency
  2. Commercial Model Evaluation: Extension to commercial models such as GPT, Gemini, and Claude
  3. Efficiency Optimization: Exploring methods to reduce computational overhead

In-depth Evaluation

Strengths

1. Importance of Problem Identification

First systematic revelation of answer extraction as an overlooked yet critical problem in reasoning model evaluation, with significant implications for the field.

2. Practical Utility of the Method

The proposed framework is simple and effective, easy to implement and deploy, with strong practical value.

3. Comprehensiveness of Experiments

  • Comprehensive evaluation across multiple models and task types
  • Detailed ablation studies and human verification
  • Sufficient comparison with existing methods

4. Convincingness of Results

Extensive experiments demonstrate method effectiveness with statistically significant results.

Weaknesses

1. Limited Technical Innovation

The method itself is relatively simple, representing primarily an engineering improvement rather than deep technical innovation.

2. Computational Overhead Issues

The additional inference step increases evaluation cost and may become a bottleneck in large-scale evaluation.

3. Insufficient Theoretical Analysis

Lacks theoretical explanation for method effectiveness, relying primarily on experimental verification.

4. Model Dependency

The quality of regeneration still depends on the model's own capabilities, potentially introducing model bias.

Impact

Academic Contribution

  • Fills a gap in reasoning model evaluation methodology
  • Provides important reference for future evaluation framework design
  • Promotes attention to evaluation fairness and reproducibility

Practical Value

  • Directly applicable to improvements in existing evaluation frameworks
  • Provides model developers with more reliable performance benchmarks
  • Helps increase credibility of evaluation results

Reproducibility

The paper documents its implementation details and the regular expressions used for extraction, which facilitates reproduction and application.

Applicable Scenarios

Suitable Application Scenarios

  1. Reasoning Model Evaluation: Particularly suitable for models requiring reasoning processes like CoT
  2. Multi-task Benchmark Testing: Application on standard benchmarks such as MMLU and GSM8K
  3. Model Comparison Research: When fair comparison of different reasoning models is needed

Limiting Conditions

  1. Sufficient Computational Resources: Must bear the cost of the additional inference step
  2. High Evaluation Accuracy Requirements: Suitable for scenarios with high quality standards
  3. Reasoning Model Specific: Primarily targets models with reasoning capabilities

References

  1. Hendrycks et al. (2021). Measuring massive multitask language understanding. ICLR.
  2. Wei et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. NeurIPS.
  3. Liang et al. (2023). Holistic evaluation of language models. arXiv.
  4. Wang et al. (2024). MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. NeurIPS.

Summary: While this paper is relatively simple in technical innovation, it identifies and addresses an important problem in reasoning model evaluation. The proposed Answer Regeneration framework provides a practical solution for fair and robust evaluation of reasoning models, with significant implications for standardization and reproducibility in the field. Despite limitations such as computational overhead, its practical value makes it a useful contribution to evaluation methodology.