Finding Answers in Thought Matters: Revisiting Evaluation on Large Language Models with Reasoning

Jo, Lee, Lee et al.

Evaluating generative models, such as large language models (LLMs), commonly involves question-answering tasks where the final answer is selected based on the probability of the answer choices. For models requiring reasoning, on the other hand, the method of answer extraction plays a critical role. Our research reveals that the performance of reasoning models and their final answer distributions are highly sensitive to the answer extraction algorithm employed. To mitigate this, we propose a basic framework: Answer Regeneration. The method uses an additional model inference, providing the prior input and output prefaced by the prompt "Answer:". The final answer is then selected or extracted from the regenerated output. We show that this extraction-rule-agnostic approach exhibits improved performance and enhanced robustness. Furthermore, we have applied this framework to general math problems and open-ended question answering tasks. Our analysis and this framework could offer more reliable results for model evaluation.

Basic Information

Paper ID: 2510.14773
Title: Finding Answers in Thought Matters: Revisiting Evaluation on Large Language Models with Reasoning
Authors: Hwiyeol Jo, Joosung Lee, Jaehong Lee, Sang-Woo Lee, Joonsuk Park, Kang Min Yoo
Classification: cs.CL cs.AI
Publication Date: October 16, 2025
Paper Link: https://arxiv.org/abs/2510.14773

Abstract

This paper investigates a critical issue in evaluating the reasoning capabilities of large language models (LLMs): the significant impact of answer extraction methods on model performance assessment. The research reveals that the performance of reasoning models and the distribution of their final answers depend heavily on the answer extraction algorithm used. To address this problem, the authors propose an "Answer Regeneration" framework: one additional model inference step, prompted with an "Answer:" prefix, regenerates the final answer so that evaluation no longer depends on the extraction rule.

Research Background and Motivation

Core Problem

Traditional LLM evaluation typically selects the final answer by the probability the model assigns to each answer choice. For models that reason before answering, the answer must instead be extracted from free-form output, so the extraction method becomes critical. Existing rule-based extraction methods suffer from the following issues:

Format Diversity: Reasoning model outputs vary widely in format, and a single extraction rule cannot cover all cases
Inter-model Differences: Different models use different answer formats, requiring customized extraction rules for each model
Evaluation Inconsistency: The same model output may receive completely different evaluation results depending on the extraction rule used

Research Motivation

Reproducibility Issues: Discrepancies between publicly reported performance and reproduction results may stem from undisclosed answer extraction methods
Evaluation Fairness: Rule-based methods may introduce bias toward certain models
Special Characteristics of Reasoning Models: The complexity of Chain-of-Thought (CoT) reasoning outputs renders traditional evaluation methods inadequate

Core Contributions

First systematic study of how sensitive reasoning model evaluation is to the answer extraction method, revealing this overlooked but critical problem
Proposes the Answer Regeneration framework, achieving robust evaluation independent of extraction rules
Demonstrates the generalizability of the method, achieving improvements across multiple task types including multiple-choice questions, mathematical problems, and open-ended question answering
Provides more reliable model ranking, making evaluation results more intuitive (e.g., larger models outperforming smaller ones)

Methodology Details

Task Definition

Given the output of a reasoning model (containing the complete reasoning process), the task is to accurately extract its final answer for evaluation. Traditional methods rely on manually crafted regular expression rules, while this paper proposes a generative solution.

Answer Regeneration Framework

Overall Architecture

Original Input + Reasoning Output + "Answer:" → Additional Model Inference → Simplified Final Answer

Core Steps

Input Preparation: Combine the original question, model's reasoning process, and "Answer:" prompt
Re-inference: Run the model (in non-reasoning mode) for one additional inference step
Answer Extraction: Extract the final answer from the simplified output
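
The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the newline-joined prompt layout and the `generate` callable (a hypothetical stand-in for whatever backend serves the model in non-reasoning mode) are not specified in the source. Only the "Answer:" cue comes from the framework description.

```python
from typing import Callable

ANSWER_CUE = "Answer:"  # cue string named in the framework description


def build_regeneration_prompt(question: str, reasoning_output: str) -> str:
    """Join the original input, the first-pass output, and the cue.

    The newline-separated layout is an assumption made for this sketch.
    """
    return f"{question.rstrip()}\n{reasoning_output.rstrip()}\n{ANSWER_CUE}"


def regenerate_answer(
    question: str,
    reasoning_output: str,
    generate: Callable[[str], str],
) -> str:
    """Run one extra inference pass and return its short, stripped output.

    `generate` is hypothetical: any function mapping a prompt to a completion.
    """
    prompt = build_regeneration_prompt(question, reasoning_output)
    return generate(prompt).strip()


def fake_generate(prompt: str) -> str:
    """Toy backend so the sketch runs without a model."""
    return " B" if prompt.endswith(ANSWER_CUE) else ""


question = "Which planet is closest to the Sun?\n(A) Venus (B) Mercury (C) Mars (D) Earth"
reasoning = "Mercury orbits nearest to the Sun, so option B looks right."
print(regenerate_answer(question, reasoning, fake_generate))  # prints "B"
```

Because the second pass only has to complete a short cue, its output is typically a bare option letter or value, which is what makes the final extraction step nearly trivial.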

Technical Advantages

Probability-based Foundation: For multiple-choice questions, probability-based answer selection can be used
Output Simplification: Generated answers have a more concise format, facilitating extraction
Rule-independent: Does not rely on complex manual rules
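
To illustrate the probability-based option for multiple-choice questions, the sketch below picks the choice with the highest log-probability after the "Answer:" cue. The log-probability values are made up for illustration, and how they are obtained from a model is left out; the function name `select_choice` is hypothetical.

```python
import math
from typing import Dict, Tuple


def select_choice(choice_logprobs: Dict[str, float]) -> Tuple[str, Dict[str, float]]:
    """Return the most likely choice and probabilities renormalized over the choices."""
    if not choice_logprobs:
        raise ValueError("at least one choice is required")
    # Subtract the maximum before exponentiating for numerical stability.
    max_lp = max(choice_logprobs.values())
    weights = {c: math.exp(lp - max_lp) for c, lp in choice_logprobs.items()}
    total = sum(weights.values())
    probs = {c: w / total for c, w in weights.items()}
    best = max(probs, key=lambda c: probs[c])
    return best, probs


# Toy log-probabilities for the token following "Answer:" (illustrative values only).
toy_logprobs = {"A": -3.2, "B": -0.4, "C": -2.9, "D": -4.1}
best_choice, choice_probs = select_choice(toy_logprobs)
print(best_choice)
```

Selecting by probability sidesteps string parsing entirely, which is why this option is only available when the answer space is a fixed set of choices.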

Technical Innovations

1. Generative Answer Extraction

Unlike traditional pattern matching, this approach leverages the model's own generative capabilities to "rephrase" the final answer, avoiding the complexity of format parsing.

2. Reasoning-Generation Separation

Separates the reasoning process from answer generation, with the reasoning phase focusing on the thought process and the generation phase focusing on answer output.

3. Adaptability

The framework automatically adapts to different task types and answer formats without requiring tuning for specific models or tasks.

Experimental Setup

Datasets

MMLU: Multi-domain multiple-choice knowledge test, serving as the primary evaluation benchmark
MMLU-Pro: More challenging multiple-choice benchmark in which the number of options varies by question
GSM8K: Mathematical reasoning problems with short answer format
TriviaQA: Open-ended question answering task

Evaluation Models

Qwen3 Series: Qwen3-32B, Qwen3-14B, Qwen3-8B
DeepSeek-R1 Series: R1-Distill-Llama-8B, R1-Qwen3-8B

Comparison Methods

strict-match: Exact string matching ("answer is X")
flexible-extract: Flexible option extraction (searching for (A), (B), etc.)
instructed-format: Guided format output
answer-is-correct: Optimized strict matching
last-extract: Extract the last uppercase letter
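
The sketch below shows how three of these rule styles can disagree on a single output. The regular expressions are illustrative approximations written for this example; they are not the patterns used in the paper or in lm-evaluation-harness, and restricting the option set to A through D is an assumption.

```python
import re
from typing import Optional


def strict_match(text: str) -> Optional[str]:
    """Return the letter that follows the literal phrase 'answer is'."""
    m = re.search(r"answer is \(?([A-D])\b", text)
    return m.group(1) if m else None


def flexible_extract(text: str) -> Optional[str]:
    """Return the first parenthesized option such as (A) or (B)."""
    found = re.findall(r"\(([A-D])\)", text)
    return found[0] if found else None


def last_extract(text: str) -> Optional[str]:
    """Return the last standalone uppercase option letter."""
    found = re.findall(r"\b([A-D])\b", text)
    return found[-1] if found else None


output = "Option (A) looks plausible at first. However, the answer is C. I ruled out B."
answers = {
    "strict-match": strict_match(output),
    "flexible-extract": flexible_extract(output),
    "last-extract": last_extract(output),
}
print(answers)
```

On this toy output the three rules return three different letters, which is the kind of inconsistency the comparison is designed to expose.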

Implementation Details

Uses lm-evaluation-harness toolkit
Temperature set to 0.6, top-p to 0.95, top-k to 20
Maximum generation length limited to 4096 tokens

Experimental Results

Main Results

Significant Performance Fluctuations

Different extraction methods lead to substantial performance variations:

Qwen3-32B accuracy range across different methods: 75.8% - 87.1%
Model ranking can be completely altered by extraction method

Clear Advantages of Answer Regeneration

Answer Regeneration achieves the best performance across all tested models:

Model	Best Rule-based Method	Answer Regeneration	Improvement (percentage points)
Qwen3-32B	82.1%	87.1%	+5.0
Qwen3-14B	83.8%	85.0%	+1.2
Qwen3-8B	82.1%	83.3%	+1.2
R1-Llama-8B	64.8%	68.8%	+4.0
R1-Qwen3-8B	77.6%	80.7%	+3.1
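
As a quick arithmetic check, the snippet below recomputes the improvement column from the two accuracy columns and the spread of the Qwen3-32B range quoted earlier. All numbers are copied from this section; the mean gain is a derived convenience figure and is not reported in the source.

```python
# (best rule-based accuracy, Answer Regeneration accuracy), in percent
results = {
    "Qwen3-32B": (82.1, 87.1),
    "Qwen3-14B": (83.8, 85.0),
    "Qwen3-8B": (82.1, 83.3),
    "R1-Llama-8B": (64.8, 68.8),
    "R1-Qwen3-8B": (77.6, 80.7),
}

# Improvement in percentage points, rounded to one decimal to hide float noise.
gains = {model: round(regen - rule, 1) for model, (rule, regen) in results.items()}
mean_gain = round(sum(gains.values()) / len(gains), 1)

# Spread between the lowest and highest Qwen3-32B accuracy across methods.
spread = round(87.1 - 75.8, 1)

for model, gain in gains.items():
    print(f"{model}: +{gain} pp")
print(f"mean gain: +{mean_gain} pp, Qwen3-32B spread: {spread} pp")
```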

Ablation Studies

Answer Inconsistency Analysis

The same model output may be parsed as different answers by different extraction methods:

Some methods extract answers from the reasoning process
Some methods extract formatted final answers
Some methods fail due to formatting issues

Incomplete Reasoning Handling

Answer Regeneration performs better when handling incomplete reasoning outputs:

Traditional methods tend to fail when reasoning is truncated
Regeneration method can provide answers based on available information
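
A rough sketch of why truncation hurts rule-based extraction but not regeneration: when generation stops mid-thought, there is no final statement for a pattern to match, yet a regeneration prompt can still be built from the partial text. The `<think>` delimiters, the step of closing an unterminated block, and the helper names are all assumptions for this example and are not taken from the paper.

```python
import re
from typing import Optional

THINK_OPEN = "<think>"    # assumption: reasoning is wrapped in these tags
THINK_CLOSE = "</think>"


def is_truncated(output: str) -> bool:
    """Heuristic: a reasoning block was opened but never closed."""
    return THINK_OPEN in output and THINK_CLOSE not in output


def rule_based_extract(output: str) -> Optional[str]:
    """Illustrative strict rule: needs the literal phrase 'answer is X'."""
    m = re.search(r"answer is \(?([A-D])\b", output)
    return m.group(1) if m else None


def regeneration_prompt(question: str, output: str) -> str:
    """Build the second-pass prompt even when the first pass was cut off."""
    body = output.rstrip()
    if is_truncated(body):
        # Assumption: close the open block so the model stops reasoning.
        body = f"{body}\n{THINK_CLOSE}"
    return f"{question}\n{body}\nAnswer:"


truncated = "<think>Option B matches the definition, but let me check C as well. C says"
print(rule_based_extract(truncated))  # None: nothing for the rule to match
print(regeneration_prompt("Which option is correct?", truncated))
```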

Human Evaluation Verification

In human evaluation of 300 samples:

Answer Regeneration consistency with human annotation: 84.2%
Best rule-based method consistency with human annotation: 61.7%
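
The consistency figures above are agreement rates between an extractor's answer and the human label. A minimal sketch of that metric follows; the five-sample data is a toy example and does not reproduce the paper's 300-sample study.

```python
from typing import Optional, Sequence


def agreement_rate(extracted: Sequence[Optional[str]], human: Sequence[str]) -> float:
    """Fraction of samples where the extracted answer equals the human label.

    A failed extraction (None) counts as a disagreement.
    """
    if len(extracted) != len(human):
        raise ValueError("sequences must have the same length")
    if not human:
        raise ValueError("at least one sample is required")
    matches = sum(e == h for e, h in zip(extracted, human))
    return matches / len(human)


# Toy labels, for illustration only.
human_labels = ["A", "C", "B", "D", "B"]
regenerated = ["A", "C", "B", "D", "A"]
rule_based = ["A", None, "B", "A", "A"]

print(agreement_rate(regenerated, human_labels))
print(agreement_rate(rule_based, human_labels))
```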

Cross-task Generalization

MMLU-Pro Results

Answer Regeneration maintains its advantage on more complex benchmarks and approaches official reported performance.

GSM8K Mathematical Reasoning

Answer Regeneration also performs best on mathematical tasks:

More robust handling of LaTeX format (\boxed{})
Human evaluation shows 16.3% vs 6.1% correct rate difference
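
To show why `\boxed{}` is awkward for pattern-based rules, the sketch below extracts the last boxed expression by counting braces, since a simple non-greedy regular expression stops at the first closing brace of nested LaTeX. The function is written for this example and is not the extraction rule used in the paper.

```python
from typing import Optional

BOXED = "\\boxed{"


def last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...}, honoring nested braces."""
    start = text.rfind(BOXED)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(BOXED):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # braces never balanced, e.g. truncated output


print(last_boxed(r"Adding the parts gives \boxed{\frac{3}{4}} of the total."))
print(last_boxed(r"So the result is \boxed{18"))
```

The second call returns None because the output was cut off before the closing brace, which is one of the failure cases a regenerated short answer avoids.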

TriviaQA Open-ended QA

Avoids model bias issues associated with LLM-as-a-judge approaches in open-ended tasks.
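
Once the regenerated answer is a short string, open-ended questions can be scored by string comparison instead of a judge model. The sketch below uses a common normalize-then-match scheme against a list of accepted aliases; this scoring rule is an assumption for illustration and may differ from what the paper uses.

```python
import re
import string
from typing import Iterable


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation and English articles, collapse spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def matches_any_alias(prediction: str, aliases: Iterable[str]) -> bool:
    """True if the normalized prediction equals any normalized alias."""
    pred = normalize(prediction)
    return any(pred == normalize(alias) for alias in aliases)


print(matches_any_alias("The Beatles.", ["Beatles", "The Beatles"]))
print(matches_any_alias("Rolling Stones", ["The Beatles"]))
```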

Related Work

LLM Evaluation Frameworks

Existing evaluation tools such as lm-evaluation-harness, HELM, and OpenCompass primarily rely on:

Probability-based evaluation for multiple-choice questions
Simple heuristic post-processing for generation tasks

Prompt Sensitivity Research

Prior research has focused on prompt variations at the input level affecting performance, but lacks systematic study of answer extraction at the output level.

Reasoning Model Evaluation

The emergence of reasoning methods like Chain-of-Thought has created new challenges for traditional evaluation approaches.

Conclusions and Discussion

Main Conclusions

Answer extraction methods have a decisive impact on reasoning model evaluation, with accuracy differences exceeding 10 percentage points (75.8% to 87.1% for Qwen3-32B)
Answer Regeneration provides a more robust evaluation scheme, outperforming manual rules across multiple tasks
Evaluation fairness is improved, with model ranking more aligned with intuitive expectations

Limitations

Computational Cost: Requires an additional inference pass per sample, increasing evaluation overhead
Limited Technical Innovation: The method itself is relatively simple, lacking technical depth
Model Scope: Primarily tests open-source models; performance on commercial models remains to be verified

Future Directions

Self-consistency Integration: Further improvements by combining techniques like self-consistency
Commercial Model Evaluation: Extension to commercial models such as GPT, Gemini, and Claude
Efficiency Optimization: Exploring methods to reduce computational overhead
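
For the self-consistency direction, one plausible combination is to regenerate an answer for each of several sampled reasoning traces and take a majority vote. The sketch below shows only the voting step; it is a speculative illustration of the idea, not something the paper evaluates, and the sample answers are made up.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[Optional[str]]) -> Optional[str]:
    """Return the most frequent non-empty answer, or None if there is none.

    Ties go to the answer that appeared first among the tied candidates.
    """
    valid = [a for a in answers if a]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]


# Toy regenerated answers from five sampled reasoning traces.
sampled_answers = ["B", "C", "B", None, "B"]
print(majority_vote(sampled_answers))
```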

In-depth Evaluation

Strengths

1. Importance of Problem Identification

The paper is the first to show systematically that answer extraction is an overlooked yet critical problem in reasoning model evaluation, with significant implications for the field.

2. Practical Utility of the Method

The proposed framework is simple and effective, easy to implement and deploy, with strong practical value.

3. Comprehensiveness of Experiments

Comprehensive evaluation across multiple models and task types
Detailed ablation studies and human verification
Sufficient comparison with existing methods

4. Persuasive Results

Extensive experiments demonstrate method effectiveness with statistically significant results.

Weaknesses

1. Limited Technical Innovation

The method itself is relatively simple, representing primarily an engineering improvement rather than deep technical innovation.

2. Computational Overhead Issues

The additional inference pass increases evaluation cost and could become a bottleneck in large-scale evaluation.

3. Insufficient Theoretical Analysis

Lacks theoretical explanation for method effectiveness, relying primarily on experimental verification.

4. Model Dependency

The quality of regeneration still depends on the model's own capabilities, potentially introducing model bias.

Impact

Academic Contribution

Fills a gap in reasoning model evaluation methodology
Provides important reference for future evaluation framework design
Promotes attention to evaluation fairness and reproducibility

Practical Value

Directly applicable to improvements in existing evaluation frameworks
Provides model developers with more reliable performance benchmarks
Helps increase credibility of evaluation results

Reproducibility

The paper documents its implementation details and the regular expressions used, which facilitates reproduction and application.

Applicable Scenarios

Suitable Application Scenarios

Reasoning Model Evaluation: Particularly suitable for models requiring reasoning processes like CoT
Multi-task Benchmark Testing: Application on standard benchmarks such as MMLU and GSM8K
Model Comparison Research: When fair comparison of different reasoning models is needed

Limiting Conditions

Sufficient Computational Resources: Must bear the cost of the additional inference pass
High Evaluation Accuracy Requirements: Suitable for scenarios with high quality standards
Reasoning Model Specific: Primarily targets models with reasoning capabilities

References

Hendrycks et al. (2021). Measuring massive multitask language understanding. ICLR.
Wei et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. NeurIPS.
Liang et al. (2023). Holistic evaluation of language models. arXiv.
Wang et al. (2024). MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. NeurIPS.

Summary: While this paper is relatively simple in technical innovation, it identifies and addresses an important problem in reasoning model evaluation. The proposed Answer Regeneration framework provides a practical solution for fair and robust evaluation of reasoning models, with significant implications for standardization and reproducibility in the field. Despite limitations such as computational overhead, its practical value and contribution to evaluation methodology make it a valuable research contribution.