The Price of a Second Thought: On the Evaluation of Reasoning Efficiency in Large Language Models
Fan, Qin, Han et al.
Recent thinking models trained with reinforcement learning and backward-checking CoT often suffer from overthinking: they produce excessively long outputs even on simple problems, wasting computation. Existing evaluations, based on token efficiency, give an incomplete view as they neglect problem difficulty and intermediate computation costs. We formalize reasoning efficiency as a relative measure between thinking and instruct models, treating instruct models as the minimal-effort baseline. A systematic study across four thinking models and multiple benchmarks reveals two consistent patterns: (i) instruct models achieve higher efficiency overall, and (ii) problem difficulty affects efficiency, with thinking models wasting computation on easy problems but providing value on harder ones. Building on this insight, we propose COTHINK, a simple two-stage pipeline: an instruct model drafts a brief outline, and a thinking model expands it. On GSM8K, MATH500, and AIME24, COTHINK cuts token usage by 21.1% while keeping accuracy on four thinking models, and remains competitive with strong efficiency baselines.
Recent thinking models trained with reinforcement learning and backward-checking chain-of-thought (CoT) suffer from over-thinking: they produce excessively long outputs even on simple problems, wasting computational resources. Existing evaluation methods based on token efficiency give an incomplete picture because they overlook problem difficulty and intermediate computational costs. The paper formalizes reasoning efficiency as a relative metric between thinking models and instruction models, treating instruction models as the minimal-effort baseline. A systematic study of four thinking models across multiple benchmarks reveals two consistent patterns: (i) instruction models achieve higher efficiency overall, and (ii) problem difficulty affects efficiency, with thinking models wasting computation on simple problems but providing value on difficult ones. Building on these insights, the paper proposes COTHINK, a simple two-stage pipeline in which an instruction model drafts a brief outline and a thinking model expands it. On GSM8K, MATH500, and AIME24, COTHINK reduces token usage by 21.1% across four thinking models while maintaining accuracy.
Over-thinking Problem: Recent thinking models excel at mathematical reasoning tasks but suffer from severe over-thinking. These models produce 5-10 times longer outputs than standard instruction-tuned models even on simple problems.
Evaluation Limitations: Existing reasoning efficiency evaluation methods have two main issues:
They ignore the relative nature of over-thinking and under-thinking, phenomena only observable through comparative analysis
They overlook intermediate computational costs, such as generating multiple candidate solutions in best-of-N sampling
Computational Resource Waste: On the AIME2024 benchmark, average output length rises from 770 tokens for Qwen2.5-32B-Instruct to 6,067 tokens for QwQ, roughly a 7.9-fold increase and a significant source of computational waste.
Existing evaluation methods rely on single-model token efficiency, τ(M, D) = Q(D) / C_M(D), but this absolute metric cannot reflect relative reasoning efficiency. This paper argues for a relative efficiency framework to better assess thinking model performance.
Proposes Relative Reasoning Efficiency Evaluation Framework: Defines reasoning efficiency as a relative metric between a thinking model M_R and an instruction model M_I: η(M_R, M_I) = τ(M_R, D) / τ(M_I, D)
Identifies Two Key Patterns:
Instruction models show higher token efficiency overall
Problem difficulty strongly affects efficiency, with thinking models over-computing on simple problems but providing value on difficult ones
Proposes COTHINK Two-Stage Collaborative Pipeline: Combines the conciseness of instruction models with the verification capabilities of thinking models
Achieves Significant Efficiency Gains: Reduces token usage by 21.1% on average across three mathematical benchmarks while improving accuracy by 1.66%
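The relative metric in the first contribution can be illustrated with a minimal sketch. It assumes that Q is accuracy and that the cost term is the total number of generated tokens, including intermediate candidates; the summary does not spell out either definition, so both are assumptions. The `RunStats` container and the function names are hypothetical, and the accuracy values are placeholders rather than results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunStats:
    """Aggregate results of one model on one dataset (hypothetical container)."""
    accuracy: float    # assumed meaning of Q: fraction of problems solved, in [0, 1]
    total_tokens: int  # assumed meaning of C_M: every generated token, including
                       # intermediate candidates such as best-of-N samples


def token_efficiency(stats: RunStats) -> float:
    """tau(M, D) = Q / C_M: accuracy obtained per generated token."""
    if stats.total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return stats.accuracy / stats.total_tokens


def relative_efficiency(thinking: RunStats, instruct: RunStats) -> float:
    """eta(M_R, M_I) = tau(M_R, D) / tau(M_I, D)."""
    baseline = token_efficiency(instruct)
    if baseline == 0:
        raise ValueError("baseline efficiency is zero, so the ratio is undefined")
    return token_efficiency(thinking) / baseline


# Token counts are the AIME2024 averages quoted in this summary; the
# accuracies are made-up placeholders, not results from the paper.
instruct = RunStats(accuracy=0.2, total_tokens=770)
thinking = RunStats(accuracy=0.5, total_tokens=6067)
eta = relative_efficiency(thinking, instruct)
print(f"eta = {eta:.3f}")
```

Under these placeholder numbers the thinking model is 2.5 times as accurate but spends about 7.9 times as many tokens, so η falls below 1: it buys less accuracy per token than the instruction baseline.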
This paper investigates computational efficiency in mathematical reasoning tasks: the input is a mathematical problem and the output is a worked solution with a final answer. The objective is to minimize computational cost while maintaining accuracy.
The instruction model generates a concise outline of 2-4 high-level reasoning steps without specific calculations or final answers.
System Prompt:
You are a reasoning strategist.
Your job is to break down a complex problem into 2–4 high-level reasoning steps.
Focus only on outlining the general approach or strategy.
Do not include any numbers, formulas, or final answers.
The thinking model then verifies the outline and completes the solution from it, using fewer tokens than it would when reasoning from scratch.
User Prompt:
Use only the following steps to solve the problem. Do not change or add steps.
Show the work for each step briefly, and place the final answer in \boxed{}.
Problem: {problem}
Steps: {outline generated by instruct model}
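The two stages can be wired together as in the sketch below, which is an illustration rather than the authors' implementation. The prompt text is copied from the summary above. The `Generate` callable interface, the function names, the empty system prompt passed to the thinking model, and the stub models are all assumptions introduced here so that the example runs without a model backend.

```python
from typing import Callable, Tuple

# Hypothetical interface: (system_prompt, user_prompt) -> generated text.
Generate = Callable[[str, str], str]

# Stage 1 system prompt, as quoted in the summary.
OUTLINE_SYSTEM_PROMPT = (
    "You are a reasoning strategist.\n"
    "Your job is to break down a complex problem into 2–4 high-level reasoning steps.\n"
    "Focus only on outlining the general approach or strategy.\n"
    "Do not include any numbers, formulas, or final answers."
)

# Stage 2 user prompt; doubled braces keep a literal \boxed{} after formatting.
SOLVE_USER_TEMPLATE = (
    "Use only the following steps to solve the problem. Do not change or add steps.\n"
    "Show the work for each step briefly, and place the final answer in \\boxed{{}}.\n"
    "Problem: {problem}\n"
    "Steps: {outline}"
)


def build_solve_prompt(problem: str, outline: str) -> str:
    """Fill the stage 2 template with the problem and the drafted outline."""
    return SOLVE_USER_TEMPLATE.format(problem=problem, outline=outline)


def cothink(problem: str, instruct: Generate, thinking: Generate) -> Tuple[str, str]:
    """Stage 1: the instruct model drafts an outline. Stage 2: the thinking model expands it."""
    outline = instruct(OUTLINE_SYSTEM_PROMPT, problem)
    # Assumption: the thinking model receives no system prompt.
    solution = thinking("", build_solve_prompt(problem, outline))
    return outline, solution


# Stub models (placeholders, not real model calls).
def stub_instruct(system_prompt: str, user_prompt: str) -> str:
    return "1. Set up the relation. 2. Solve for the unknown."


def stub_thinking(system_prompt: str, user_prompt: str) -> str:
    return "Step 1 ... Step 2 ... \\boxed{4}"


outline, solution = cothink("What is 2 + 2?", stub_instruct, stub_thinking)
```

In practice the two stubs would be replaced by calls to an instruction-tuned model and a thinking model. Because the pipeline only composes prompts, it needs no change to either model.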
Dynamic Difficulty Adaptation: Without pre-assessing problem difficulty, thinking models can dynamically adjust verification effort based on outline quality
Complementary Advantages Integration: On simple tasks the outline is typically correct and the thinking model converges quickly; on difficult tasks the outline gives the thinking model a structured starting point
Deployment-Friendly: Requires no architectural modifications and can be directly applied to existing models
Observation 1: Instruction models show higher token efficiency; most thinking models have η < 1
Observation 2: Problem difficulty affects reasoning efficiency: thinking models waste computation on simple problems but provide value on complex tasks
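One way to examine the second observation is to compute the relative efficiency separately for each difficulty bucket. The sketch below assumes per-problem records of the form (bucket, correct, tokens) and the same problems for both models. The record format, the function names, and the bucket labels are hypothetical, and the data are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (difficulty bucket, answered correctly, tokens generated)
Record = Tuple[str, bool, int]


def efficiency_by_bucket(records: Iterable[Record]) -> Dict[str, float]:
    """Correct answers per generated token within each bucket.

    This equals accuracy divided by mean tokens per problem, so the
    bucket size cancels when two models are compared on the same problems.
    """
    correct: Dict[str, int] = defaultdict(int)
    tokens: Dict[str, int] = defaultdict(int)
    for bucket, is_correct, n_tokens in records:
        correct[bucket] += int(is_correct)
        tokens[bucket] += n_tokens
    return {b: correct[b] / tokens[b] for b in tokens if tokens[b] > 0}


def relative_by_bucket(
    thinking: Iterable[Record], instruct: Iterable[Record]
) -> Dict[str, float]:
    """Thinking-model efficiency over instruct-model efficiency, per bucket."""
    tau_r = efficiency_by_bucket(thinking)
    tau_i = efficiency_by_bucket(instruct)
    # Buckets where the baseline solved nothing are skipped (ratio undefined).
    return {b: tau_r[b] / tau_i[b] for b in tau_r if tau_i.get(b, 0) > 0}


# Invented records, not data from the paper.
instruct_runs = [("easy", True, 100), ("easy", True, 100),
                 ("hard", True, 500), ("hard", False, 500)]
thinking_runs = [("easy", True, 800), ("easy", True, 800),
                 ("hard", True, 5000), ("hard", True, 5000)]

eta = relative_by_bucket(thinking_runs, instruct_runs)
for bucket in sorted(eta):
    print(bucket, round(eta[bucket], 3))
```

With these invented numbers the ratio is higher on the hard bucket than on the easy one, which is the direction Observation 2 describes; the values themselves do not reproduce anything reported in the paper.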
The paper cites key prior work on reasoning efficiency, thinking models, and hybrid reasoning, which provides a solid foundation and points of comparison.
Overall Assessment: This is a high-quality paper that makes significant contributions to the evaluation and optimization of reasoning efficiency. Its relative efficiency framework and the COTHINK collaborative pipeline offer an effective response to the over-thinking problem in thinking models. Despite some limitations, its novelty and practical utility make it a valuable contribution to the field.