2025-11-14T16:10:11.389071

The Price of a Second Thought: On the Evaluation of Reasoning Efficiency in Large Language Models

Fan, Qin, Han et al.

Recent thinking models trained with reinforcement learning and backward-checking CoT often suffer from overthinking: they produce excessively long outputs even on simple problems, wasting computation. Existing evaluations, based on token efficiency, give an incomplete view as they neglect problem difficulty and intermediate computation costs. We formalize reasoning efficiency as a relative measure between thinking and instruct models, treating instruct models as the minimal-effort baseline. A systematic study across four thinking models and multiple benchmarks reveals two consistent patterns: (i) instruct models achieve higher efficiency overall, and (ii) problem difficulty affects efficiency, with thinking models wasting computation on easy problems but providing value on harder ones. Building on this insight, we propose COTHINK, a simple two-stage pipeline: an instruct model drafts a brief outline, and a thinking model expands it. On GSM8K, MATH500, and AIME24, COTHINK cuts token usage by 21.1% while keeping accuracy on four thinking models, and remains competitive with strong efficiency baselines.

academic

The Price of a Second Thought: On the Evaluation of Reasoning Efficiency in Large Language Models

基本信息

论文ID: 2505.22017
标题: The Price of a Second Thought: On the Evaluation of Reasoning Efficiency in Large Language Models
作者: Siqi Fan, Bowen Qin, Peng Han, Shuo Shang, Yequan Wang, Aixin Sun
分类: cs.CL (Computation and Language)
发表时间: 2025年10月14日 (arXiv v2)
论文链接: https://arxiv.org/abs/2505.22017

摘要

近期使用强化学习和反向检查链式思维(CoT)训练的思考模型存在过度思考问题：即使在简单问题上也会产生过长的输出，浪费计算资源。现有基于token效率的评估方法提供了不完整的视角，忽略了问题难度和中间计算成本。本文将推理效率形式化为思考模型与指令模型之间的相对度量，将指令模型视为最小努力基线。通过对四个思考模型和多个基准的系统研究，揭示了两个一致模式：(i)指令模型总体上实现了更高的效率，(ii)问题难度影响效率，思考模型在简单问题上浪费计算，但在困难问题上提供价值。基于这一洞察，提出了COTHINK——一个简单的两阶段管道：指令模型起草简要大纲，思考模型进行扩展。在GSM8K、MATH500和AIME24上，COTHINK在四个思考模型上减少21.1%的token使用量同时保持准确性。

研究背景与动机

问题定义

过度思考问题：近期的思考模型(thinking models)在数学推理任务中表现出色，但存在严重的过度思考问题。这些模型即使在简单问题上也会产生5-10倍于标准指令调优模型的输出长度。
评估局限性：现有的推理效率评估方法存在两个主要问题：
- 忽略了过度思考和思考不足的相对概念，这些现象只能通过比较分析观察到
- 忽略了中间计算成本，如best-of-N采样中生成多个候选解的成本
计算资源浪费：思考模型在AIME2024基准上的平均输出长度从Qwen2.5-32B-Instruct的770个token增加到QwQ的6,067个token，造成显著的计算资源浪费。

研究动机

现有评估方法基于单一模型的token效率τ(M,D) = Q(D)/CM(D)，但这种绝对度量无法反映推理的相对效率。本文认为需要一个相对效率框架来更好地评估思考模型的性能。

核心贡献

提出相对推理效率评估框架：将推理效率定义为思考模型相对于指令模型的相对度量η(MR,MI) = τ(MR,D)/τ(MI,D)
发现两个关键模式：
- 指令模型总体上显示更高的token效率
- 问题难度强烈影响效率，思考模型在简单问题上过度计算但在困难问题上提供价值
提出COTHINK两阶段协作管道：结合指令模型的简洁性和思考模型的验证能力
实现显著的效率提升：在三个数学基准上平均减少21.1%的token使用量，同时提高1.66%的准确率

方法详解

任务定义

本文研究数学推理任务中的计算效率问题，输入为数学问题，输出为解答过程和最终答案。约束条件是在保持准确性的前提下最小化计算成本。

相对效率评估框架

核心公式

相对推理效率定义为：

η(MR,MI) = τ(MR,D) / τ(MI,D)

其中τ(M,D) = Q(D)/CM(D)是传统的token效率。

效率缩放律假设

基于测试时缩放律Q(C) ∝ C^β (β < 1)，推理效率可近似为：

η ≈ (CR/CI)^β

COTHINK两阶段管道

第一阶段：大纲生成

指令模型生成2-4个高层次推理步骤的简洁大纲，不包含具体计算或最终答案。

系统提示：

You are a reasoning strategist.
Your job is to break down a complex problem into 2–4 high-level reasoning steps.
Focus only on outlining the general approach or strategy.
Do not include any numbers, formulas, or final answers.

第二阶段：验证扩展

思考模型根据大纲进行验证和完成，使用更少的token。

用户提示：

Use only the following steps to solve the problem. Do not change or add steps.
Show the work for each step briefly, and place the final answer in \boxed{}.
Problem: {problem}
Steps: {outline generated by instruct model}