
The Price of a Second Thought: On the Evaluation of Reasoning Efficiency in Large Language Models

Fan, Qin, Han et al.
Recent thinking models trained with reinforcement learning and backward-checking CoT often suffer from overthinking: they produce excessively long outputs even on simple problems, wasting computation. Existing evaluations, based on token efficiency, give an incomplete view as they neglect problem difficulty and intermediate computation costs. We formalize reasoning efficiency as a relative measure between thinking and instruct models, treating instruct models as the minimal-effort baseline. A systematic study across four thinking models and multiple benchmarks reveals two consistent patterns: (i) instruct models achieve higher efficiency overall, and (ii) problem difficulty affects efficiency, with thinking models wasting computation on easy problems but providing value on harder ones. Building on this insight, we propose COTHINK, a simple two-stage pipeline: an instruct model drafts a brief outline, and a thinking model expands it. On GSM8K, MATH500, and AIME24, COTHINK cuts token usage by 21.1% while keeping accuracy on four thinking models, and remains competitive with strong efficiency baselines.

Basic Information

  • Paper ID: 2505.22017
  • Title: The Price of a Second Thought: On the Evaluation of Reasoning Efficiency in Large Language Models
  • Authors: Siqi Fan, Bowen Qin, Peng Han, Shuo Shang, Yequan Wang, Aixin Sun
  • Category: cs.CL (Computation and Language)
  • Publication Date: October 14, 2025 (arXiv v2)
  • Paper Link: https://arxiv.org/abs/2505.22017

Abstract

Recent thinking models trained with reinforcement learning and backward-checking chain-of-thought (CoT) suffer from over-thinking: they produce excessively long outputs even on simple problems, wasting computational resources. Existing evaluations based on token efficiency give an incomplete picture because they overlook problem difficulty and intermediate computational costs. This paper formalizes reasoning efficiency as a relative metric between thinking models and instruction models, treating instruction models as the minimal-effort baseline. A systematic study of four thinking models across multiple benchmarks reveals two consistent patterns: (i) instruction models achieve higher efficiency overall, and (ii) problem difficulty affects efficiency, with thinking models wasting computation on simple problems but providing value on difficult ones. Based on these insights, the paper proposes COTHINK, a simple two-stage pipeline in which an instruction model drafts a brief outline and a thinking model expands it. On GSM8K, MATH500, and AIME24, COTHINK reduces token usage by 21.1% across four thinking models while maintaining accuracy, and remains competitive with strong efficiency baselines.

Research Background and Motivation

Problem Definition

  1. Over-thinking Problem: Recent thinking models excel at mathematical reasoning tasks but suffer from severe over-thinking. These models produce 5-10 times longer outputs than standard instruction-tuned models even on simple problems.
  2. Evaluation Limitations: Existing reasoning efficiency evaluation methods have two main issues:
    • They ignore the relative nature of over-thinking and under-thinking, phenomena only observable through comparative analysis
    • They overlook intermediate computational costs, such as generating multiple candidate solutions in best-of-N sampling
  3. Computational Resource Waste: On AIME24, average output length rises from 770 tokens for Qwen2.5-32B-Instruct to 6,067 tokens for QwQ, roughly an 8-fold increase in generation cost.

Research Motivation

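Existing evaluation methods rely on single-model token efficiency, τ(M, D) = Q(D) / C_M(D): the quality Q achieved on dataset D divided by the computational cost C_M that model M spends on it. Because this is an absolute metric, it cannot show whether a thinking model over-thinks or under-thinks relative to a cheaper baseline. The paper therefore argues for a relative efficiency framework to assess thinking models.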

Core Contributions

  1. Proposes Relative Reasoning Efficiency Evaluation Framework: Defines reasoning efficiency as a relative metric between thinking and instruction models: η(M_R, M_I) = τ(M_R, D) / τ(M_I, D)
  2. Identifies Two Key Patterns:
    • Instruction models show higher token efficiency overall
    • Problem difficulty strongly affects efficiency, with thinking models over-computing on simple problems but providing value on difficult ones
  3. Proposes COTHINK Two-Stage Collaborative Pipeline: Combines the conciseness of instruction models with the verification capabilities of thinking models
  4. Achieves Significant Efficiency Gains: Reduces token usage by 21.1% on average across three mathematical benchmarks while improving accuracy by 1.66%

Methodology Details

Task Definition

This paper investigates computational efficiency in mathematical reasoning tasks. The input is a mathematical problem and the output is a solution process ending in a final answer. The objective is to minimize computational cost while maintaining accuracy.

Relative Efficiency Evaluation Framework

Core Formula

Relative reasoning efficiency is defined as:

η(M_R, M_I) = τ(M_R, D) / τ(M_I, D)

where M_R is the thinking model, M_I is the instruction model, and τ(M, D) = Q(D) / C_M(D) is the traditional token efficiency.
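
The two ratios above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's evaluation code: it assumes that quality Q is Pass@1 accuracy and that cost C is the mean number of generated tokens per problem. The `Record` type and the numbers are made up for the example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Record:
    """One model response to one problem (hypothetical record format)."""
    correct: bool  # whether the final answer matched the reference
    tokens: int    # tokens generated for this problem


def token_efficiency(records: Sequence[Record]) -> float:
    """tau(M, D) = Q(D) / C_M(D), assuming Q = accuracy and C = mean tokens."""
    if not records:
        raise ValueError("need at least one record")
    quality = sum(r.correct for r in records) / len(records)
    cost = sum(r.tokens for r in records) / len(records)
    return quality / cost


def relative_efficiency(thinking: Sequence[Record],
                        instruct: Sequence[Record]) -> float:
    """eta(M_R, M_I) = tau(M_R, D) / tau(M_I, D)."""
    return token_efficiency(thinking) / token_efficiency(instruct)


# Made-up numbers for illustration only, not results from the paper.
instruct_runs = [Record(True, 200), Record(False, 200)]    # accuracy 0.5, 200 tokens
thinking_runs = [Record(True, 1000), Record(True, 1000)]   # accuracy 1.0, 1000 tokens
eta = relative_efficiency(thinking_runs, instruct_runs)
print(f"eta = {eta:.2f}")  # eta = 0.40
```

In this toy case the thinking model doubles accuracy but spends five times the tokens, so η = 0.4 < 1: the extra accuracy does not pay for the extra computation. The sketch does not handle an instruction model with zero accuracy, where the ratio is undefined.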

Efficiency Scaling Law Assumption

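Assume the test-time scaling law Q(C) ∝ C^β with β < 1, and that both models follow the same curve. Token efficiency then scales as τ = Q/C ∝ C^(β−1), so reasoning efficiency can be approximated as:

η ≈ (C_R/C_I)^(β−1)

where C_R and C_I are the computational costs of the thinking and instruction models. Because β − 1 < 0, η falls below 1 whenever the thinking model spends more computation than the instruction model.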

COTHINK Two-Stage Pipeline

Stage One: Outline Generation

The instruction model generates a concise outline of 2-4 high-level reasoning steps without specific calculations or final answers.

System Prompt:

You are a reasoning strategist.
Your job is to break down a complex problem into 2–4 high-level reasoning steps.
Focus only on outlining the general approach or strategy.
Do not include any numbers, formulas, or final answers.

Stage Two: Verification and Expansion

The thinking model verifies the outline and completes the solution by following it, which requires fewer tokens than solving the problem from scratch.

User Prompt:

Use only the following steps to solve the problem. Do not change or add steps.
Show the work for each step briefly, and place the final answer in \boxed{}.
Problem: {problem}
Steps: {outline generated by instruct model}
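
The two stages can be wired together as below. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Generate` callable interface is hypothetical, the stage-one user message is assumed to be the bare problem text, and the stage-two system prompt is assumed to be empty. Only the prompt wording comes from the text above. The two `fake_*` functions are stand-ins so the sketch runs without a model.

```python
from typing import Callable

OUTLINE_SYSTEM_PROMPT = (
    "You are a reasoning strategist.\n"
    "Your job is to break down a complex problem into 2–4 high-level reasoning steps.\n"
    "Focus only on outlining the general approach or strategy.\n"
    "Do not include any numbers, formulas, or final answers."
)

# Doubled braces survive str.format as a literal "{}" after \boxed.
SOLVE_USER_TEMPLATE = (
    "Use only the following steps to solve the problem. Do not change or add steps.\n"
    "Show the work for each step briefly, and place the final answer in \\boxed{{}}.\n"
    "Problem: {problem}\n"
    "Steps: {outline}"
)

# Hypothetical interface: (system_prompt, user_prompt) -> generated text.
Generate = Callable[[str, str], str]


def build_solve_prompt(problem: str, outline: str) -> str:
    """Fill the stage-two user prompt with the problem and the drafted outline."""
    return SOLVE_USER_TEMPLATE.format(problem=problem, outline=outline.strip())


def cothink(problem: str, instruct_generate: Generate,
            thinking_generate: Generate) -> str:
    """Stage one drafts an outline; stage two expands it into a full solution."""
    outline = instruct_generate(OUTLINE_SYSTEM_PROMPT, problem)
    return thinking_generate("", build_solve_prompt(problem, outline))


# Stand-in models so the sketch runs without any inference backend.
def fake_instruct(system: str, user: str) -> str:
    return "1. Set up the equation.\n2. Solve for the unknown.\n"


def fake_thinking(system: str, user: str) -> str:
    return "Following the given steps... \\boxed{4}"


print(cothink("What is x if x + 3 = 7?", fake_instruct, fake_thinking))
```

Passing the models in as callables keeps the sketch independent of any serving stack; in practice each callable would wrap whatever inference client is in use.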

Technical Innovations

  1. Dynamic Difficulty Adaptation: Problem difficulty does not need to be assessed in advance; the thinking model adjusts its verification effort according to outline quality
  2. Complementary Advantages Integration: On simple tasks the outline is usually correct and the thinking model converges quickly; on difficult tasks the outline gives the thinking model a structured starting point
  3. Deployment-Friendly: Requires no architectural modifications and can be directly applied to existing models

Experimental Setup

Datasets

Three mathematical reasoning benchmarks with increasing difficulty:

  • GSM8K: Elementary level, 1,319 samples, solution length 48-1,070 tokens
  • MATH500: High school level, 500 samples, solution length 45-3,360 tokens
  • AIME24: Competition level (invitational high school contest), 30 samples, solution length 284-4,010 tokens

Model Configuration

Evaluates 5 representative 32B-scale models:

  • Qwen2.5-32B-Instruct: General instruction model (baseline)
  • DAPO: Thinking model trained with RL only
  • DeepSeek-R1-Distill: Distillation-based thinking model
  • QwQ: SFT+RL-trained thinking model
  • Qwen3: Hybrid thinking model (supports thinking/non-thinking modes)

Evaluation Metrics

  • Pass@1: First-attempt correctness rate
  • #Tokens: Total tokens generated per problem
  • Token Efficiency τ: Quality/cost ratio
  • Reasoning Efficiency η: Efficiency relative to instruction model
  • Win Rate: Proportion of evaluation points at which COTHINK holds the advantage over the compared baseline

Baseline Methods

  • Solo-Thinking: Single model solving independently
  • Best-of-N Sampling: Generate N=5 candidate solutions, select shortest
  • No-Thinking: Skip thinking process and generate directly
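
The Best-of-N baseline illustrates why intermediate computation matters for efficiency. The sketch below is an assumption-laden illustration, not the paper's code: `sample` and `count_tokens` are hypothetical callables, and the cost is taken to be the tokens of all N candidates rather than only the selected one.

```python
from typing import Callable, List, Tuple


def best_of_n_shortest(
    problem: str,
    sample: Callable[[str], str],
    count_tokens: Callable[[str], int],
    n: int = 5,
) -> Tuple[str, int]:
    """Return the shortest of n sampled solutions and the total tokens spent."""
    candidates: List[str] = [sample(problem) for _ in range(n)]
    lengths = [count_tokens(c) for c in candidates]
    best = candidates[lengths.index(min(lengths))]
    # Every candidate was generated, so every candidate is charged to the cost.
    return best, sum(lengths)


# Canned responses and whitespace tokenization stand in for a real model.
canned = iter(["a b c d", "a b", "a b c", "a b c d e", "a b c"])
answer, total_cost = best_of_n_shortest(
    "2 + 2 = ?",
    sample=lambda _problem: next(canned),
    count_tokens=lambda text: len(text.split()),
)
print(answer, total_cost)  # a b 17
```

The selected answer has only 2 tokens, but producing it cost 17. Reporting the 2 alone would overstate the efficiency of the method.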

Experimental Results

Main Results

Relative Efficiency Analysis Findings

  1. Observation 1: Instruction models show higher token efficiency; most thinking models have η < 1
  2. Observation 2: Problem difficulty affects reasoning efficiency: thinking models waste computation on simple problems but provide value on complex tasks

COTHINK Performance

  • Overall Win Rate: 61.7% (37/60 evaluation points)
  • Task-Specific Win Rates:
    • GSM8K: 37.5% (substantial room for improvement on simple tasks)
    • MATH500: 87.5% (best performance, on high school level tasks)
    • AIME24: 60% (good performance on competition level tasks)

Efficiency Gains

  • Average Token Reduction: 21.1%, up to 41.8%
  • Accuracy Improvement: Average 1.66%
  • Model Ranking (efficiency improvement): QwQ > DeepSeek-R1-Distill > DAPO

Case Studies

AIME24 Case Study

Comparative analysis reveals three scenarios:

  1. 5 problems: Both models succeed; the instruction model is concise and the thinking model is verbose
  2. 16 problems: Only the thinking model succeeds (by correcting errors through verification)
  3. 9 problems: Both models fail

Key Finding: When the thinking model's episodes are supplied to the instruction model as a prefix, the instruction model solves the problems using only 27.5% of the episodes and 11.9% of the tokens.

Ablation Studies

Inefficiency Source Analysis

  1. Algorithmic Inefficiency: RL training may reduce information density per step, encouraging more verbose generation
  2. Data Distribution Inefficiency: Training on backward-checking CoT produces multi-episode verification patterns that persist during inference

Impact of Different Training Strategies

  • SFT-trained Models (QwQ, DeepSeek-R1-Distill) better follow COTHINK outline instructions
  • Pure RL-trained Models (DAPO) follow the outline less consistently but are still guided effectively on tasks like MATH500

Related Work

Token Efficiency Research

Existing approaches to address over-thinking include:

  • Restricting output length through prompting
  • Encouraging early stopping
  • RL training with length penalties
  • SFT on short solutions

Hybrid Reasoning Methods

Recent work explores adaptive task allocation:

  • Qwen3 and NoThinking use hard-coded switching rules
  • Key challenge: LLMs cannot perceive problem difficulty during prefill stage

Sketch Prompting Engineering

COTHINK is inspired by sketch prompting, with related concurrent work including:

  • Thought Manipulation: Inserting pre-generated CoT between thinking tags
  • Scot: Lightweight models drafting multiple CoT sketches in parallel

Conclusions and Discussion

Main Conclusions

  1. Importance of Relative Efficiency Evaluation: Traditional token efficiency evaluation is insufficient; relative perspectives are needed
  2. Difficulty-Dependent Efficiency Patterns: Over-thinking on simple problems, value demonstrated on complex ones
  3. Effectiveness of Collaborative Pipeline: COTHINK successfully combines complementary advantages of both model types

Limitations

  1. Limited Improvement on Simple Tasks: Win rate only 37.5% on simple tasks like GSM8K
  2. Dependence on Outline Quality: Second-stage performance affected by first-stage outline quality
  3. Restricted Evaluation Scope: Primarily validated on mathematical reasoning; applicability to other domains remains uncertain

Future Directions

  1. Extension to Other Reasoning Tasks: Code generation, logical reasoning, etc.
  2. Dynamic Outline Adjustment: Refine outlines based on thinking model feedback
  3. End-to-End Optimization: Joint training of two-stage models

In-Depth Evaluation

Strengths

  1. Clear Problem Definition: Accurately identifies over-thinking in thinking models
  2. Innovative Evaluation Framework: Relative efficiency evaluation more reasonable than traditional absolute metrics
  3. Simple and Effective Method: COTHINK design is intuitive and easy to implement and deploy
  4. Comprehensive Experiments: Covers multiple models, datasets, and evaluation dimensions
  5. Theoretical Framing: Connects relative efficiency to test-time scaling laws

Weaknesses

  1. Limited Theoretical Foundation: Efficiency scaling law assumptions lack rigorous proof
  2. Simple Outline Generation Strategy: First-stage prompt engineering relatively crude
  3. Insufficient Cross-Domain Validation: Only validated on mathematical reasoning tasks
  4. Incomplete Computational Overhead Analysis: Lacks detailed analysis of two-stage pipeline overhead

Impact

  1. Academic Contribution: Provides new perspective on reasoning efficiency evaluation, potentially influencing future evaluation standards
  2. Practical Value: COTHINK directly applicable to existing systems, reducing inference costs
  3. Reproducibility: Clear method description with commitment to open-source code

Applicable Scenarios

  1. Computationally Constrained Environments: Scenarios requiring balance between accuracy and efficiency
  2. Mixed-Difficulty Tasks: Applications containing both simple and complex problems
  3. Real-Time Inference Systems: Interactive systems with response time requirements

Overall Assessment: The paper makes a solid contribution to reasoning efficiency evaluation and optimization. Its relative efficiency framework exposes difficulty-dependent over-thinking in thinking models, and the COTHINK collaborative pipeline offers a simple, deployable way to reduce it. The main limitations are the weak gains on simple tasks, the dependence on outline quality, and validation restricted to mathematical reasoning.