ARS: Adaptive Reasoning Suppression for Efficient Large Reasoning Language Models

Basic Information

  • Paper ID: 2510.00071
  • Title: ARS: Adaptive Reasoning Suppression for Efficient Large Reasoning Language Models
  • Author: Dongqi Zheng (Independent Researcher)
  • Classification: cs.AI cs.CL
  • Publication Date: October 10, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.00071v2

Abstract

Large Reasoning Language Models (LRLMs) demonstrate exceptional capabilities in complex reasoning tasks but suffer from significant computational inefficiency due to the "overthinking" phenomenon. Existing efficient reasoning methods struggle to balance reasoning quality against reduced inference cost. This paper proposes Adaptive Reasoning Suppression (ARS), a novel training-free method that dynamically suppresses redundant reasoning steps through adaptive certainty monitoring while preserving accuracy. ARS introduces a multi-checkpoint certainty estimation mechanism with progressive suppression thresholds, achieving superior efficiency compared to static suppression methods. On mathematical reasoning benchmarks across multiple model architectures, ARS achieves reductions of up to 53%, 46.1%, and 57.9% in tokens, latency, and energy consumption respectively, while maintaining or improving accuracy.

Research Background and Motivation

Problem Definition

Large Reasoning Models (LRMs) such as OpenAI's o1/o3 and DeepSeek-R1 have achieved revolutionary progress in complex tasks including mathematics, programming, and scientific reasoning through sophisticated chain-of-thought (CoT) reasoning mechanisms. However, these models suffer from severe "overthinking" phenomena, where models continue generating redundant reasoning steps even after arriving at correct intermediate solutions.

Problem Significance

The overthinking phenomenon leads to:

  1. Excessive Computational Overhead: Unnecessarily long reasoning times
  2. Resource Waste: Increased token consumption and computational cost
  3. Reduced Efficiency: Hinders practical deployment and applications

Limitations of Existing Methods

Existing solutions fall into three categories:

  1. Prompt-Guided Methods: Guide model reasoning within predefined token budgets
  2. Training-Based Methods: Fine-tune models to achieve concise reasoning
  3. Decoding Operation Methods: Dynamically adjust the reasoning process

These methods commonly suffer from static thresholds and lack of adaptability.

Research Motivation

This paper aims to develop a training-free adaptive method that can:

  • Dynamically monitor model certainty
  • Progressively adjust suppression intensity
  • Significantly improve efficiency while maintaining reasoning quality

Core Contributions

  1. Proposes ARS Framework: The first certainty-guided reasoning suppression method with dynamic suppression through progressive threshold adjustment
  2. Multi-Checkpoint Mechanism: Establishes multiple checkpoints for certainty estimation, overcoming limitations of single-point evaluation
  3. Theoretical Guarantees: Provides theoretical analysis and efficiency guarantees for ARS performance
  4. Comprehensive Evaluation: Validates method effectiveness across multiple model architectures and mathematical reasoning benchmarks
  5. Significant Performance Improvements: Achieves substantial reductions in tokens, latency, and energy consumption while maintaining accuracy

Method Details

Task Definition

Given a reasoning query q and a large reasoning language model π, the standard generation process produces output tokens o = {o₁, o₂, ..., o_T}, where o_t ~ π(· | q, o_{<t}). The objective is to minimize the expected output length E[T] while maintaining reasoning accuracy:

min E[T] subject to E[L(f(o), y)] ≤ ε

where f(o) extracts the final answer from output o, y is the ground truth answer, L is the loss function, and ε is the acceptable accuracy degradation threshold.

Model Architecture

The ARS framework comprises three core components:

1. Multi-Checkpoint Certainty Estimation

  • Establishes multiple checkpoints {c₁, c₂, ..., cₖ} during generation
  • Estimates model certainty at each checkpoint cᵢ through tentative answer probing
  • Uses a heuristic difficulty estimation function:
D(q) = 0.4 · min(1, |q|_words / 80) + 0.4 · Σ_{k∈K} count(k, q) / (3|K|) + 0.2 · min(1, |symbols(q)| / 10)
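
The difficulty heuristic above can be sketched in Python as follows. This is a minimal illustration rather than the paper's implementation: the summary gives neither the keyword set K nor the definition of a "symbol", so KEYWORDS and SYMBOL_PATTERN below are hypothetical placeholders.

```python
import re

# Hypothetical keyword set K; the source does not list its members.
KEYWORDS = ("prove", "integral", "probability", "polynomial", "sequence")
# Hypothetical notion of a "symbol": common math operators and brackets.
SYMBOL_PATTERN = re.compile(r"[+\-*/^=<>()\[\]{}√∑∫]")


def difficulty(query: str, keywords=KEYWORDS) -> float:
    """Heuristic difficulty score D(q) as a weighted sum of three terms."""
    text = query.lower()

    # Length term: word count, saturating at 80 words.
    length_term = min(1.0, len(text.split()) / 80)

    # Keyword term: total keyword occurrences, normalised by 3|K|.
    # As in the formula, this term is not clipped to 1.
    keyword_hits = sum(text.count(k) for k in keywords)
    keyword_term = keyword_hits / (3 * len(keywords))

    # Symbol term: number of math symbols, saturating at 10.
    symbol_term = min(1.0, len(SYMBOL_PATTERN.findall(query)) / 10)

    return 0.4 * length_term + 0.4 * keyword_term + 0.2 * symbol_term
```

Under these placeholder choices, a short arithmetic question scores close to 0, while a longer, symbol-heavy query that mentions several keywords scores noticeably higher.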

2. Progressive Threshold Adaptation

  • Dynamically adjusts suppression thresholds based on reasoning progress patterns
  • Adapts based on certainty trends
  • Supports three modes: FAST, MOD, DeepReflect
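
One possible reading of progressive threshold adaptation is sketched below. Only the three mode names come from the text. The difficulty cut-offs, base thresholds, decay rates, and the trend rule are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Mode:
    name: str
    base_threshold: float  # certainty required before suppression starts
    decay: float           # how much the threshold relaxes per checkpoint


# All numeric values are illustrative placeholders, not taken from the paper.
MODES = {
    "FAST": Mode("FAST", 0.70, 0.05),
    "MOD": Mode("MOD", 0.80, 0.03),
    "DeepReflect": Mode("DeepReflect", 0.90, 0.01),
}


def select_mode(difficulty: float) -> Mode:
    """Map a difficulty score to a mode (cut-offs are assumptions)."""
    if difficulty < 0.3:
        return MODES["FAST"]
    if difficulty < 0.6:
        return MODES["MOD"]
    return MODES["DeepReflect"]


def suppression_threshold(mode: Mode, checkpoint_index: int,
                          certainties: Sequence[float],
                          floor: float = 0.5) -> float:
    """Threshold at a checkpoint: relaxes with progress and with rising certainty."""
    threshold = mode.base_threshold - mode.decay * checkpoint_index
    if len(certainties) >= 2:
        trend = certainties[-1] - certainties[-2]
        # Rising certainty lowers the threshold; falling certainty raises it.
        threshold -= 0.5 * trend
    return max(floor, min(1.0, threshold))
```

In this sketch an easy query starts in FAST mode with a low threshold, so suppression begins early, while a hard query starts in DeepReflect mode and keeps reflecting for longer.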

3. Dynamic Suppression Mechanism

  • Adaptive suppression intensity control
  • Based on trigger word set T = {"Wait", "But", "Alternatively", ...}
  • Suppresses reflective behavior when high certainty is detected
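
The suppression step can be pictured as a penalty on the logits of trigger tokens, applied only once certainty reaches the current threshold. The sketch below operates on a plain token-to-logit dictionary. The penalty rule and the max_penalty parameter are assumptions for illustration, and only the three trigger words named in the text are included.

```python
from typing import Dict

# Trigger words named in the text; the source's list is open-ended.
TRIGGER_WORDS = ("Wait", "But", "Alternatively")


def suppress_triggers(logits: Dict[str, float], certainty: float,
                      threshold: float,
                      max_penalty: float = 10.0) -> Dict[str, float]:
    """Return a copy of `logits` with trigger tokens penalised at high certainty."""
    if certainty < threshold:
        # Certainty too low: leave reflection tokens untouched.
        return dict(logits)

    # Assumed intensity rule: the penalty grows with certainty.
    penalty = max_penalty * certainty
    return {
        token: score - penalty if token.strip() in TRIGGER_WORDS else score
        for token, score in logits.items()
    }
```

In a real decoder the same idea would be applied to token IDs inside the sampling loop. The dictionary form is used here only to keep the example self-contained.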

Technical Innovations

  1. Adaptivity: Unlike static suppression methods, ARS dynamically adapts to each model's reasoning trajectory
  2. Multi-Checkpoint Design: Overcomes instability of single-point evaluation
  3. Progressive Adjustment: Dynamically adjusts suppression strategy based on certainty trends
  4. Training-Free Characteristic: Can be directly deployed to existing models without additional fine-tuning
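
To illustrate why several checkpoints can be more stable than a single one, the sketch below derives certainty from the agreement among recent tentative answers. The agreement-based estimator and the window parameter are assumptions. The text says only that certainty is estimated through tentative answer probing.

```python
from collections import Counter
from typing import Sequence


def checkpoint_certainty(tentative_answers: Sequence[str], window: int = 3) -> float:
    """Share of the last `window` tentative answers that match the most common one."""
    recent = list(tentative_answers)[-window:]
    if not recent:
        return 0.0
    _, count = Counter(recent).most_common(1)[0]
    return count / len(recent)
```

A single probe that happens to return the right answer would look fully certain in isolation. Requiring agreement across several checkpoints lets the score rise only once the tentative answer has stabilised.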

Theoretical Analysis

Theorem 1 (Efficiency Guarantee): For queries with reasoning complexity R(q) ≤ R_max, the output length T_ARS produced by ARS satisfies:

E[T_ARS] ≤ (1 + ε_R) · T* + O(√(log R_max))

with probability at least 1 − δ, where ε_R → 0 as the number of checkpoints increases.

Experimental Setup

Datasets

  • GSM8K: Elementary school mathematics word problem dataset
  • MATH500: High school and university-level mathematics competition problems
  • n = 200 problems are evaluated from each dataset

Evaluation Metrics

  • Acc↑: Accuracy (higher is better)
  • Lat↓: Latency in seconds (lower is better)
  • TPC↓: Tokens per correct answer (lower is better)
  • JPC↓: Joules per correct answer (lower is better)
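
The four metrics can be computed from per-problem measurements roughly as follows. The text gives only the metric names, so two details are assumptions: latency is averaged per problem, and TPC and JPC divide the totals over all problems by the number of correct answers. The Run record and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Run:
    correct: bool     # whether the final answer matched the ground truth
    tokens: int       # generated tokens for this problem
    latency_s: float  # wall-clock seconds for this problem
    energy_j: float   # measured energy in joules for this problem


def summarize(runs: Iterable[Run]) -> Dict[str, float]:
    """Compute Acc, Lat, TPC, and JPC over a set of runs."""
    runs = list(runs)
    if not runs:
        raise ValueError("at least one run is required")
    n_correct = sum(r.correct for r in runs)
    inf = float("inf")
    return {
        "acc": n_correct / len(runs),
        "lat": sum(r.latency_s for r in runs) / len(runs),
        # Totals over all problems, divided by the number of correct answers.
        "tpc": sum(r.tokens for r in runs) / n_correct if n_correct else inf,
        "jpc": sum(r.energy_j for r in runs) / n_correct if n_correct else inf,
    }


def relative_reduction(baseline: float, value: float) -> float:
    """Fractional reduction of `value` relative to `baseline`."""
    return (baseline - value) / baseline
```

Because incorrect runs still add to the numerator, TPC and JPC penalise a method that saves tokens at the cost of accuracy.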

Comparison Methods

  1. Vanilla: Standard generation
  2. TALE: Token-budget-aware LLM reasoning
  3. CGRS: Certainty-guided reflection suppression

Implementation Details

  • Models: Qwen2.5-Math-1.5B/7B-Instruct, DeepSeek-R1-Distill-Qwen-7B
  • Hardware: V100-32GB GPU
  • Maximum token limit: 1200 tokens per response

Experimental Results

Main Results

GSM8K Dataset Performance:

  • Qwen-1.5B: 91.0% accuracy, 27.3% latency reduction, 22.5% token reduction, 24.5% energy reduction
  • Qwen-7B: 94.5% accuracy (8% improvement), 6.3% latency reduction, 16.7% token reduction, 14.3% energy reduction
  • DeepSeek-7B: 93.0% accuracy, 46.1% latency reduction, 43.5% token reduction, 46.6% energy reduction

MATH500 Dataset Performance:

  • On the more challenging MATH500, ARS achieves significant efficiency improvements
  • Token reduction reaches up to 53.0% on DeepSeek-7B model

Key Findings

  1. Variable Efficiency Gains: ARS demonstrates context-dependent performance improvements with maximum token reduction reaching 53%
  2. Accuracy Preservation: Despite efficiency orientation, ARS maintains competitive accuracy across all benchmarks
  3. Architecture-Dependent Performance: DeepSeek-7B shows most consistent improvements, while Qwen models show more variable performance
  4. Multi-Metric Improvements: Beyond tokens, achieves 46.1% latency reduction and 57.9% energy savings

Case Analysis

The paper demonstrates ARS effectiveness through a geometric sequence problem from MATH500:

  • Difficulty-aware mode selection chooses appropriate reasoning depth
  • Progressive certainty monitoring enables early detection of stable confidence
  • Adaptive suppression becomes more aggressive as confidence builds
  • Trend-based adjustment prevents unnecessary reflection loops

Main Research Directions

  1. Prompt Engineering Methods: Guide models to reason within budgets through instructions
  2. Model Training Optimization: Train models to generate concise reasoning
  3. Decoding Strategies: Dynamically adjust the reasoning process

Advantages of This Work

  • Training-free design enables immediate deployment
  • Adaptive mechanism provides finer quality-efficiency trade-offs
  • Multi-checkpoint mechanism improves stability

Conclusions and Discussion

Main Conclusions

ARS addresses key limitations of existing methods by integrating adaptive certainty monitoring, progressive threshold adjustment, and dynamic suppression intensity control. Experiments demonstrate that ARS achieves significant computational efficiency improvements while maintaining or improving accuracy.

Limitations

  1. Maximum Generation Length Constraint: The 1200-token limit may affect accuracy on complex problems
  2. Architecture Dependency: Performance varies significantly across different model architectures
  3. Evaluation Scope: Primarily focused on mathematical reasoning tasks

Future Directions

  1. Extend to broader reasoning paradigms beyond mathematical problem solving
  2. Explore checkpoint-aware scheduling strategies
  3. Develop richer certainty estimation mechanisms tailored to specific model behaviors

In-Depth Evaluation

Strengths

  1. Method Innovation: The first to propose the concept of adaptive reasoning suppression, with a novel technical approach
  2. Theoretical Foundation: Provides theoretical analysis and performance guarantees
  3. Experimental Sufficiency: Comprehensive evaluation across multiple models and datasets
  4. Practical Value: Training-free characteristic enables easy deployment
  5. Significant Performance: Achieves substantial improvements in efficiency metrics

Weaknesses

  1. Evaluation Limitations: Primarily evaluated on mathematical reasoning tasks; generalization remains to be verified
  2. Limited Baselines: Relatively limited comparison methods; lacks more recent approaches
  3. Theoretical Analysis: Theoretical guarantees lack rigorous proofs
  4. Parameter Sensitivity: Lacks sensitivity analysis for key hyperparameters
  5. Computational Overhead: Insufficient analysis of computational overhead from multi-checkpoint mechanism

Impact

  1. Academic Contribution: Provides new research direction for reasoning efficiency optimization
  2. Practical Value: Significant implications for large model deployment
  3. Reproducibility: Clear algorithm description facilitates reproduction

Applicable Scenarios

  1. Resource-Constrained Environments: Mobile devices, edge computing scenarios
  2. Real-Time Applications: Reasoning tasks requiring rapid response
  3. Cost-Sensitive Applications: Commercial applications requiring computational cost control
  4. Mathematical Reasoning Tasks: Primary application domain currently validated

References

The paper cites 21 references covering important work on large language model reasoning, chain-of-thought, mathematical problem solving, and related fields, providing a solid theoretical foundation for the research.


Overall Assessment: This is an important paper making significant contributions to efficiency optimization in large reasoning models. The ARS method is cleverly designed with convincing experimental results, providing an effective solution to the overthinking problem in reasoning models. Despite some limitations, its innovation and practical value make it an important advance in this field.