ARS: Adaptive Reasoning Suppression for Efficient Large Reasoning Language Models

Zheng

Large Reasoning Language Models (LRLMs or LRMs) demonstrate remarkable capabilities in complex reasoning tasks, but suffer from significant computational inefficiencies due to overthinking phenomena. Existing efficient reasoning methods face the challenge of balancing reasoning quality with inference cost reduction. We propose Adaptive Reasoning Suppression (ARS), a novel training-free approach that dynamically suppresses redundant reasoning steps while preserving accuracy through adaptive certainty monitoring. ARS introduces a multi-checkpoint certainty estimation mechanism with progressive suppression thresholds, achieving superior efficiency compared to static suppression methods. Our extensive evaluation across mathematical reasoning benchmarks using multiple model architectures demonstrates that ARS achieves up to 53%, 46.1%, and 57.9% in token, latency and energy reduction, while maintaining or improving accuracy.

Basic Information

Paper ID: 2510.00071
Title: ARS: Adaptive Reasoning Suppression for Efficient Large Reasoning Language Models
Author: Dongqi Zheng (Independent Researcher)
Classification: cs.AI cs.CL
Publication Date: October 10, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.00071v2

Abstract

Large Reasoning Language Models (LRLMs) demonstrate exceptional capabilities in complex reasoning tasks, but suffer from significant computational inefficiency due to the "overthinking" phenomenon. Existing efficient reasoning methods struggle to balance reasoning quality against reduced inference cost. This paper proposes Adaptive Reasoning Suppression (ARS), a novel training-free method that dynamically suppresses redundant reasoning steps through adaptive certainty monitoring while maintaining accuracy. ARS introduces a multi-checkpoint certainty estimation mechanism and progressive suppression thresholds, achieving superior efficiency compared to static suppression methods. On mathematical reasoning benchmarks across multiple model architectures, ARS achieves reductions of up to 53%, 46.1%, and 57.9% in tokens, latency, and energy consumption respectively, while maintaining or improving accuracy.

Research Background and Motivation

Problem Definition

Large Reasoning Models (LRMs) such as OpenAI's o1/o3 and DeepSeek-R1 have achieved revolutionary progress in complex tasks including mathematics, programming, and scientific reasoning through sophisticated chain-of-thought (CoT) reasoning mechanisms. However, these models suffer from severe "overthinking" phenomena, where models continue generating redundant reasoning steps even after arriving at correct intermediate solutions.

Problem Significance

The overthinking phenomenon leads to:

Excessive Computational Overhead: Unnecessarily long reasoning times
Resource Waste: Increased token consumption and computational costs
Low Efficiency: Hinders practical deployment and applications

Limitations of Existing Methods

Existing solutions fall into three categories:

Prompt-Guided Methods: Guide model reasoning within predefined token budgets
Training-Based Methods: Fine-tune models to achieve concise reasoning
Decoding Operation Methods: Dynamically adjust the reasoning process

These methods commonly suffer from static thresholds and lack of adaptability.

Research Motivation

This paper aims to develop a training-free adaptive method that can:

Dynamically monitor model certainty
Progressively adjust suppression intensity
Significantly improve efficiency while maintaining reasoning quality

Core Contributions

Proposes ARS Framework: The first certainty-guided reasoning suppression method with dynamic suppression through progressive threshold adjustment
Multi-Checkpoint Mechanism: Establishes multiple checkpoints for certainty estimation, overcoming limitations of single-point evaluation
Theoretical Guarantees: Provides theoretical analysis and efficiency guarantees for ARS performance
Comprehensive Evaluation: Validates method effectiveness across multiple model architectures and mathematical reasoning benchmarks
Significant Performance Improvements: Achieves substantial reductions in tokens, latency, and energy consumption while maintaining accuracy

Method Details

Task Definition

Given a reasoning query q and a large reasoning language model π, the standard generation process produces output tokens o = {o₁, o₂, ..., o_T}, where o_t ~ π(· | q, o_{<t}). The objective is to minimize the expected output length E[T] while maintaining reasoning accuracy:

min E[T] subject to E[L(f(o), y)] ≤ ε

where f(o) extracts the final answer from output o, y is the ground truth answer, L is the loss function, and ε is the acceptable accuracy degradation threshold.

Model Architecture

The ARS framework comprises three core components:

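1. Multi-Checkpoint Certainty Estimation

Establishes multiple checkpoints {c₁, c₂, ..., cₖ} during generation
Estimates model certainty at each checkpoint cᵢ through tentative answer probing
Uses a heuristic difficulty estimation function, where |q|_words is the number of words in the query, the sum runs over the keywords k in a keyword set K, and symbols(q) denotes the mathematical symbols in q:

D(q) = 0.4 · min(1, |q|_words / 80) + 0.4 · Σ_{k∈K} count(k, q) / (3|K|) + 0.2 · min(1, |symbols(q)| / 10)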
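The three-term heuristic above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the summary does not list the keyword set K or define symbols(q), so `KEYWORDS` and `MATH_SYMBOLS` below are hypothetical placeholders, and counting every symbol occurrence (rather than distinct symbols) is also an assumption.

```python
# Hypothetical keyword set K; the summary does not list the actual keywords.
KEYWORDS = ("prove", "integral", "probability", "polynomial")
# Hypothetical reading of symbols(q): characters treated as math symbols.
MATH_SYMBOLS = set("+-*/=^<>√∑∫π")


def estimate_difficulty(query: str) -> float:
    """Heuristic difficulty score D(q) following the three-term formula."""
    lowered = query.lower()

    # Term 1: query length in words, saturating at 80 words.
    length_term = min(1.0, len(query.split()) / 80)

    # Term 2: total keyword occurrences, normalized by 3|K| (not clamped in the formula).
    keyword_hits = sum(lowered.count(k) for k in KEYWORDS)
    keyword_term = keyword_hits / (3 * len(KEYWORDS))

    # Term 3: number of math-symbol characters, saturating at 10.
    symbol_count = sum(1 for ch in query if ch in MATH_SYMBOLS)
    symbol_term = min(1.0, symbol_count / 10)

    return 0.4 * length_term + 0.4 * keyword_term + 0.2 * symbol_term
```

With these placeholder sets, a short arithmetic query scores close to 0, while a long query that repeats several keywords and contains many symbols scores much higher.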

2. Progressive Threshold Adaptation

Dynamically adjusts suppression thresholds based on reasoning progress patterns
Adapts based on certainty trends
Supports three modes: FAST, MOD, DeepReflect
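One way to picture progressive threshold adaptation is the sketch below. The summary names the three modes but gives neither their thresholds nor the update rule, so the difficulty cut-offs, the base thresholds, the `step` parameter, and the linear trend rule are all illustrative assumptions rather than the paper's values.

```python
# Hypothetical base thresholds per mode; the summary names the modes but not their values.
BASE_THRESHOLD = {"FAST": 0.6, "MOD": 0.75, "DeepReflect": 0.9}


def select_mode(difficulty: float) -> str:
    """Map a difficulty score to a mode (cut-offs are illustrative assumptions)."""
    if difficulty < 0.3:
        return "FAST"
    if difficulty < 0.6:
        return "MOD"
    return "DeepReflect"


def certainty_trend(history: list[float]) -> float:
    """Average change in certainty between consecutive checkpoints."""
    if len(history) < 2:
        return 0.0
    return (history[-1] - history[0]) / (len(history) - 1)


def adapted_threshold(mode: str, history: list[float], step: float = 0.5) -> float:
    """Lower the threshold when certainty is rising, raise it when it is falling."""
    threshold = BASE_THRESHOLD[mode] - step * certainty_trend(history)
    # Keep the threshold in a sane range so suppression never becomes unconditional.
    return min(0.99, max(0.5, threshold))
```

In this sketch a rising certainty history such as [0.5, 0.6, 0.7] relaxes the MOD threshold slightly, so suppression can start earlier, while a falling history tightens it.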

3. Dynamic Suppression Mechanism

Adaptive suppression intensity control
Based on trigger word set T = {"Wait", "But", "Alternatively", ...}
Suppresses reflective behavior when high certainty is detected
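The suppression step can be illustrated as a penalty on trigger words in the next-token scores. This is a minimal sketch, not the paper's implementation: representing logits as a plain dictionary, the `max_penalty` parameter, and the linear intensity rule are assumptions, and only the three trigger words named in the summary are included.

```python
# Trigger words named in the summary; the full set is elided there ("...").
TRIGGER_WORDS = {"Wait", "But", "Alternatively"}


def suppress_triggers(logits: dict[str, float], certainty: float,
                      threshold: float, max_penalty: float = 10.0) -> dict[str, float]:
    """Return a copy of logits with trigger words penalized when certainty is high."""
    if certainty < threshold:
        return dict(logits)  # low certainty: leave reflective tokens untouched

    # Hypothetical intensity rule: the penalty grows linearly with the margin
    # by which certainty exceeds the threshold.
    margin = (certainty - threshold) / max(1e-9, 1.0 - threshold)
    penalty = max_penalty * min(1.0, margin)
    return {tok: score - penalty if tok in TRIGGER_WORDS else score
            for tok, score in logits.items()}
```

For example, with scores {"Wait": 2.0, "So": 1.0}, a certainty of 1.0 against a threshold of 0.8 pushes "Wait" below "So", whereas a certainty of 0.5 leaves the scores unchanged.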

Technical Innovations

Adaptivity: Unlike static suppression methods, ARS dynamically adapts to each model's reasoning trajectory
Multi-Checkpoint Design: Overcomes instability of single-point evaluation
Progressive Adjustment: Dynamically adjusts suppression strategy based on certainty trends
Training-Free Characteristic: Can be directly deployed to existing models without additional fine-tuning

Theoretical Analysis

Theorem 1 (Efficiency Guarantee): For queries with reasoning complexity R(q) ≤ R_max, the output length T_ARS produced by ARS satisfies:

E[T_ARS] ≤ (1 + ε_R) · T* + O(√(log R_max))

with probability at least 1 − δ, where ε_R → 0 as the number of checkpoints increases.

Experimental Setup

Datasets

GSM8K: Elementary school mathematics word problem dataset
MATH500: High school and university-level mathematics competition problems
n = 200 problems are evaluated from each dataset

Evaluation Metrics

Acc↑: Accuracy (higher is better)
Lat↓: Latency in seconds (lower is better)
TPC↓: Tokens per correct answer (lower is better)
JPC↓: Joules per correct answer (lower is better)
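The four metrics can be computed from per-problem records roughly as follows. The summary does not spell out the exact definitions, so this sketch assumes that TPC and JPC divide the total tokens and total energy over all problems by the number of correct answers, and that Lat is the mean per-problem latency. The `RunRecord` type is a hypothetical container introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """Hypothetical per-problem measurement."""
    correct: bool
    latency_s: float
    tokens: int
    joules: float


def summarize(records: list[RunRecord]) -> dict[str, float]:
    """Aggregate Acc, Lat, TPC and JPC under the assumed definitions."""
    if not records:
        raise ValueError("at least one record is required")
    n = len(records)
    n_correct = sum(r.correct for r in records)
    total_tokens = sum(r.tokens for r in records)
    total_joules = sum(r.joules for r in records)
    return {
        "Acc": n_correct / n,
        "Lat": sum(r.latency_s for r in records) / n,
        # Cost per correct answer is undefined when nothing is correct.
        "TPC": total_tokens / n_correct if n_correct else float("inf"),
        "JPC": total_joules / n_correct if n_correct else float("inf"),
    }
```

Under these definitions, a method that saves tokens but loses accuracy is penalized, because the cost of wrong answers is charged to the remaining correct ones.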

Comparison Methods

Vanilla: Standard generation
TALE: Token-budget-aware LLM reasoning
CGRS: Certainty-guided reflection suppression

Implementation Details

Models: Qwen2.5-Math-1.5B/7B-Instruct, DeepSeek-R1-Distill-Qwen-7B
Hardware: V100-32GB GPU
Maximum token limit: 1200 tokens per response

Experimental Results

Main Results

GSM8K Dataset Performance:

Qwen-1.5B: 91.0% accuracy, 27.3% latency reduction, 22.5% token reduction, 24.5% energy reduction
Qwen-7B: 94.5% accuracy (8% improvement), 6.3% latency reduction, 16.7% token reduction, 14.3% energy reduction
DeepSeek-7B: 93.0% accuracy, 46.1% latency reduction, 43.5% token reduction, 46.6% energy reduction
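The reduction percentages above are relative changes. The summary does not state the reference point, so the helper below assumes they are measured against a baseline such as Vanilla generation. The numbers in the usage note are made up for illustration and are not taken from the paper.

```python
def relative_reduction(baseline: float, method: float) -> float:
    """Percentage by which `method` is lower than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - method) / baseline
```

For instance, if a baseline used 1000 tokens per correct answer and a suppression method used 470, `relative_reduction(1000, 470)` gives 53.0, that is, a 53% reduction.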

MATH500 Dataset Performance:

On the more challenging MATH500, ARS achieves significant efficiency improvements
Token reduction reaches up to 53.0% on DeepSeek-7B model

Key Findings

Variable Efficiency Gains: ARS demonstrates context-dependent performance improvements with maximum token reduction reaching 53%
Accuracy Preservation: Despite efficiency orientation, ARS maintains competitive accuracy across all benchmarks
Architecture-Dependent Performance: DeepSeek-7B shows most consistent improvements, while Qwen models show more variable performance
Multi-Metric Improvements: Beyond tokens, achieves up to 46.1% latency reduction and up to 57.9% energy savings

Case Analysis

The paper demonstrates ARS effectiveness through a geometric sequence problem from MATH500:

Difficulty-aware mode selection chooses appropriate reasoning depth
Progressive certainty monitoring enables early detection of confidence stability
Adaptive suppression becomes more aggressive as confidence builds
Trend-based adjustment prevents unnecessary reflection loops

Main Research Directions

Prompt Engineering Methods: Guide models to reason within budgets through instructions
Model Training Optimization: Train models to generate concise reasoning
Decoding Strategies: Dynamically adjust the reasoning process

Advantages of This Work

Training-free design enables immediate deployment
Adaptive mechanism provides finer quality-efficiency trade-offs
Multi-checkpoint mechanism improves stability

Conclusions and Discussion

Main Conclusions

ARS addresses key limitations of existing methods by integrating adaptive certainty monitoring, progressive threshold adjustment, and dynamic suppression intensity control. Experiments demonstrate that ARS achieves significant computational efficiency improvements while maintaining or improving accuracy.

Limitations

Maximum Generation Length Constraint: The 1200-token limit may affect accuracy on complex problems
Architecture Dependency: Performance varies significantly across different model architectures
Evaluation Scope: Primarily focused on mathematical reasoning tasks

Future Directions

Extend to broader reasoning paradigms beyond mathematical problem solving
Explore checkpoint-aware scheduling strategies
Develop richer certainty estimation mechanisms tailored to specific model behaviors

In-Depth Evaluation

Strengths

Method Innovation: First to propose adaptive reasoning suppression concept with novel technical approach
Theoretical Foundation: Provides theoretical analysis and performance guarantees
Experimental Sufficiency: Comprehensive evaluation across multiple models and datasets
Practical Value: Training-free characteristic enables easy deployment
Significant Performance: Achieves substantial improvements in efficiency metrics

Weaknesses

Evaluation Limitations: Primarily evaluated on mathematical reasoning tasks; generalization remains to be verified
Limited Baselines: Relatively limited comparison methods; lacks more recent approaches
Theoretical Analysis: Theoretical guarantees lack rigorous proofs
Parameter Sensitivity: Lacks sensitivity analysis for key hyperparameters
Computational Overhead: Insufficient analysis of computational overhead from multi-checkpoint mechanism

Impact

Academic Contribution: Provides new research direction for reasoning efficiency optimization
Practical Value: Significant implications for large model deployment
Reproducibility: Clear algorithm description facilitates reproduction

Applicable Scenarios

Resource-Constrained Environments: Mobile devices, edge computing scenarios
Real-Time Applications: Reasoning tasks requiring rapid response
Cost-Sensitive Applications: Commercial applications requiring computational cost control
Mathematical Reasoning Tasks: Primary application domain currently validated

References

The paper cites 21 relevant references covering important works in large language model reasoning, chain-of-thought, mathematical problem solving, and related fields, providing solid theoretical foundation for the research.

Overall Assessment: This is an important paper making significant contributions to efficiency optimization in large reasoning models. The ARS method is cleverly designed with convincing experimental results, providing an effective solution to the overthinking problem in reasoning models. Despite some limitations, its innovation and practical value make it an important advance in this field.