Think Just Enough: Sequence-Level Entropy as a Confidence Signal for LLM Reasoning
Sharma, Chopra
We introduce a simple, yet novel entropy-based framework to drive token efficiency in large language models during reasoning tasks. Our approach uses Shannon entropy from token-level logprobs as a confidence signal to enable early stopping, achieving 25-50% computational savings while maintaining task accuracy. Crucially, we demonstrate that entropy-based confidence calibration represents an emergent property of advanced post-training optimization present in modern reasoning models but notably absent in standard instruction-tuned and pre-trained models (Llama 3.3 70B). We show that the entropy threshold to stop reasoning varies from model to model but can be calculated easily in one shot using only a few examples from existing reasoning datasets. Our results indicate that advanced reasoning models often know that they've gotten a correct answer early on, and that this emergent confidence awareness can be exploited to save tokens and reduce latency. The framework demonstrates consistent performance across reasoning-optimized model families with 25-50% computational cost reduction while preserving accuracy, revealing that confidence mechanisms represent a distinguishing characteristic of modern post-trained reasoning systems versus their predecessors.
This study proposes an entropy-based framework for early stopping in large language model reasoning tasks. It uses Shannon entropy, computed from token-level log-probabilities, as a confidence signal and achieves 25-50% computational savings while maintaining task accuracy. The key finding is that entropy-based confidence calibration is an emergent property of advanced post-training optimization in modern reasoning models, but is notably absent in standard instruction-tuned and pre-trained models (such as Llama 3.3 70B). The research shows that advanced reasoning models often know early on that they have arrived at a correct answer, and that this emergent confidence awareness can be exploited to save tokens and reduce latency.
As large language models approach saturation in reasoning benchmark performance, the computational cost of reasoning inference continues to escalate, with reasoning costs for individual difficult problems potentially reaching thousands of dollars. This high cost and associated latency have motivated researchers to seek methods for reducing token usage without compromising accuracy.
This paper addresses that need with a universal framework based on Shannon entropy, which provides principled algorithmic interventions for confidence estimation in LLM mathematical reasoning. The approach is grounded in information theory and statistical decision theory, offering both theoretical rigor and practical applicability.
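To make the confidence signal concrete, the following is a minimal sketch of how a sequence-level entropy could be computed from token-level log-probabilities. Two points are assumptions and are not stated in this summary: the top-k log-probabilities at each position are renormalized to sum to 1 (the full vocabulary distribution is usually not exposed), and the sequence-level value is taken as the mean of the per-token entropies. The function names are hypothetical.

```python
import math
from typing import List


def token_entropy(top_logprobs: List[float]) -> float:
    """Shannon entropy (in nats) at one token position.

    Assumption: only the top-k log-probabilities are available, so they are
    renormalized to form a proper distribution before computing entropy.
    """
    probs = [math.exp(lp) for lp in top_logprobs]
    total = sum(probs)
    probs = [p / total for p in probs]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sequence_entropy(per_token_top_logprobs: List[List[float]]) -> float:
    """Mean per-token entropy over a generated sequence.

    Assumption: averaging is used as the aggregation; other choices
    (sum, max) are equally plausible readings of "sequence-level".
    """
    if not per_token_top_logprobs:
        raise ValueError("sequence must contain at least one token")
    entropies = [token_entropy(lps) for lps in per_token_top_logprobs]
    return sum(entropies) / len(entropies)


# Illustrative, made-up distributions: one peaked, one nearly flat.
confident = [[math.log(0.97), math.log(0.02), math.log(0.01)]] * 4
uncertain = [[math.log(0.40), math.log(0.35), math.log(0.25)]] * 4
```

Under these assumptions the peaked sequence yields a lower entropy than the flat one, which is the property the early-stopping rule relies on.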
Accuracy Preservation: Maintains task accuracy while achieving 25-50% computational savings with no statistically significant performance degradation
Practical Deployment: Achieves threshold equivalence with minimal samples (5-10), supporting rapid deployment across diverse reasoning benchmarks
Enhanced Token Budget Framework: A computational allocation scheme that redirects savings from simple, low-uncertainty problems to difficult, high-uncertainty problems
Theoretical Foundation: Four mathematically principled threshold methods based on information theory and Bayesian decision theory
Given a reasoning problem q, model M, and threshold τ, the system must decide whether to stop after the first reasoning step (when confidence is sufficiently high) or continue expanding reasoning. The input is a reasoning problem, the output is an answer, and the constraint is minimizing computational cost while maintaining accuracy.
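The decision described above can be sketched as a single comparison against τ. This is an illustrative outline only: `first_step` and `continue_reasoning` are hypothetical callables standing in for calls to model M, and treating "entropy at or below τ" as "confidence sufficiently high" (including the choice of `<=` rather than `<`) is an assumption.

```python
from typing import Callable, Tuple


def answer_with_early_stop(
    question: str,
    first_step: Callable[[str], Tuple[str, float]],
    continue_reasoning: Callable[[str, str], str],
    tau: float,
) -> Tuple[str, bool]:
    """Return (answer, stopped_early) for one reasoning problem.

    first_step(question) -> (draft_answer, entropy_of_first_step)
    continue_reasoning(question, draft_answer) -> final_answer
    Both callables are hypothetical placeholders for model calls.
    """
    draft, entropy = first_step(question)
    if entropy <= tau:
        # Low entropy is read as high confidence: keep the draft, spend no more tokens.
        return draft, True
    # Otherwise pay for additional reasoning.
    return continue_reasoning(question, draft), False
```

The expensive continuation is only invoked when the first step's entropy exceeds τ, which is where the token savings would come from.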
Scale-Invariant Universal Method: Adapts to different model characteristics through effect size normalization
$$\tau_{\text{universal}} = \mu_c + \frac{|d|}{1 + |d|} \times (\mu_i - \mu_c) \times \max\left(0,\ 1 - \frac{\sigma_c}{\mu_c}\right)$$
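As a worked illustration, the universal threshold, read as τ = μ_c + |d|/(1+|d|) × (μ_i − μ_c) × max(0, 1 − σ_c/μ_c), could be computed from a small calibration set as sketched below. The symbol meanings are assumptions, since this summary does not define them: μ_c and σ_c are taken to be the mean and sample standard deviation of entropy on correctly answered examples, μ_i the mean entropy on incorrectly answered ones, and d Cohen's d with a pooled standard deviation. The function name is hypothetical.

```python
import math
import statistics
from typing import Sequence


def universal_threshold(
    correct_entropies: Sequence[float],
    incorrect_entropies: Sequence[float],
) -> float:
    """Compute tau_universal from per-example entropies (assumed definitions)."""
    n_c, n_i = len(correct_entropies), len(incorrect_entropies)
    if n_c < 2 or n_i < 2:
        raise ValueError("need at least two examples in each group")

    mu_c = statistics.mean(correct_entropies)
    mu_i = statistics.mean(incorrect_entropies)
    sigma_c = statistics.stdev(correct_entropies)    # sample std (n - 1)
    sigma_i = statistics.stdev(incorrect_entropies)

    # Assumption: d is Cohen's d using the pooled standard deviation.
    pooled = math.sqrt(
        ((n_c - 1) * sigma_c**2 + (n_i - 1) * sigma_i**2) / (n_c + n_i - 2)
    )
    if pooled == 0.0 or mu_c == 0.0:
        raise ValueError("degenerate calibration data (zero spread or zero mean)")
    d = (mu_i - mu_c) / pooled

    effect_weight = abs(d) / (1.0 + abs(d))           # in [0, 1)
    stability = max(0.0, 1.0 - sigma_c / mu_c)        # shrinks when correct-group entropy is noisy
    return mu_c + effect_weight * (mu_i - mu_c) * stability


# Made-up calibration entropies, for illustration only.
tau = universal_threshold([0.1, 0.2, 0.3], [0.9, 1.0, 1.1])
```

With well-separated groups the threshold lands between the two group means, and it collapses toward μ_c as the effect size shrinks or as the correct-group entropies become noisier relative to their mean.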
The paper cites important works in related fields, including early exit methods (DeeBERT, CALM), entropy-based stopping strategies (HALT-CoT, AdaDec), and confidence estimation research, providing solid theoretical foundations and comparative baselines for this work.
Overall Assessment: This is a high-quality research paper with significant contributions in theoretical innovation, experimental validation, and practical value. In particular, the finding that confidence calibration is an emergent property offers new insight into the capabilities of modern LLMs. The method is simple and effective, and it has broad application prospects.