
Think Just Enough: Sequence-Level Entropy as a Confidence Signal for LLM Reasoning

Sharma, Chopra

We introduce a simple, yet novel entropy-based framework to drive token efficiency in large language models during reasoning tasks. Our approach uses Shannon entropy from token-level logprobs as a confidence signal to enable early stopping, achieving 25-50% computational savings while maintaining task accuracy. Crucially, we demonstrate that entropy-based confidence calibration represents an emergent property of advanced post-training optimization present in modern reasoning models but notably absent in standard instruction-tuned and pre-trained models (Llama 3.3 70B). We show that the entropy threshold to stop reasoning varies from model to model but can be calculated easily in one shot using only a few examples from existing reasoning datasets. Our results indicate that advanced reasoning models often know that they've gotten a correct answer early on, and that this emergent confidence awareness can be exploited to save tokens and reduce latency. The framework demonstrates consistent performance across reasoning-optimized model families with 25-50% computational cost reduction while preserving accuracy, revealing that confidence mechanisms represent a distinguishing characteristic of modern post-trained reasoning systems versus their predecessors.



Basic Information

Paper ID: 2510.08146
Title: Think Just Enough: Sequence-Level Entropy as a Confidence Signal for LLM Reasoning
Authors: Aman Sharma, Paras Chopra (Lossfunk)
Categories: cs.LG cs.AI
Published: October 16, 2025 (arXiv v2)
Paper link: https://arxiv.org/abs/2510.08146v2

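Abstract

This work proposes a novel entropy-based framework that uses Shannon entropy as a confidence signal to enable early stopping in large language model reasoning tasks, achieving 25-50% computational savings while maintaining task accuracy. The key finding is that entropy-based confidence calibration is an emergent property of advanced post-training optimization in modern reasoning models, and is notably absent in standard instruction-tuned and pre-trained models (such as Llama 3.3 70B). The study shows that advanced reasoning models often know early on that they have reached a correct answer, and that this emergent confidence awareness can be exploited to save tokens and reduce latency.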

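Background and Motivation

Problem Definition

As large language models approach saturation on reasoning benchmarks, the cost of reasoning at inference time keeps climbing: a single hard problem can cost thousands of dollars to solve. This cost, and the latency that comes with it, motivates the search for ways to reduce token usage without hurting accuracy.

Limitations of Existing Methods

Current approaches to compute optimization in reasoning tasks lack theoretical grounding and do not apply uniformly across model architectures:

Existing confidence measures rely on ad hoc thresholds or simple heuristics
They fail to generalize across model scales or reasoning domains
A critical gap remains between theoretical foundations and practical deployment needs

Motivation

The paper addresses this gap by introducing a general framework based on Shannon entropy, providing a principled algorithmic intervention for confidence estimation in LLM mathematical reasoning. The method draws on information theory and statistical decision theory, aiming for both theoretical rigor and practical applicability.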

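Core Contributions

Accuracy preservation: Achieves 25-50% computational savings while maintaining task accuracy, with no statistically significant degradation
Practical deployment: Reaches threshold equivalence with very few samples (5-10), supporting rapid deployment across diverse reasoning benchmarks
Enhanced token budget framework: A compute allocation scheme that shifts the saved resources from easy, low-uncertainty problems to hard, high-uncertainty problems
Theoretical foundation: Four mathematically principled threshold methods grounded in information theory and Bayesian decision theory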

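Method Details

Task Definition

Given a reasoning problem q, a model M, and a threshold τ, the system must decide whether to stop after the first reasoning step (when confidence is high enough) or to continue with extended reasoning. The input is a reasoning problem and the output is an answer; the constraint is to minimize computational cost while maintaining accuracy.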
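The decision described above can be sketched as a small control-flow wrapper. This is only an illustration under stated assumptions: the names `answer_with_early_stop`, `first_step`, and `extended` are hypothetical, the model calls are abstracted as plain callables, and the rule "stop when the score is at or below τ" assumes that a lower score (lower entropy) means higher confidence.

```python
from typing import Callable, Tuple

# Hypothetical interface (not from the paper): the first reasoning step
# returns (answer, score), where a LOWER score means higher confidence.
FirstStep = Callable[[str], Tuple[str, float]]
Extended = Callable[[str], str]


def answer_with_early_stop(
    question: str,
    first_step: FirstStep,
    extended: Extended,
    tau: float,
) -> Tuple[str, bool]:
    """Return (answer, stopped_early) for one question."""
    answer, score = first_step(question)
    # Assumption: stop when the confidence score is at or below the threshold.
    if score <= tau:
        return answer, True
    # Otherwise spend additional compute on extended reasoning.
    return extended(question), False
```

Keeping the model calls behind callables means the same wrapper could be tried with any backend that exposes token logprobs; the extended path is only invoked when the first step is not confident enough.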

Core Technical Framework

Shannon Entropy as a Confidence Signal

The confidence measure is the Shannon entropy of the top-k token logprobs (k=20):

Logprob normalization: $p_i = \frac{e^{\ell_i}}{\sum_{j=1}^{20} e^{\ell_j}}$
Shannon entropy: $H = -\sum_{i=1}^{20} p_i \log_2 p_i$
Sequence-level confidence signal: $H_{mean} = \frac{1}{T} \sum_{t=1}^T H_t$
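The three formulas above translate fairly directly into code. The sketch below is a minimal illustration, assuming the top-k logprobs for each generated token are already available as lists of floats; the function names `token_entropy` and `mean_sequence_entropy` are made up for this example. Subtracting the maximum logprob before exponentiating is a numerical-stability detail added here and does not change the normalized probabilities.

```python
import math
from typing import Sequence


def token_entropy(top_logprobs: Sequence[float]) -> float:
    """Shannon entropy (in bits) of the renormalized top-k distribution."""
    # Renormalize over the top-k candidates: p_i = exp(l_i) / sum_j exp(l_j).
    # Shifting by the max logprob leaves p_i unchanged but avoids underflow.
    shift = max(top_logprobs)
    weights = [math.exp(lp - shift) for lp in top_logprobs]
    total = sum(weights)
    probs = [w / total for w in weights]
    # H = -sum_i p_i * log2(p_i); zero-probability terms contribute nothing.
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def mean_sequence_entropy(per_token_logprobs: Sequence[Sequence[float]]) -> float:
    """H_mean: average of the per-token entropies over T generated tokens."""
    entropies = [token_entropy(lps) for lps in per_token_logprobs]
    return sum(entropies) / len(entropies)
```

With k=20 the entropy of a single token lies between 0 bits (all mass on one candidate) and log2(20), about 4.32 bits (uniform over the 20 candidates).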

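Four Threshold Methods

Entropy Mean: Uses the mean of the entropy distribution of correct answers as the threshold, $\tau_{mean} = \mu_c$
Information-Theoretic Optimal: Maximizes information gain using logarithmic scaling and the effect size, $\tau_{info} = \mu_c + \sigma_c \times \ln(1 + |d|)$
Bayesian Optimal: The mathematically optimal decision boundary that minimizes classification error under a Gaussian assumption, $\tau_{bayes} = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$
Scale-Invariant Universal: Adapts to different model characteristics through effect-size normalization, $\tau_{universal} = \mu_c + \frac{\sqrt{|d|}}{1+\sqrt{|d|}} \times (\mu_i - \mu_c) \times \max(0, 1-\frac{\sigma_c}{\mu_c})$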
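To make the four formulas concrete, here is a rough sketch that computes them from two small calibration samples of sequence-level entropies. Several details are assumptions rather than statements from the summary: $\mu_i$ and $\sigma_i$ are taken to be the mean and standard deviation of entropies for incorrect answers, $d$ is taken to be Cohen's d with a pooled standard deviation, and the quadratic coefficients $a$, $b$, $c$ are derived by equating two Gaussian log-densities with equal class priors. When the quadratic has two roots, the one nearest the midpoint of the two means is chosen, which is also a choice made only for this example.

```python
import math
from statistics import mean, stdev
from typing import Dict, Sequence


def entropy_thresholds(correct: Sequence[float],
                       incorrect: Sequence[float]) -> Dict[str, float]:
    """Compute the four thresholds from entropies of correct/incorrect answers."""
    mu_c, mu_i = mean(correct), mean(incorrect)
    sd_c, sd_i = stdev(correct), stdev(incorrect)

    # Assumption: d is Cohen's d with a pooled standard deviation.
    pooled = math.sqrt((sd_c ** 2 + sd_i ** 2) / 2)
    d = (mu_i - mu_c) / pooled

    tau_mean = mu_c
    tau_info = mu_c + sd_c * math.log(1 + abs(d))
    root_d = math.sqrt(abs(d))
    tau_universal = (mu_c + root_d / (1 + root_d) * (mu_i - mu_c)
                     * max(0.0, 1 - sd_c / mu_c))

    # Assumption: equal priors. Setting log N(x; mu_c, sd_c) equal to
    # log N(x; mu_i, sd_i) gives a*x^2 + b*x + c = 0 with:
    a = 1 / (2 * sd_i ** 2) - 1 / (2 * sd_c ** 2)
    b = mu_c / sd_c ** 2 - mu_i / sd_i ** 2
    c = (mu_i ** 2 / (2 * sd_i ** 2) - mu_c ** 2 / (2 * sd_c ** 2)
         + math.log(sd_i / sd_c))

    if abs(a) < 1e-12:
        # (Near-)equal variances: the boundary equation is linear.
        roots = [-c / b]
    else:
        disc = max(b * b - 4 * a * c, 0.0)
        # Numerically stable form of the quadratic formula.
        q = -0.5 * (b + math.copysign(math.sqrt(disc), b))
        roots = [q / a] + ([c / q] if q != 0 else [])

    # Illustrative choice: keep the root closest to the midpoint of the means.
    midpoint = (mu_c + mu_i) / 2
    tau_bayes = min(roots, key=lambda r: abs(r - midpoint))

    return {"mean": tau_mean, "info": tau_info,
            "bayes": tau_bayes, "universal": tau_universal}
```

A call such as `entropy_thresholds([0.2, 0.3, 0.4], [0.5, 0.8, 1.1])` uses only a handful of calibration values, in the spirit of the 5-10 sample calibration mentioned under Core Contributions; the numbers here are invented for illustration and are not results from the paper.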