
Open the Oyster: Empirical Evaluation and Improvement of Code Reasoning Confidence in LLMs

Wang, Hu, Chen et al.

With the widespread application of large language models (LLMs) in the field of code intelligence, increasing attention has been paid to the reliability and controllability of their outputs in code reasoning tasks. Confidence estimation serves as an effective and convenient approach for evaluating these aspects. This paper proposes a confidence analysis and enhancement framework for LLMs tailored to code reasoning tasks. We conduct a comprehensive empirical study on the confidence reliability of mainstream LLMs across different tasks, and further evaluate the effectiveness of techniques such as prompt strategy optimisation and mathematical calibration (e.g., Platt Scaling) in improving confidence reliability. Our results show that DeepSeek-Reasoner achieves the best performance across various tasks, outperforming other models by up to $0.680$, $0.636$, and $13.652$ in terms of ECE, Brier Score, and Performance Score, respectively. The hybrid strategy combining the reassess prompt strategy and Platt Scaling achieves improvements of up to $0.541$, $0.628$, and $15.084$ over the original performance in the aforementioned three metrics. These results indicate that models with reasoning capabilities demonstrate superior confidence reliability, and that the hybrid strategy is the most effective in enhancing the confidence reliability of various models. Meanwhile, we elucidate the impact of different task complexities, model scales, and strategies on confidence performance, and highlight that the confidence of current LLMs in complex reasoning tasks still has considerable room for improvement. This study not only provides a research foundation and technical reference for the application of confidence in LLM-assisted software engineering, but also points the way for future optimisation and engineering deployment of confidence mechanisms.


Basic Information

Paper ID: 2511.02197
Title: Open the Oyster: Empirical Evaluation and Improvement of Code Reasoning Confidence in LLMs
Authors: Shufan Wang, Xing Hu, Junkai Chen, Zhiyuan Pan, Xin Xia
Categories: cs.SE (Software Engineering), cs.AI (Artificial Intelligence)
Published: November 4, 2025
Paper link: https://arxiv.org/abs/2511.02197

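Confidence calibration: current LLMs may be overconfident or underconfident in code reasoning
Difficulty assessing trustworthiness: developers struggle to judge how far a model's output can be trusted, which affects their decisions
Systematic bias: confidence behaviour differs significantly across models and across tasks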

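Research Importance

Practical value: in software engineering practice, developers need to know how trustworthy a model's output is in order to make informed decisions
Safety: a wrong prediction delivered with high confidence can lead to serious software defects
Efficiency: reliable confidence estimates help developers streamline their verification workflow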

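Limitations of Existing Methods

Scarce research: systematic studies of confidence reliability in code reasoning tasks are relatively rare
Insufficient evaluation: most existing work relies on objective metrics such as accuracy and does not quantify the model's own self-assessment
Limited improvement techniques: effective techniques for improving the confidence reliability of LLMs in code reasoning are lacking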

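Core Contributions

A systematic analysis framework: builds a framework for analysing LLM confidence reliability in code reasoning tasks and carries out a comprehensive quantitative empirical study
Evaluation of improvement techniques: systematically evaluates prompt strategy optimisation and mathematical calibration, showing where they apply and where they fall short across models and tasks
In-depth analysis of influencing factors: analyses how confidence reliability affects real software engineering applications and offers actionable recommendations for optimising and deploying LLM confidence mechanisms
Empirical findings: models with reasoning capabilities show better confidence reliability, and the hybrid strategy is the most effective at improving confidence reliability across models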

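Empirical study: prompt the LLMs to produce an answer for each test case together with a confidence score
Prompt strategy adjustment: regenerate the confidence scores under different prompt strategies
Mathematical calibration: apply mathematical methods to the confidence scores produced by the LLMs
Metric computation: compute metrics to assess the reliability of each type of confidence score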

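Confidence Generation Strategies

Intrinsic Confidence: the confidence score the model produces directly
Reassess Strategy: a self-doubt prompt asks the model to re-evaluate its confidence
Reflective Strategy: a separate reflection model evaluates the confidence of the main model's answer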
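To make the three strategies concrete, here is a minimal sketch of how the prompts might be assembled and how a confidence score might be parsed from a reply. The prompt wording, the `Confidence: <number>` output convention, and all function names are hypothetical; the paper's actual templates are not reproduced in this summary.

```python
import re
from typing import Optional


def intrinsic_prompt(task: str) -> str:
    """Intrinsic confidence: answer and confidence in one turn (hypothetical wording)."""
    return (
        f"{task}\n"
        "Give your answer, then on a new line write "
        "'Confidence: <number between 0 and 1>'."
    )


def reassess_prompt(task: str, answer: str, confidence: float) -> str:
    """Reassess strategy: ask the same model to doubt its earlier score."""
    return (
        f"{task}\n"
        f"Your earlier answer was: {answer}\n"
        f"You reported confidence {confidence:.2f}. "
        "Assume you may have made a mistake, re-check the reasoning, and "
        "write 'Confidence: <number between 0 and 1>'."
    )


def reflective_prompt(task: str, answer: str) -> str:
    """Reflective strategy: this prompt is sent to a separate reviewer model."""
    return (
        f"{task}\n"
        f"Another model answered: {answer}\n"
        "Judge how likely this answer is to be correct and "
        "write 'Confidence: <number between 0 and 1>'."
    )


def parse_confidence(response: str) -> Optional[float]:
    """Extract the score after 'Confidence:' and clamp it to [0, 1]."""
    match = re.search(r"Confidence:\s*([0-9]*\.?[0-9]+)", response)
    if match is None:
        return None
    return min(1.0, max(0.0, float(match.group(1))))
```

Only the routing differs between the strategies: the first two prompts go to the model under test, while the reflective prompt goes to a second model.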

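Mathematical Calibration Method

Platt Scaling is used for calibration:

$$p'_{ij} = \frac{1}{1 + \exp\left(-(A \cdot p_{ij} + B)\right)}$$

where $A$ and $B$ are parameters fitted by minimising the negative log-likelihood on the calibration data.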
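The fit can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: the optimiser (batch gradient descent), the learning rate, the step count, and the names `fit_platt` and `apply_platt` are all choices made for this example.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_platt(confidences, correct, lr=0.5, steps=20000):
    """Fit A and B by gradient descent on the mean negative log-likelihood.

    confidences: raw scores p in [0, 1]
    correct:     1 if the answer was right, 0 otherwise
    """
    a, b = 1.0, 0.0
    n = len(confidences)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for p, y in zip(confidences, correct):
            err = sigmoid(a * p + b) - y  # d(NLL)/dz for one sample
            grad_a += err * p
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def apply_platt(p: float, a: float, b: float) -> float:
    """Map a raw confidence to its calibrated value."""
    return sigmoid(a * p + b)


# Toy data: the model says 0.9 but is right half the time,
# and says 0.6 but is right one time in five.
raw = [0.9] * 10 + [0.6] * 10
hit = [1] * 5 + [0] * 5 + [1] * 2 + [0] * 8
A, B = fit_platt(raw, hit)
```

On this toy data the calibrated values move towards the observed accuracy in each group, which is the behaviour Platt Scaling is meant to produce for an overconfident model.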

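Technical Innovations

Multi-dimensional evaluation: combines ECE, Brier Score, and Performance Score to assess confidence reliability
Hybrid optimisation strategy: combines prompt strategy optimisation with mathematical calibration so that the two reinforce each other
Task-specific analysis: fine-grained analysis of code reasoning tasks of differing complexity
Cross-validated calibration: uses 5-fold cross-validation to guard against overfitting and keep the results statistically valid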

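Experimental Setup

Datasets

REval: 3,152 test points covering 4 subtasks
- Code Coverage Prediction (CCP)
- Program State Prediction (PSP)
- Execution Path Prediction (EPP)
- Output Prediction (OP)
CRUXEval: 800 standalone Python functions covering 2 subtasks
- Input prediction (CRUXEval-I)
- Output prediction (CRUXEval-O)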

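Evaluation Metrics

Expected Calibration Error (ECE):

$$E_i = \frac{1}{|T_i|} \sum_{j} \left|\delta_{ij} - p_{ij}\right|$$

Brier Score (BS):

$$B_i = \frac{1}{|T_i|} \sum_{j} \left(\delta_{ij} - p_{ij}\right)^2$$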
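Both metrics, as written above, are simple averages over the test points. The sketch below follows those two formulas literally, assuming that δ is a 0/1 correctness indicator and p is the reported confidence (the summary does not define the symbols). Note that this ECE is a per-sample mean absolute gap, not the binned estimator often used elsewhere. The function names are illustrative.

```python
def ece(correct, confidences):
    """Mean absolute gap between correctness (0/1) and confidence."""
    return sum(abs(d - p) for d, p in zip(correct, confidences)) / len(correct)


def brier_score(correct, confidences):
    """Mean squared gap between correctness (0/1) and confidence."""
    return sum((d - p) ** 2 for d, p in zip(correct, confidences)) / len(correct)


# Four test points: three correct answers, one wrong answer given at 0.8.
delta = [1, 0, 1, 1]
conf = [0.9, 0.8, 0.6, 1.0]
e = ece(delta, conf)          # (0.1 + 0.8 + 0.4 + 0.0) / 4
b = brier_score(delta, conf)  # (0.01 + 0.64 + 0.16 + 0.0) / 4
```

Lower is better for both; the single confident mistake dominates each average, and the squared term in the Brier Score penalises it more heavily relative to the small errors.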

Performance Score (PS):

$$P_i = \frac{B^0_i - B_i}{B^0_i}$$
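The summary does not say what $B^0_i$ is. The sketch below assumes it is a reference Brier Score obtained by always predicting the empirical accuracy, as in the usual Brier skill score, so that $B^0 = a(1 - a)$ for accuracy $a$. Treat that choice, and the function name, as assumptions rather than the paper's definition.

```python
def performance_score(correct, confidences):
    """(B0 - B) / B0, with B0 assumed to be the constant-accuracy baseline."""
    n = len(correct)
    brier = sum((d - p) ** 2 for d, p in zip(correct, confidences)) / n
    accuracy = sum(correct) / n
    baseline = accuracy * (1.0 - accuracy)  # assumed definition of B0
    if baseline == 0:
        raise ValueError("baseline is zero when all answers are right or all wrong")
    return (baseline - brier) / baseline


delta = [1, 0, 1, 1]
ps_perfect = performance_score(delta, [1.0, 0.0, 1.0, 1.0])  # B = 0
ps_flat = performance_score(delta, [0.75] * 4)               # B = B0
ps_bad = performance_score(delta, [0.9, 0.8, 0.6, 1.0])      # B > B0
```

Under this assumption a positive score means the confidences carry more information than the bare accuracy rate, zero means no better than the baseline, and a negative score means worse.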

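Compared Methods

Representative mainstream LLMs were selected:

Reasoning vs. non-reasoning: DeepSeek-R1 vs. DeepSeek-V3
Different scales: Qwen3 series (1.7B, 14B, 32B)
Open-source vs. closed-source: DeepSeek/Qwen3 vs. GPT-3.5-Turbo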

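Implementation Details

Temperature is set to 0 to keep results stable
A single standardised prompt template is used throughout
Platt Scaling calibration uses 5-fold cross-validation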
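The cross-validated calibration step can be sketched as follows: each fold is calibrated with parameters fitted on the other four, so no score is calibrated by a model that saw its own label. The fold assignment (index modulo k, no shuffling), the gradient-descent fit, and the function names are assumptions made for this illustration.

```python
import math


def _sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def _fit_platt(ps, ys, lr=0.5, steps=2000):
    """Fit Platt parameters A, B by gradient descent on the mean NLL."""
    a, b = 1.0, 0.0
    n = len(ps)
    for _ in range(steps):
        grad_a = sum((_sigmoid(a * p + b) - y) * p for p, y in zip(ps, ys))
        grad_b = sum(_sigmoid(a * p + b) - y for p, y in zip(ps, ys))
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def cross_validated_platt(confidences, correct, k=5):
    """Return out-of-fold calibrated confidences, one per input score."""
    n = len(confidences)
    calibrated = [0.0] * n
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        held_out = [i for i in range(n) if i % k == fold]
        a, b = _fit_platt(
            [confidences[i] for i in train],
            [correct[i] for i in train],
        )
        for i in held_out:
            calibrated[i] = _sigmoid(a * confidences[i] + b)
    return calibrated


# Toy data: the model always reports 0.9 but is right only half the time.
raw = [0.9] * 20
hit = [1 if i % 2 == 0 else 0 for i in range(20)]
out_of_fold = cross_validated_platt(raw, hit)
```

In practice the data would normally be shuffled before the folds are assigned; the deterministic split here just keeps the example reproducible.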

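DeepSeek-Reasoner performs best: on the CCP task its ECE is only 0.066, well below DeepSeek-Chat (0.143), Qwen3-1.7B (0.231), and GPT-3.5-Turbo (0.338)
Clear advantage from reasoning ability: DeepSeek-Reasoner beats DeepSeek-Chat on every metric, most notably on the CRUXEval tasks
Open-source models overtake closed-source: mainstream open-source models already surpass GPT-3.5-Turbo in confidence reliability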