Open the Oyster: Empirical Evaluation and Improvement of Code Reasoning Confidence in LLMs
Wang, Hu, Chen et al.
With the widespread application of large language models (LLMs) in the field of code intelligence, increasing attention has been paid to the reliability and controllability of their outputs in code reasoning tasks. Confidence estimation serves as an effective and convenient approach for evaluating these aspects. This paper proposes a confidence analysis and enhancement framework for LLMs tailored to code reasoning tasks. We conduct a comprehensive empirical study on the confidence reliability of mainstream LLMs across different tasks, and further evaluate the effectiveness of techniques such as prompt strategy optimisation and mathematical calibration (e.g., Platt Scaling) in improving confidence reliability. Our results show that DeepSeek-Reasoner achieves the best performance across various tasks, outperforming other models by up to $0.680$, $0.636$, and $13.652$ in terms of ECE, Brier Score, and Performance Score, respectively. The hybrid strategy combining the reassess prompt strategy and Platt Scaling achieves improvements of up to $0.541$, $0.628$, and $15.084$ over the original performance in the aforementioned three metrics. These results indicate that models with reasoning capabilities demonstrate superior confidence reliability, and that the hybrid strategy is the most effective in enhancing the confidence reliability of various models. Meanwhile, we elucidate the impact of different task complexities, model scales, and strategies on confidence performance, and highlight that the confidence of current LLMs in complex reasoning tasks still has considerable room for improvement. This study not only provides a research foundation and technical reference for the application of confidence in LLM-assisted software engineering, but also points the way for future optimisation and engineering deployment of confidence mechanisms.
With the widespread application of large language models (LLMs) in code intelligence, the reliability and controllability of their outputs in code reasoning tasks have attracted increasing attention. Confidence estimation is an effective and convenient way to evaluate both. This paper proposes a framework for analyzing and enhancing LLM confidence in code reasoning tasks. It presents a comprehensive empirical study of the confidence reliability of mainstream LLMs across different tasks, and evaluates how well techniques such as prompt strategy optimization and mathematical calibration (e.g., Platt Scaling) improve that reliability.
Systematic Analysis Framework: Constructs a confidence reliability analysis framework for code reasoning tasks and conducts comprehensive quantitative empirical research
Evaluation of Improvement Techniques: Systematically evaluates the effectiveness of prompt strategy optimization and mathematical calibration methods, revealing their applicability and limitations across different models and tasks
In-depth Analysis of Influencing Factors: Provides deep analysis of confidence reliability's impact on practical software engineering applications and offers feasible recommendations for optimizing LLM confidence mechanisms and engineering deployment
Empirical Findings: Discovers that models with reasoning capabilities demonstrate superior confidence reliability, and hybrid strategies are most effective in improving confidence reliability across various models
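The mathematical calibration mentioned above, Platt Scaling, fits a two-parameter sigmoid that maps a raw score to a calibrated probability. The following is a minimal sketch, not the paper's implementation: it assumes the raw score is the model's stated confidence in [0, 1] and fits the parameters by plain gradient descent on the log loss. The names fit_platt and apply_platt, the learning rate, and the step count are illustrative choices.

```python
import math


def _sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_platt(scores, labels, lr=1.0, steps=5000):
    """Fit a, b so that sigmoid(a * score + b) approximates P(correct | score).

    scores: raw confidences on a held-out calibration set (assumed in [0, 1]).
    labels: 1 if the corresponding answer was correct, else 0.
    """
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = 0.0
        grad_b = 0.0
        for s, y in zip(scores, labels):
            err = _sigmoid(a * s + b) - y  # derivative of log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def apply_platt(score, a, b):
    """Map a raw confidence to its calibrated value."""
    return _sigmoid(a * score + b)


if __name__ == "__main__":
    # Hypothetical calibration data: an overconfident model that states 0.9
    # but is right 70% of the time, and states 0.6 but is right 30% of the time.
    scores = [0.9] * 10 + [0.6] * 10
    labels = [1] * 7 + [0] * 3 + [1] * 3 + [0] * 7
    a, b = fit_platt(scores, labels)
    print(apply_platt(0.9, a, b), apply_platt(0.6, a, b))
```

In practice the parameters would be fitted on a calibration split and applied to a separate test split, so that the reported calibration metrics are not measured on the data used for fitting.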
Code reasoning tasks require models to infer code behavior through syntactic, semantic, and logical analysis without executing the program, including input-output behavior, runtime behavior, branch paths, or variable values.
Confidence is defined as the model's subjective probability assessment of its output correctness. For model M, given input x and the set of all correct outputs Y, the model produces output y and assigns confidence p(y|x) ∈ [0, 1].
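Given this definition, the reliability of a set of confidences can be scored against 0/1 correctness labels. The sketch below shows the standard forms of two metrics the paper reports, the Brier Score and the Expected Calibration Error (ECE). The equal-width binning and the default of 10 bins are assumptions, since the binning scheme used in the paper is not stated here, and the function names are illustrative.

```python
def brier_score(confidences, correct):
    """Mean squared gap between stated confidence and 0/1 correctness."""
    pairs = list(zip(confidences, correct))
    return sum((p - c) ** 2 for p, c in pairs) / len(pairs)


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins.

    Each bin contributes |accuracy - mean confidence|, weighted by the
    fraction of samples that fall into it.
    """
    bins = [[] for _ in range(n_bins)]
    for p, c in zip(confidences, correct):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes into the last bin
        bins[idx].append((p, c))

    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(p for p, _ in items) / len(items)
        accuracy = sum(c for _, c in items) / len(items)
        ece += (len(items) / total) * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # Hypothetical example: four predictions with their stated confidences
    # and whether each prediction turned out to be correct.
    confidences = [0.95, 0.95, 0.65, 0.65]
    correct = [1, 0, 1, 1]
    print(brier_score(confidences, correct))
    print(expected_calibration_error(confidences, correct))
```

For both metrics lower is better: a perfectly calibrated and perfectly accurate model would score 0 on each.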
DeepSeek-Reasoner Performs Best: ECE of only 0.066 on the CCP task (lower is better), compared with 0.143 for DeepSeek-Chat, 0.231 for Qwen3-1.7B, and 0.338 for GPT-3.5-Turbo
Reasoning Capability Advantage Evident: DeepSeek-Reasoner outperforms DeepSeek-Chat on all metrics, particularly on CRUXEval tasks
Open-source Models Surpass Closed-source: Mainstream open-source models have exceeded GPT-3.5-Turbo in confidence reliability
The paper cites important works in related fields, including:
Brier (1950): Classical work on probability prediction verification
Guo et al. (2017): Important research on modern neural network calibration
Jiang et al. (2021): Pioneering work on LLM confidence calibration
Spiess et al. (2024): Related research on LLM confidence in code tasks
Summary: This empirical paper systematically examines the confidence reliability of LLMs in code reasoning tasks. It combines a structured analysis framework with experiments across multiple models and tasks, and its findings on reasoning models and on the hybrid calibration strategy are directly relevant to AI-assisted software engineering.