Open the Oyster: Empirical Evaluation and Improvement of Code Reasoning Confidence in LLMs

Wang, Hu, Chen et al.

With the widespread application of large language models (LLMs) in the field of code intelligence, increasing attention has been paid to the reliability and controllability of their outputs in code reasoning tasks. Confidence estimation serves as an effective and convenient approach for evaluating these aspects. This paper proposes a confidence analysis and enhancement framework for LLMs tailored to code reasoning tasks. We conduct a comprehensive empirical study on the confidence reliability of mainstream LLMs across different tasks, and further evaluate the effectiveness of techniques such as prompt strategy optimisation and mathematical calibration (e.g., Platt Scaling) in improving confidence reliability. Our results show that DeepSeek-Reasoner achieves the best performance across various tasks, outperforming other models by up to $0.680$, $0.636$, and $13.652$ in terms of ECE, Brier Score, and Performance Score, respectively. The hybrid strategy combining the reassess prompt strategy and Platt Scaling achieves improvements of up to $0.541$, $0.628$, and $15.084$ over the original performance in the aforementioned three metrics. These results indicate that models with reasoning capabilities demonstrate superior confidence reliability, and that the hybrid strategy is the most effective in enhancing the confidence reliability of various models. Meanwhile, we elucidate the impact of different task complexities, model scales, and strategies on confidence performance, and highlight that the confidence of current LLMs in complex reasoning tasks still has considerable room for improvement. This study not only provides a research foundation and technical reference for the application of confidence in LLM-assisted software engineering, but also points the way for future optimisation and engineering deployment of confidence mechanisms.

Basic Information

Paper ID: 2511.02197
Title: Open the Oyster: Empirical Evaluation and Improvement of Code Reasoning Confidence in LLMs
Authors: Shufan Wang, Xing Hu, Junkai Chen, Zhiyuan Pan, Xin Xia
Categories: cs.SE (Software Engineering), cs.AI (Artificial Intelligence)
Publication Date: November 4, 2025
Paper Link: https://arxiv.org/abs/2511.02197

Abstract

As Large Language Models (LLMs) are widely applied in code intelligence, the reliability and controllability of their outputs in code reasoning tasks have attracted increasing attention. Confidence estimation is an effective and convenient way to evaluate both. This paper proposes a framework for analyzing and enhancing LLM confidence in code reasoning tasks. It presents a comprehensive empirical study of the confidence reliability of mainstream LLMs across different tasks and evaluates how well techniques such as prompt strategy optimization and mathematical calibration (e.g., Platt Scaling) improve that reliability.

Research Background and Motivation

Problem Definition

This research primarily addresses the confidence reliability problem of LLMs in code reasoning tasks, specifically including:

Confidence Calibration Problem: Current LLMs may exhibit overconfidence or underconfidence in code reasoning
Trustworthiness Assessment Difficulty: Developers struggle to determine the credibility of model outputs, affecting decision-making
Systematic Bias: Significant variations exist in confidence performance across different models and tasks

Research Significance

Practical Value: In software engineering practice, developers need to understand the trustworthiness of model outputs to make informed decisions
Safety Considerations: Incorrect high-confidence predictions may lead to serious software defects
Efficiency Enhancement: Reliable confidence estimation can help developers optimize verification processes

Limitations of Existing Methods

Research Scarcity: Systematic research on confidence reliability for code reasoning tasks is relatively sparse
Insufficient Evaluation: Most existing work relies on objective metrics such as accuracy, overlooking quantification of model self-awareness
Limited Improvement Techniques: Lack of effective technical means to enhance LLM confidence reliability in code reasoning

Core Contributions

Systematic Analysis Framework: Constructs a confidence reliability analysis framework for code reasoning tasks and conducts comprehensive quantitative empirical research
Evaluation of Improvement Techniques: Systematically evaluates the effectiveness of prompt strategy optimization and mathematical calibration methods, revealing their applicability and limitations across different models and tasks
In-depth Analysis of Influencing Factors: Provides deep analysis of confidence reliability's impact on practical software engineering applications and offers feasible recommendations for optimizing LLM confidence mechanisms and engineering deployment
Empirical Findings: Discovers that models with reasoning capabilities demonstrate superior confidence reliability, and hybrid strategies are most effective in improving confidence reliability across various models

Methodology Details

Task Definition

Code reasoning tasks require models to infer code behavior through syntactic, semantic, and logical analysis without executing the program, including input-output behavior, runtime behavior, branch paths, or variable values.

Confidence is defined as the model's subjective probability assessment of its output correctness. For model M, given input x and the set of all correct outputs Y, the model produces output y and assigns confidence p(y|x) ∈ [0, 1].

Model Architecture

Four-Step Method Framework

Empirical Study: Prompt LLMs to generate test case answers and corresponding confidence scores
Prompt Strategy Adjustment: Regenerate confidence scores using different prompt strategies
Mathematical Calibration: Apply mathematical methods to process LLM-generated confidence scores
Metric Computation: Calculate various metrics to evaluate the reliability of different types of confidence scores

Confidence Generation Strategies

Intrinsic Confidence: Confidence scores directly generated by the model
Reassess Strategy: Prompts the model to re-evaluate confidence through self-doubt
Reflective Strategy: Uses an independent reflection model to assess the main model's answer confidence
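
The three strategies differ mainly in what the model is shown when it is asked for a score. The sketch below illustrates that difference in Python. The prompt wording, the function names, and the `Confidence: <number>` reply format are all hypothetical; the paper's actual templates are not reproduced in this summary.

```python
import re

def intrinsic_prompt(task):
    # Hypothetical wording: answer and confidence are requested in one pass.
    return (f"{task}\n\nGive your answer, then on a new line write "
            "'Confidence: <number between 0 and 1>'.")

def reassess_prompt(task, answer, confidence):
    # Hypothetical wording: the same model is asked to doubt its earlier answer.
    return (f"{task}\n\nYou previously answered: {answer}\n"
            f"You reported confidence {confidence:.2f}. Consider how this answer "
            "could be wrong, then write 'Confidence: <number between 0 and 1>'.")

def reflective_prompt(task, answer):
    # Hypothetical wording: a separate reflection model judges the answer.
    return (f"{task}\n\nAnother model answered: {answer}\n"
            "Judge how likely this answer is correct and write "
            "'Confidence: <number between 0 and 1>'.")

def parse_confidence(reply):
    """Extract the score after 'Confidence:' and clamp it to [0, 1]."""
    m = re.search(r"confidence\s*[:=]\s*(\d*\.?\d+)", reply, re.IGNORECASE)
    if m is None:
        return None  # the reply did not follow the assumed format
    return min(max(float(m.group(1)), 0.0), 1.0)

score = parse_confidence("Answer: [1, 2]\nConfidence: 0.85")
```

Returning `None` for an unparseable reply keeps format failures separate from genuinely low confidence, so they can be counted or retried rather than silently scored as 0.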

Mathematical Calibration Methods

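Employs Platt Scaling for calibration:

$$p'_{ij} = \frac{1}{1 + \exp\left(-(A \cdot p_{ij} + B)\right)}$$

where $p_{ij}$ is the original confidence score, $p'_{ij}$ is the calibrated score, and $A$ and $B$ are parameters optimized by minimizing negative log-likelihood on calibration data.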
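
As a rough illustration of this calibration step, the sketch below fits A and B in plain Python. Only the sigmoid form and the negative log-likelihood objective come from the text. The optimizer (fixed-step gradient descent), the learning rate, the step count, and the toy data are assumptions made for the example, not the authors' implementation.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_platt(confidences, correct, lr=0.5, steps=5000):
    """Fit A and B by gradient descent on the mean negative log-likelihood."""
    a, b = 1.0, 0.0
    n = len(confidences)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for p, y in zip(confidences, correct):
            err = sigmoid(a * p + b) - y  # derivative of the NLL w.r.t. the logit
            grad_a += err * p
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def apply_platt(confidences, a, b):
    return [sigmoid(a * p + b) for p in confidences]

# Made-up toy data: high raw confidence, but only half the answers are correct.
raw = [0.99, 0.95, 0.90, 0.85, 0.80, 0.99, 0.95, 0.90, 0.70, 0.60]
hit = [1, 1, 0, 1, 0, 1, 0, 1, 0, 0]
A, B = fit_platt(raw, hit)
calibrated = apply_platt(raw, A, B)
```

On this toy data the calibrated scores should average close to the observed accuracy of 0.5, whereas the raw scores average about 0.86. In practice the parameters would be fitted on held-out calibration data rather than on the samples being scored.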

Technical Innovations

Multi-dimensional Evaluation System: Combines ECE, Brier Score, and Performance Score metrics for comprehensive confidence reliability assessment
Hybrid Optimization Strategy: Integrates prompt strategy optimization with mathematical calibration for synergistic improvement
Task-Specific Analysis: Provides fine-grained analysis for code reasoning tasks of different complexity levels
Cross-validation Calibration: Employs 5-fold cross-validation to prevent overfitting and ensure statistical validity

Experimental Setup

Datasets

REval: Contains 3,152 test points covering 4 subtasks
- Code Coverage Prediction (CCP)
- Program State Prediction (PSP)
- Execution Path Prediction (EPP)
- Output Prediction (OP)
CRUXEval: Contains 800 independent Python functions covering 2 subtasks
- Input Prediction (CRUXEval-I)
- Output Prediction (CRUXEval-O)

Evaluation Metrics

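Expected Calibration Error (ECE):

$$E_i = \frac{1}{|T_i|} \sum_{j \in T_i} \left| \delta_{ij} - p_{ij} \right|$$

Brier Score (BS):

$$B_i = \frac{1}{|T_i|} \sum_{j \in T_i} \left( \delta_{ij} - p_{ij} \right)^2$$

Performance Score (PS):

$$P_i = \frac{B_i^0 - B_i}{B_i^0}$$

Here $T_i$ is the set of test cases being evaluated, $\delta_{ij} \in \{0, 1\}$ indicates whether the answer to the $j$-th test case in $T_i$ is correct, $p_{ij}$ is the confidence assigned to that answer, and $B_i^0$ is the reference Brier score against which $B_i$ is compared. Lower ECE and Brier Score are better; a higher Performance Score is better.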
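
The three metrics above can be computed in a few lines of Python. The sketch follows the formulas as stated in this summary (ECE as a mean absolute gap, not the binned variant). The reference score B⁰ is not defined here, so `base_rate_reference` is an assumption: it uses the Brier score of a predictor that always outputs the observed accuracy. The sample data are made up.

```python
def ece(confidences, correct):
    # Mean absolute gap between the correctness indicator and the confidence.
    return sum(abs(d - p) for p, d in zip(confidences, correct)) / len(correct)

def brier(confidences, correct):
    # Mean squared gap between the correctness indicator and the confidence.
    return sum((d - p) ** 2 for p, d in zip(confidences, correct)) / len(correct)

def performance_score(b, b_ref):
    # Relative improvement of b over the reference Brier score b_ref.
    return (b_ref - b) / b_ref

def base_rate_reference(correct):
    # ASSUMPTION: B0 is the Brier score of always predicting the base rate r,
    # which works out to r * (1 - r). The paper may define B0 differently.
    r = sum(correct) / len(correct)
    return r * (1 - r)

conf = [0.9, 0.8, 0.6, 0.9]   # made-up confidences
hit = [1, 1, 0, 0]            # 1 = answer was correct
e = ece(conf, hit)
b = brier(conf, hit)
ps = performance_score(b, base_rate_reference(hit))
```

For this toy input the values are approximately ECE = 0.45, Brier = 0.305, and PS = -0.22. The negative Performance Score means the confidences are less informative than the constant reference predictor.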

Comparison Methods

Selected representative mainstream LLMs:

Reasoning vs Non-reasoning: DeepSeek-R1 vs DeepSeek-V3
Different Scales: Qwen3 series (1.7B, 14B, 32B)
Open-source vs Closed-source: DeepSeek/Qwen3 vs GPT-3.5-Turbo

Implementation Details

Temperature parameter set to 0 for result stability
Unified standardized prompt templates employed
5-fold cross-validation used for Platt Scaling calibration
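
One way to read the 5-fold setup is that each confidence score is calibrated by parameters fitted on the other four folds, so no sample is scored by a calibrator that saw it. The Python sketch below shows that out-of-fold pattern. The function names, the shuffling seed, and the toy data are assumptions, and `fit_base_rate` is a deliberately trivial stand-in calibrator; a Platt Scaling fit would be passed in its place.

```python
import random

def kfold_indices(n, k=5, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_calibrate(confidences, correct, fit, k=5, seed=0):
    """fit(train_conf, train_correct) must return a function p -> calibrated p."""
    out = [None] * len(confidences)
    for held_out in kfold_indices(len(confidences), k, seed):
        held = set(held_out)
        train = [i for i in range(len(confidences)) if i not in held]
        calibrate = fit([confidences[i] for i in train],
                        [correct[i] for i in train])
        for i in held_out:
            # Each score is calibrated by a model that never saw it.
            out[i] = calibrate(confidences[i])
    return out

def fit_base_rate(train_conf, train_correct):
    # Stand-in calibrator (assumption): ignore the raw score and
    # predict the accuracy observed on the training folds.
    rate = sum(train_correct) / len(train_correct)
    return lambda p: rate

conf = [0.9, 0.8, 0.95, 0.7, 0.85, 0.6, 0.99, 0.75, 0.9, 0.65]  # made-up data
hit = [1, 0, 1, 0, 1, 0, 1, 1, 0, 0]
oof = cross_calibrate(conf, hit, fit_base_rate)
```

Keeping the calibrator as a `fit` argument means the fold logic can be checked on its own and reused with any calibration method.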

Experimental Results

Main Results

Inter-model Comparison

DeepSeek-Reasoner Performs Optimally: ECE of only 0.066 on CCP task, significantly outperforming DeepSeek-Chat (0.143), Qwen3-1.7B (0.231), and GPT-3.5-Turbo (0.338)
Reasoning Capability Advantage Evident: DeepSeek-Reasoner outperforms DeepSeek-Chat on all metrics, particularly on CRUXEval tasks
Open-source Models Surpass Closed-source: Mainstream open-source models have exceeded GPT-3.5-Turbo in confidence reliability

Task Complexity Impact

Better Performance on Simple Tasks: CCP and OP tasks generally show superior confidence reliability compared to PSP and EPP
Input Prediction More Challenging: CRUXEval-I typically proves more difficult than CRUXEval-O

Ablation Studies

Prompt Strategy Optimization Effects

Limited Improvement: Reassess and reflective strategies did not bring systematic improvements for most models and tasks
High-performance Models Benefit More: DeepSeek-Reasoner and Qwen3-32B show clear improvements on specific tasks
Overconfidence Mitigation: Reassess strategy helps alleviate model overconfidence in certain cases
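
Overconfidence can be made concrete as the signed gap between mean confidence and accuracy. The Python sketch below is one simple way to measure it. This diagnostic is an illustration and is not claimed to be the paper's measure; the two sets of scores are made up to mimic confidences before and after a reassess-style prompt.

```python
def overconfidence_gap(confidences, correct):
    """Mean confidence minus accuracy; a positive value means overconfident."""
    n = len(correct)
    return sum(confidences) / n - sum(correct) / n

hit = [1, 0, 1, 0]                                         # accuracy 0.5
before = overconfidence_gap([0.95, 0.90, 0.90, 0.85], hit)  # made-up initial scores
after = overconfidence_gap([0.80, 0.55, 0.75, 0.50], hit)   # made-up reassessed scores
```

Here the gap shrinks from about 0.40 to about 0.15. A smaller positive gap indicates less overconfidence, although it says nothing about whether individual scores rank correct answers above incorrect ones.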

Mathematical Calibration Effects

Significant Universal Improvement: Platt Scaling brings significant improvements across all models and tasks
Systematic Bias Elimination: Effectively eliminates distribution discrepancies produced by different confidence generation methods
Negative to Positive Conversion: Multiple models' Performance Scores convert from negative to positive values
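
To see how a Performance Score can change sign, take the definition $P = (B^0 - B)/B^0$ with purely illustrative numbers (not taken from the paper): a reference Brier score $B^0 = 0.25$, a pre-calibration Brier score of $0.30$, and a post-calibration Brier score of $0.20$.

$$P^{\text{before}} = \frac{0.25 - 0.30}{0.25} = -0.20, \qquad P^{\text{after}} = \frac{0.25 - 0.20}{0.25} = 0.20$$

The score is negative whenever the model's Brier score is worse than the reference and becomes positive once calibration pushes it below the reference.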

Case Analysis

Using GPT-3.5-Turbo's performance on OP tasks as an example:

Pre-calibration: Severe confidence distribution bias with calibration curve deviating from ideal line
After Reassess Strategy: Calibration curve approaches ideal reference line
After Platt Scaling: Probability distribution and calibration curve highly align with ideal curve
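
A calibration curve of the kind described above is usually built by grouping confidence scores into bins and comparing each bin's mean confidence with its empirical accuracy. The Python sketch below shows that computation. The equal-width binning, the bin count, and the sample data are assumptions; the paper's plotting procedure is not described in this summary.

```python
def calibration_curve(confidences, correct, n_bins=10):
    """Return (mean confidence, accuracy, count) for each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, d in zip(confidences, correct):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, d))
    curve = []
    for items in bins:
        if items:
            mean_p = sum(p for p, _ in items) / len(items)
            acc = sum(d for _, d in items) / len(items)
            curve.append((mean_p, acc, len(items)))
    return curve

# Made-up scores: three answers near 0.95 confidence, two near 0.53.
curve = calibration_curve([0.95, 0.92, 0.98, 0.55, 0.52], [1, 0, 1, 1, 0])
```

Points on the ideal line have accuracy equal to mean confidence. In this toy input the high-confidence bin has mean confidence 0.95 but accuracy of about 0.67, so it would sit below the ideal line, which is the signature of overconfidence.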

Experimental Findings

Reasoning Capability is Key: Models with explicit reasoning capabilities demonstrate clear advantages in confidence reliability
Hybrid Strategy Most Effective: Combining reassess prompt strategy with Platt Scaling achieves optimal improvement
Limited Scale Effect: Gains in confidence reliability from increasing model scale plateau beyond a certain scale
Task Specificity Evident: Task complexity significantly affects confidence performance

Related Work

Confidence Calibration Research

Traditional Methods: Early research focused on confidence calibration in small neural models
LLM Applications: Recently extended to natural language understanding, factual question answering, arithmetic reasoning, and other domains
Code Domain: Spiess et al. investigated LLM confidence reliability in code generation tasks

LLMs in Software Engineering

Code Generation and Repair: Substantial research concentrated on code generation or repair tasks
Code Reasoning: Relatively emerging research direction with existing work primarily focusing on operational mechanisms and performance evaluation
Benchmarks: Multiple code reasoning benchmarks have emerged, such as CRUXEval, REval, CodeMind, etc.

Conclusions and Discussion

Main Conclusions

Significant Performance Differences: Current mainstream LLMs exhibit significant differences in confidence reliability on code reasoning tasks
Reasoning Capability Advantage: Models with reasoning capabilities (e.g., DeepSeek-Reasoner) perform best
Mathematical Calibration Effectiveness: Mathematical calibration methods such as Platt Scaling can systematically improve confidence reliability
Substantial Improvement Space: Current LLMs' confidence has not reached ideal reliability levels, particularly in complex reasoning tasks

Limitations

Benchmark-Reality Gap: Inevitable differences exist between benchmark datasets and real-world environments
Model Selection Constraints: Does not include rapidly evolving code-specific LLMs
Fixed Prompt Design: Uses unified standardized prompt design, potentially affecting result generalizability
Fixed Temperature Parameter: Temperature parameter fixed at 0 may overlook its potential performance impact

Future Directions

Confidence Generation Mechanisms: Investigate LLM confidence generation mechanisms in code reasoning tasks in depth
Dynamic Calibration Strategies: Develop adaptive calibration methods and interval partitioning techniques
Active Learning Integration: Deeply integrate confidence with active learning and risk control techniques
Practical Balance: Maintain confidence distribution discriminability and interpretability while improving overall reliability

In-depth Evaluation

Strengths

Significant Research Value: Fills the gap in confidence reliability research in the code reasoning domain
Systematic and Complete Methodology: Proposes a four-step systematic analysis framework with rigorous methodology
Sufficient Experimental Design: Covers multiple models, tasks, and improvement strategies with comprehensive experimental setup
Convincing Results: Validates conclusion reliability through multiple metrics and statistical methods
High Practical Value: Provides directly applicable technical guidance for software engineering practice

Limitations

Single Calibration Method: Primarily employs Platt Scaling without exploring effects of other calibration methods
Discriminability Loss: Mathematical calibration, while improving overall calibration, may reduce confidence discriminability
Missing Code-specific Models: Does not include code-specific models such as CodeLlama and StarCoder
Insufficient Dynamic Adaptability: Proposed methods are primarily static, lacking dynamic adaptability to different scenarios

Impact

Academic Contribution: Opens new application domains for LLM confidence research
Engineering Practice: Provides technical foundation for trustworthiness assessment in AI-assisted software development
Standard Setting: May promote establishment of confidence evaluation standards for code reasoning tasks
Subsequent Research: Provides important reference for in-depth research in related fields

Applicable Scenarios

Code Review: Helps developers assess trustworthiness of AI-generated code
Automated Testing: Provides confidence guidance in test case generation
Code Refactoring: Offers trustworthiness assessment for refactoring suggestions
Educational Training: Assists learners in understanding code logic in programming education

References

The paper cites important works in related fields, including:

Brier (1950): Classical work on probability prediction verification
Guo et al. (2017): Important research on modern neural network calibration
Jiang et al. (2021): Pioneering work on LLM confidence calibration
Spiess et al. (2024): Related research on LLM confidence in code tasks

Summary: This is a high-quality empirical paper that systematically explores the confidence reliability of LLMs in code reasoning tasks. Its methodology is rigorous, its experiments are comprehensive, and its conclusions have both theoretical and practical value, making it an important contribution to AI-assisted software engineering.