Open the Oyster: Empirical Evaluation and Improvement of Code Reasoning Confidence in LLMs

Wang, Hu, Chen et al.
With the widespread application of large language models (LLMs) in the field of code intelligence, increasing attention has been paid to the reliability and controllability of their outputs in code reasoning tasks. Confidence estimation serves as an effective and convenient approach for evaluating these aspects. This paper proposes a confidence analysis and enhancement framework for LLMs tailored to code reasoning tasks. We conduct a comprehensive empirical study on the confidence reliability of mainstream LLMs across different tasks, and further evaluate the effectiveness of techniques such as prompt strategy optimisation and mathematical calibration (e.g., Platt Scaling) in improving confidence reliability. Our results show that DeepSeek-Reasoner achieves the best performance across various tasks, outperforming other models by up to $0.680$, $0.636$, and $13.652$ in terms of ECE, Brier Score, and Performance Score, respectively. The hybrid strategy combining the reassess prompt strategy and Platt Scaling achieves improvements of up to $0.541$, $0.628$, and $15.084$ over the original performance in the aforementioned three metrics. These results indicate that models with reasoning capabilities demonstrate superior confidence reliability, and that the hybrid strategy is the most effective in enhancing the confidence reliability of various models. Meanwhile, we elucidate the impact of different task complexities, model scales, and strategies on confidence performance, and highlight that the confidence of current LLMs in complex reasoning tasks still has considerable room for improvement. This study not only provides a research foundation and technical reference for the application of confidence in LLM-assisted software engineering, but also points the way for future optimisation and engineering deployment of confidence mechanisms.

Basic Information

  • Paper ID: 2511.02197
  • Title: Open the Oyster: Empirical Evaluation and Improvement of Code Reasoning Confidence in LLMs
  • Authors: Shufan Wang, Xing Hu, Junkai Chen, Zhiyuan Pan, Xin Xia
  • Categories: cs.SE (Software Engineering), cs.AI (Artificial Intelligence)
  • Publication Date: November 4, 2025
  • Paper Link: https://arxiv.org/abs/2511.02197

Abstract

With the widespread application of Large Language Models (LLMs) in code intelligence, the reliability and controllability of their outputs in code reasoning tasks have attracted increasing attention. Confidence estimation is an effective and convenient way to evaluate both. This paper proposes a framework for analyzing and enhancing LLM confidence in code reasoning tasks. The study conducts a comprehensive empirical evaluation of the confidence reliability of mainstream LLMs across different tasks, and further evaluates how well techniques such as prompt strategy optimization and mathematical calibration (e.g., Platt Scaling) improve that reliability.

Research Background and Motivation

Problem Definition

This research primarily addresses the confidence reliability problem of LLMs in code reasoning tasks, specifically including:

  1. Confidence Calibration Problem: Current LLMs may exhibit overconfidence or underconfidence in code reasoning
  2. Trustworthiness Assessment Difficulty: Developers struggle to determine the credibility of model outputs, affecting decision-making
  3. Systematic Bias: Significant variations exist in confidence performance across different models and tasks

Research Significance

  1. Practical Value: In software engineering practice, developers need to understand the trustworthiness of model outputs to make informed decisions
  2. Safety Considerations: Incorrect high-confidence predictions may lead to serious software defects
  3. Efficiency Enhancement: Reliable confidence estimation can help developers optimize verification processes

Limitations of Existing Methods

  1. Research Scarcity: Systematic research on confidence reliability for code reasoning tasks is relatively sparse
  2. Insufficient Evaluation: Most existing work relies on objective metrics such as accuracy, overlooking quantification of model self-awareness
  3. Limited Improvement Techniques: Lack of effective technical means to enhance LLM confidence reliability in code reasoning

Core Contributions

  1. Systematic Analysis Framework: Constructs a confidence reliability analysis framework for code reasoning tasks and conducts comprehensive quantitative empirical research
  2. Evaluation of Improvement Techniques: Systematically evaluates the effectiveness of prompt strategy optimization and mathematical calibration methods, revealing their applicability and limitations across different models and tasks
  3. In-depth Analysis of Influencing Factors: Provides deep analysis of confidence reliability's impact on practical software engineering applications and offers feasible recommendations for optimizing LLM confidence mechanisms and engineering deployment
  4. Empirical Findings: Discovers that models with reasoning capabilities demonstrate superior confidence reliability, and hybrid strategies are most effective in improving confidence reliability across various models

Methodology Details

Task Definition

Code reasoning tasks require models to infer code behavior through syntactic, semantic, and logical analysis without executing the program, including input-output behavior, runtime behavior, branch paths, or variable values.

Confidence is defined as the model's subjective probability assessment of its output correctness. For model M, given input x and the set of all correct outputs Y, the model produces output y and assigns confidence p(y|x) ∈ [0, 1].

Model Architecture

Four-Step Method Framework

  1. Empirical Study: Prompt LLMs to generate test case answers and corresponding confidence scores
  2. Prompt Strategy Adjustment: Regenerate confidence scores using different prompt strategies
  3. Mathematical Calibration: Apply mathematical methods to process LLM-generated confidence scores
  4. Metric Computation: Calculate various metrics to evaluate the reliability of different types of confidence scores

Confidence Generation Strategies

  1. Intrinsic Confidence: Confidence scores directly generated by the model
  2. Reassess Strategy: Prompts the model to re-evaluate confidence through self-doubt
  3. Reflective Strategy: Uses an independent reflection model to assess the main model's answer confidence

Mathematical Calibration Methods

Employs Platt Scaling for calibration:

p'ᵢⱼ = 1/(1 + exp(-(A·pᵢⱼ + B)))

where A and B are parameters optimized by minimizing negative log-likelihood on calibration data.
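
As a minimal sketch of the fitting step, the code below estimates A and B by plain gradient descent on the mean negative log-likelihood. The optimizer, learning rate, step count, and function names are assumptions chosen for illustration; the source only states that the parameters minimize negative log-likelihood on calibration data.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_platt(confidences, labels, lr=1.0, steps=20000):
    """Fit A and B by gradient descent on the mean negative log-likelihood.

    ASSUMPTION: plain gradient descent with fixed hyperparameters; any
    optimizer that minimizes the same loss could be used instead.
    """
    a, b = 1.0, 0.0
    n = len(confidences)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for p, y in zip(confidences, labels):
            err = sigmoid(a * p + b) - y  # derivative of the loss w.r.t. A*p + B
            grad_a += err * p
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def apply_platt(p: float, a: float, b: float) -> float:
    """Calibrated confidence p' = 1 / (1 + exp(-(A*p + B)))."""
    return sigmoid(a * p + b)


# Toy calibration set: the model says 0.9 but is right 6 times out of 10,
# and says 0.5 but is right 2 times out of 10 (overconfident at both levels).
confs = [0.9] * 10 + [0.5] * 10
labels = [1] * 6 + [0] * 4 + [1] * 2 + [0] * 8
a, b = fit_platt(confs, labels)
print(round(apply_platt(0.9, a, b), 2))  # close to 0.6
print(round(apply_platt(0.5, a, b), 2))  # close to 0.2
```

On this toy data the calibrated scores move toward the observed accuracy at each confidence level, which is the effect the calibration step is meant to achieve.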

Technical Innovations

  1. Multi-dimensional Evaluation System: Combines ECE, Brier Score, and Performance Score metrics for comprehensive confidence reliability assessment
  2. Hybrid Optimization Strategy: Integrates prompt strategy optimization with mathematical calibration for synergistic improvement
  3. Task-Specific Analysis: Provides fine-grained analysis for code reasoning tasks of different complexity levels
  4. Cross-validation Calibration: Fits Platt Scaling with 5-fold cross-validation to limit overfitting of the calibration parameters

Experimental Setup

Datasets

  1. REval: Contains 3,152 test points covering 4 subtasks
    • Code Coverage Prediction (CCP)
    • Program State Prediction (PSP)
    • Execution Path Prediction (EPP)
    • Output Prediction (OP)
  2. CRUXEval: Contains 800 independent Python functions covering 2 subtasks
    • Input Prediction (CRUXEval-I)
    • Output Prediction (CRUXEval-O)

Evaluation Metrics

  1. Expected Calibration Error (ECE):
    Eᵢ = (1/|Tᵢ|) Σ |δᵢⱼ - pᵢⱼ|
    
  2. Brier Score (BS):
    Bᵢ = (1/|Tᵢ|) Σ (δᵢⱼ - pᵢⱼ)²
    
  3. Performance Score (PS):
    Pᵢ = (B⁰ᵢ - Bᵢ)/B⁰ᵢ
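
The three formulas can be computed directly from per-sample confidences and correctness labels. The sketch below reads δᵢⱼ as a 0/1 correctness indicator and pᵢⱼ as the reported confidence for sample j of task i, and follows the ECE formula as written above (a per-sample mean absolute gap, rather than the more common binned form). The summary does not define the baseline B⁰ᵢ, so `base_rate_brier` is an assumption: it uses the Brier score of a predictor that always outputs the task's accuracy.

```python
from typing import Sequence


def ece(confidences: Sequence[float], correct: Sequence[int]) -> float:
    """Mean absolute gap |delta - p|, following the formula as written above."""
    return sum(abs(d - p) for p, d in zip(confidences, correct)) / len(correct)


def brier(confidences: Sequence[float], correct: Sequence[int]) -> float:
    """Mean squared gap (delta - p)^2."""
    return sum((d - p) ** 2 for p, d in zip(confidences, correct)) / len(correct)


def base_rate_brier(correct: Sequence[int]) -> float:
    """ASSUMPTION: baseline B0 is the Brier score of always predicting the accuracy."""
    rate = sum(correct) / len(correct)
    return rate * (1.0 - rate)


def performance_score(b: float, b0: float) -> float:
    """Relative improvement over the baseline: (B0 - B) / B0."""
    if b0 == 0:
        raise ValueError("baseline Brier score is zero; score is undefined")
    return (b0 - b) / b0


# Four predictions, all reported at 0.9 confidence, three of them correct.
confs = [0.9, 0.9, 0.9, 0.9]
correct = [1, 1, 0, 1]
b = brier(confs, correct)
print(ece(confs, correct))                              # about 0.30
print(b)                                                # about 0.21
print(performance_score(b, base_rate_brier(correct)))   # about -0.12
```

Under this assumed baseline, the score is negative whenever the model's confidences are less informative than simply reporting the task accuracy, as in the toy example.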
    

Comparison Methods

Selected representative mainstream LLMs:

  • Reasoning vs Non-reasoning: DeepSeek-R1 (DeepSeek-Reasoner) vs DeepSeek-V3 (DeepSeek-Chat)
  • Different Scales: Qwen3 series (1.7B, 14B, 32B)
  • Open-source vs Closed-source: DeepSeek/Qwen3 vs GPT-3.5-Turbo

Implementation Details

  • Temperature parameter set to 0 for result stability
  • Unified standardized prompt templates employed
  • 5-fold cross-validation used for Platt Scaling calibration
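
The point of cross-validated calibration is that every confidence score is calibrated by parameters that never saw that sample's label. A minimal sketch follows; the round-robin fold assignment and the `fit`/`apply` callable interface are assumptions, since the summary does not say how the folds were built. The toy calibrator at the end stands in for Platt Scaling only to keep the example short.

```python
from typing import Callable, List, Sequence, TypeVar

Params = TypeVar("Params")


def cross_calibrate(
    confidences: Sequence[float],
    correct: Sequence[int],
    fit: Callable[[List[float], List[int]], Params],
    apply: Callable[[float, Params], float],
    k: int = 5,
) -> List[float]:
    """Return out-of-fold calibrated confidences.

    Each sample is calibrated with parameters fitted on the other k-1 folds.
    ASSUMPTION: folds are assigned round-robin by index.
    """
    n = len(confidences)
    calibrated = [0.0] * n
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        held_out = [i for i in range(n) if i % k == fold]
        params = fit([confidences[i] for i in train], [correct[i] for i in train])
        for i in held_out:
            calibrated[i] = apply(confidences[i], params)
    return calibrated


# Toy stand-in calibrator (not Platt Scaling): shift every score by the gap
# between accuracy and mean confidence on the training folds.
def fit_shift(confs: List[float], labels: List[int]) -> float:
    return sum(labels) / len(labels) - sum(confs) / len(confs)


def apply_shift(p: float, shift: float) -> float:
    return min(1.0, max(0.0, p + shift))


confs = [0.9] * 10
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
out = cross_calibrate(confs, labels, fit_shift, apply_shift, k=5)
print([round(p, 2) for p in out])  # every score moves from 0.9 toward 0.5
```

Passing `fit` and `apply` as arguments keeps the splitting logic independent of the calibration method, so the same loop could wrap a Platt Scaling fit.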

Experimental Results

Main Results

Inter-model Comparison

  • DeepSeek-Reasoner Performs Best: ECE of only 0.066 on the CCP task (lower is better), well ahead of DeepSeek-Chat (0.143), Qwen3-1.7B (0.231), and GPT-3.5-Turbo (0.338)
  • Reasoning Capability Advantage Evident: DeepSeek-Reasoner outperforms DeepSeek-Chat on all metrics, particularly on CRUXEval tasks
  • Open-source Models Surpass Closed-source: Mainstream open-source models have exceeded GPT-3.5-Turbo in confidence reliability

Task Complexity Impact

  • Better Performance on Simple Tasks: CCP and OP tasks generally show superior confidence reliability compared to PSP and EPP
  • Input Prediction More Challenging: CRUXEval-I typically proves more difficult than CRUXEval-O

Ablation Studies

Prompt Strategy Optimization Effects

  • Limited Improvement: Reassess and reflective strategies did not bring systematic improvements for most models and tasks
  • High-performance Models Benefit More: DeepSeek-Reasoner and Qwen3-32B show clear improvements on specific tasks
  • Overconfidence Mitigation: Reassess strategy helps alleviate model overconfidence in certain cases

Mathematical Calibration Effects

  • Significant Universal Improvement: Platt Scaling brings significant improvements across all models and tasks
  • Systematic Bias Elimination: Effectively eliminates distribution discrepancies produced by different confidence generation methods
  • Negative to Positive Conversion: Multiple models' Performance Scores convert from negative to positive values

Case Analysis

Using GPT-3.5-Turbo's performance on OP tasks as an example:

  • Pre-calibration: Severe confidence distribution bias with calibration curve deviating from ideal line
  • After Reassess Strategy: Calibration curve approaches ideal reference line
  • After Platt Scaling: Probability distribution and calibration curve highly align with ideal curve
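
The calibration curve referred to in this case analysis compares, bin by bin, the average reported confidence with the observed accuracy; the ideal line is where the two are equal. A minimal sketch of how such a curve can be computed is given below. Equal-width bins and the bin count are assumptions, since the summary does not describe how the paper's plots were binned.

```python
from typing import List, Sequence, Tuple


def calibration_curve(
    confidences: Sequence[float],
    correct: Sequence[int],
    n_bins: int = 10,
) -> List[Tuple[float, float, int]]:
    """Return (mean confidence, accuracy, count) for each non-empty bin.

    ASSUMPTION: equal-width bins over [0, 1].
    """
    totals = [[0.0, 0, 0] for _ in range(n_bins)]
    for p, d in zip(confidences, correct):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        totals[b][0] += p
        totals[b][1] += d
        totals[b][2] += 1
    return [(s / n, c / n, n) for s, c, n in totals if n]


# Two low-confidence and two high-confidence predictions.
curve = calibration_curve([0.95, 0.95, 0.15, 0.15], [1, 0, 0, 0])
for mean_conf, accuracy, count in curve:
    print(f"confidence {mean_conf:.2f}  accuracy {accuracy:.2f}  n={count}")
```

A bin whose accuracy is below its mean confidence sits under the ideal line and indicates overconfidence; a bin above the line indicates underconfidence.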

Experimental Findings

  1. Reasoning Capability is Key: Models with explicit reasoning capabilities demonstrate clear advantages in confidence reliability
  2. Hybrid Strategy Most Effective: Combining reassess prompt strategy with Platt Scaling achieves optimal improvement
  3. Limited Scale Effect: Confidence reliability gains from increased model scale plateau beyond a certain scale
  4. Task Specificity Evident: Different task complexity significantly impacts confidence performance

Related Work

Confidence Calibration Research

  • Traditional Methods: Early research focused on confidence calibration in small neural models
  • LLM Applications: Recently extended to natural language understanding, factual question answering, arithmetic reasoning, and other domains
  • Code Domain: Spiess et al. investigated LLM confidence reliability in code generation tasks

LLMs in Software Engineering

  • Code Generation and Repair: Substantial research concentrated on code generation or repair tasks
  • Code Reasoning: Relatively emerging research direction with existing work primarily focusing on operational mechanisms and performance evaluation
  • Benchmarks: Multiple code reasoning benchmarks have emerged, such as CRUXEval, REval, CodeMind, etc.

Conclusions and Discussion

Main Conclusions

  1. Significant Performance Differences: Current mainstream LLMs exhibit significant differences in confidence reliability on code reasoning tasks
  2. Reasoning Capability Advantage: Models with reasoning capabilities (e.g., DeepSeek-Reasoner) perform best
  3. Mathematical Calibration Effectiveness: Mathematical calibration methods such as Platt Scaling can systematically improve confidence reliability
  4. Substantial Improvement Space: Current LLMs' confidence has not reached ideal reliability levels, particularly in complex reasoning tasks

Limitations

  1. Benchmark-Reality Gap: Inevitable differences exist between benchmark datasets and real-world environments
  2. Model Selection Constraints: Does not include rapidly evolving code-specific LLMs
  3. Fixed Prompt Design: Uses unified standardized prompt design, potentially affecting result generalizability
  4. Fixed Temperature Parameter: Temperature parameter fixed at 0 may overlook its potential performance impact

Future Directions

  1. Confidence Generation Mechanisms: Investigate LLM confidence generation mechanisms in code reasoning tasks in depth
  2. Dynamic Calibration Strategies: Develop adaptive calibration methods and interval partitioning techniques
  3. Active Learning Integration: Deeply integrate confidence with active learning and risk control techniques
  4. Practical Balance: Maintain confidence distribution discriminability and interpretability while improving overall reliability

In-depth Evaluation

Strengths

  1. Significant Research Value: Fills the gap in confidence reliability research in the code reasoning domain
  2. Systematic and Complete Methodology: Proposes a four-step systematic analysis framework with rigorous methodology
  3. Sufficient Experimental Design: Covers multiple models, tasks, and improvement strategies with comprehensive experimental setup
  4. Convincing Results: Validates conclusion reliability through multiple metrics and statistical methods
  5. High Practical Value: Provides directly applicable technical guidance for software engineering practice

Limitations

  1. Single Calibration Method: Primarily employs Platt Scaling without exploring effects of other calibration methods
  2. Discriminability Loss: Mathematical calibration, while improving overall calibration, may reduce confidence discriminability
  3. Missing Code-specific Models: Does not include code-specific models such as CodeLlama and StarCoder
  4. Insufficient Dynamic Adaptability: Proposed methods are primarily static, lacking dynamic adaptability to different scenarios

Impact

  1. Academic Contribution: Opens new application domains for LLM confidence research
  2. Engineering Practice: Provides technical foundation for trustworthiness assessment in AI-assisted software development
  3. Standard Setting: May promote establishment of confidence evaluation standards for code reasoning tasks
  4. Subsequent Research: Provides important reference for in-depth research in related fields

Applicable Scenarios

  1. Code Review: Helps developers assess trustworthiness of AI-generated code
  2. Automated Testing: Provides confidence guidance in test case generation
  3. Code Refactoring: Offers trustworthiness assessment for refactoring suggestions
  4. Educational Training: Assists learners in understanding code logic in programming education

References

The paper cites important works in related fields, including:

  • Brier (1950): Classical work on probability prediction verification
  • Guo et al. (2017): Important research on modern neural network calibration
  • Jiang et al. (2021): Pioneering work on LLM confidence calibration
  • Spiess et al. (2024): Related research on LLM confidence in code tasks

Summary: This is a high-quality empirical research paper that systematically explores the confidence reliability of LLMs in code reasoning tasks. It combines rigorous methodology with comprehensive experiments, and its conclusions carry both theoretical and practical value for AI-assisted software engineering.