ADVICE: Answer-Dependent Verbalized Confidence Estimation

Seo, Lim, Kim
Recent progress in large language models (LLMs) has enabled them to express their confidence in natural language, enhancing transparency and reliability. However, their confidence often exhibits overconfidence, the cause of which remains poorly understood. In this work, we conduct a detailed analysis of the dynamics underlying verbalized confidence and identify answer-independence as a key factor, defined as the model's failure to condition confidence on its own answer. To address this, we propose ADVICE (Answer-Dependent Verbalized Confidence Estimation), a fine-tuning framework that facilitates answer-grounded confidence estimation. Extensive experiments show that ADVICE substantially improves confidence calibration while preserving task performance. Further analyses confirm that ADVICE strengthens answer-groundedness, leading to more balanced and well-calibrated confidence distributions. Our findings shed light on the origin of overconfidence and establish a framework for more trustworthy confidence verbalization.

Basic Information

  • Paper ID: 2510.10913
  • Title: ADVICE: Answer-Dependent Verbalized Confidence Estimation
  • Authors: Ki Jung Seo, Sehun Lim, Taeuk Kim (Hanyang University)
  • Category: cs.CL (Computation and Language)
  • Publication Date: October 13, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.10913

Abstract

Large language models (LLMs) have made significant progress in expressing confidence through natural language, enhancing transparency and reliability. However, their confidence estimates are often overconfident, and the underlying causes remain insufficiently understood. This study analyzes the intrinsic dynamics of verbalized confidence and identifies "answer-independence" as a key factor, defined as the model's failure to condition its confidence on its own generated answer. To address this issue, the authors propose ADVICE (Answer-Dependent Verbalized Confidence Estimation), a fine-tuning framework that promotes answer-dependent confidence estimation. Extensive experiments demonstrate that ADVICE significantly improves confidence calibration while maintaining task performance. Further analysis confirms that ADVICE enhances answer-dependency, producing more balanced and well-calibrated confidence distributions.

Research Background and Motivation

Problem Definition

  1. Core Problem: Large language models exhibit severe overconfidence bias when generating verbalized confidence, tending to express high confidence regardless of answer correctness
  2. Significance: When deploying LLMs in high-risk domains such as law and medicine, reliable confidence estimation is crucial for managing the model's inherent limitations
  3. Limitations of Existing Approaches:
    • Existing research primarily focuses on "how" to mitigate overconfidence rather than "why" it occurs
    • Lack of deep understanding of the intrinsic mechanisms of verbalized confidence
    • While prompting methods, sampling methods, and fine-tuning approaches show improvements, the underlying causes remain unclear

Research Motivation

Inspired by confidence estimation theories in neuroscience, the authors frame confidence estimation as a post-decision evidence accumulation process. Under this framing, confidence should be formed after, and conditioned on, the answer. The authors find that LLMs often ignore the information in their own generated answers when estimating confidence, which contradicts the definition of confidence itself.

Core Contributions

  1. Theoretical Finding: Is the first to systematically identify and analyze "answer-independence" as a fundamental cause of LLM overconfidence
  2. Analysis Method: Proposes a dual verification approach based on probability distribution comparison and attribution analysis to quantify answer-dependency
  3. Solution: Designs the ADVICE fine-tuning framework that explicitly encourages the model to focus on its generated answers when reporting confidence
  4. Empirical Validation: Validates the method's effectiveness across multiple datasets and models, demonstrating the importance of answer information in confidence estimation
  5. Generalization Capability: Demonstrates strong generalization ability on out-of-distribution tasks and balanced confidence distribution characteristics

Methodology Details

Task Definition

Given a question q and corresponding answer a, verbalized confidence should approximate the probability that the answer is correct: P(correct|q,a). Ideal confidence estimation should:

  • Express high confidence when the answer is correct
  • Express low confidence when the answer is incorrect
  • Adjust confidence levels based on answer content

Answer-Independence Analysis

1. Probability Distribution Comparison Method

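The authors test for answer-independence by checking whether the confidence distribution conditioned on a specific answer matches the confidence distribution conditioned on the question alone: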

P_M(C | q, a) ≈ P_M(C | q) ∀a ∈ A_q

where the right-hand side is expanded via the law of total probability:

P_M(C | q) = Σ_{a'∈A_q} P_M(C | q, a') P_M(a' | q)

Jensen-Shannon divergence (JSD) quantifies the difference between the two distributions; JSD values close to 0 indicate that the model's confidence is insensitive to answer information.
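The comparison above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the three-level confidence scale, and all probabilities are hypothetical, and base-2 logarithms are chosen only so that JSD falls in [0, 1].

```python
import math

def kl(p, q):
    """KL divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence with base-2 logs, so the value lies in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def marginal_confidence(cond_dists, answer_probs):
    """P(C | q) = sum over a' of P(C | q, a') * P(a' | q)."""
    n_levels = len(next(iter(cond_dists.values())))
    return [
        sum(answer_probs[a] * cond_dists[a][c] for a in cond_dists)
        for c in range(n_levels)
    ]

# Hypothetical confidence distributions over (low, medium, high) for two
# candidate answers to the same question, plus hypothetical P(a | q).
cond_dists = {
    "Paris": [0.05, 0.15, 0.80],
    "Lyon": [0.06, 0.16, 0.78],
}
answer_probs = {"Paris": 0.7, "Lyon": 0.3}

marginal = marginal_confidence(cond_dists, answer_probs)
scores = {a: jsd(dist, marginal) for a, dist in cond_dists.items()}
for answer, score in scores.items():
    print(f"JSD for answer {answer!r}: {score:.6f}")
```

In this toy setup both conditional distributions are nearly identical to the marginal, so both JSD values are close to 0, which is the signature of answer-independence described above.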

2. Attribution Analysis Method

  • Attention Rollout: Analyzes attention weights of confidence generation toward answer tokens
  • Integrated Gradients: Computes the contribution of answer tokens to confidence prediction
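As a rough illustration of the first attribution method, the sketch below implements the standard Attention Rollout recipe: average attention over heads, mix each layer with the identity matrix to account for residual connections, and multiply the layers together. The four-token sequence, the attention values, and the helper names are hypothetical and are not taken from the paper; Integrated Gradients is not covered here.

```python
def matmul(a, b):
    """Plain-Python matrix product of two square matrices."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]

def attention_rollout(layers):
    """Roll out attention across layers.

    `layers` holds head-averaged, row-stochastic attention matrices,
    lowest layer first. Each layer is mixed with the identity (0.5 / 0.5)
    to model the residual connection before being multiplied in.
    """
    n = len(layers[0])
    rollout = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn in layers:
        mixed = [
            [0.5 * attn[i][j] + (0.5 if i == j else 0.0) for j in range(n)]
            for i in range(n)
        ]
        rollout = matmul(mixed, rollout)
    return rollout

# Hypothetical causal attention for the token sequence
# [question_1, question_2, answer, confidence], reused for two layers.
layer = [
    [1.00, 0.00, 0.00, 0.00],
    [0.50, 0.50, 0.00, 0.00],
    [0.40, 0.40, 0.20, 0.00],
    [0.45, 0.45, 0.05, 0.05],
]
rollout = attention_rollout([layer, layer])

CONF, ANSWER = 3, 2
answer_share = rollout[CONF][ANSWER]
question_share = rollout[CONF][0] + rollout[CONF][1]
print(f"confidence -> answer:   {answer_share:.4f}")
print(f"confidence -> question: {question_share:.4f}")
```

A small `answer_share` relative to `question_share` at the confidence position is the kind of pattern that would indicate the confidence token draws little information from the answer tokens.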

ADVICE Framework Design

Training Data Construction

  1. Sample 2000 instances from TriviaQA
  2. For each question q, construct triplets (q, a_correct, a_wrong)
  3. Construct three linguistic format variants to enhance generalization

Training Objectives

Define three loss functions:

  1. Language Modeling Loss:
L_LM = (1/|a_correct|) Σ_{x_t∈a_correct} -log P(x_t | x_<t)

Preserves the model's original QA capability

  2. Contrastive Distribution Loss:
L_JSD = max(0, δ_JSD - D_JSD(P_correct || P_wrong))

Penalizes the model unless the confidence distributions for correct and incorrect answers differ by at least δ_JSD, so the model learns to distinguish between them

  3. Margin Loss:
L_Margin = max(0, δ_Margin - (μ_correct - μ_wrong))

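Ensures that correct answers receive higher expected confidence than incorrect ones by at least δ_Margin, where μ_correct and μ_wrong denote the expected confidence under each distribution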

Total loss function:

L = λ_LM L_LM + λ_JSD L_JSD + λ_Margin L_Margin
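The three objectives and their weighted sum can be sketched numerically as follows. This is a toy, framework-free illustration under stated assumptions: the thresholds (`delta_jsd`, `delta_margin`), the weights (`lam_*`), the numeric values assigned to confidence levels, and the base-2 JSD are all hypothetical choices, and a real implementation would compute these terms on model logits in a deep learning framework.

```python
import math

def jsd(p, q):
    """Jensen-Shannon divergence with base-2 logs (value in [0, 1])."""
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def advice_loss(token_probs, p_correct, p_wrong, values,
                delta_jsd=0.5, delta_margin=0.5,
                lam_lm=1.0, lam_jsd=1.0, lam_margin=1.0):
    """Return the total loss and its three components.

    token_probs: P(x_t | x_<t) for each token of the correct answer.
    p_correct / p_wrong: confidence distributions given the correct / wrong answer.
    values: numeric value of each confidence level (assumed mapping).
    """
    # L_LM: mean negative log-likelihood of the correct answer tokens.
    l_lm = -sum(math.log(p) for p in token_probs) / len(token_probs)
    # L_JSD: hinge on the divergence between the two confidence distributions.
    l_jsd = max(0.0, delta_jsd - jsd(p_correct, p_wrong))
    # L_Margin: hinge on the gap between expected confidences.
    mu_correct = sum(p * v for p, v in zip(p_correct, values))
    mu_wrong = sum(p * v for p, v in zip(p_wrong, values))
    l_margin = max(0.0, delta_margin - (mu_correct - mu_wrong))
    total = lam_lm * l_lm + lam_jsd * l_jsd + lam_margin * l_margin
    return {"total": total, "lm": l_lm, "jsd": l_jsd, "margin": l_margin}

values = [0.0, 0.5, 1.0]   # assumed values for (low, medium, high)
tokens = [0.9, 0.8]        # hypothetical answer-token probabilities

# Answer-independent model: same confidence distribution for both answers.
independent = advice_loss(tokens, [0.05, 0.15, 0.80], [0.05, 0.15, 0.80], values)
# Answer-dependent model: confidence flips with answer correctness.
dependent = advice_loss(tokens, [0.0, 0.0, 1.0], [1.0, 0.0, 0.0], values)
print(independent)
print(dependent)
```

With identical distributions both hinge terms are fully active, whereas a model that separates correct from incorrect answers pays only the language modeling term.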

Technical Innovations

  1. Root Cause Analysis: Is the first to analyze overconfidence from the perspective of answer-dependency
  2. Dual Verification: Combines probabilistic analysis and neural network attribution methods to validate hypotheses
  3. Contrastive Learning: Employs contrastive training using correct/incorrect answer pairs
  4. Multi-objective Optimization: Balances task performance preservation and confidence calibration improvement

Experimental Setup

Datasets

  • Training: TriviaQA (2000 instances)
  • Evaluation: TriviaQA, MMLU, SciQ, LogiQA (testing cross-domain generalization)

Models

  • LLAMA-3.1-8B-INSTRUCT
  • MISTRAL-7B-INSTRUCT-V0.3
  • GEMMA-2-9B-IT

Confidence Expression Types

  • ScoreText: {low, medium, high}
  • ScoreLetter: {E, D, C, B, A}
  • ScoreNumber: {0, 1, ..., 9}
  • ScoreFloat: a decimal value between 0.0 and 1.0
  • ScorePercent: {0%, 1%, ..., 100%}
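To compare these formats with a single set of calibration metrics, each verbalized value has to be placed on a common numeric scale. The sketch below shows one way to do that. The evenly spaced mapping for the ordinal scales and the function name `to_unit_interval` are assumptions made for illustration; the summary does not state how the paper converts each format.

```python
def to_unit_interval(kind, raw):
    """Map a verbalized confidence string to a number in [0, 1].

    Ordinal scales are spaced evenly, which is an assumption of this sketch.
    """
    raw = raw.strip()
    if kind == "ScoreText":
        levels = ["low", "medium", "high"]
        return levels.index(raw.lower()) / (len(levels) - 1)
    if kind == "ScoreLetter":
        levels = ["E", "D", "C", "B", "A"]
        return levels.index(raw.upper()) / (len(levels) - 1)
    if kind == "ScoreNumber":
        return int(raw) / 9
    if kind == "ScoreFloat":
        return float(raw)
    if kind == "ScorePercent":
        return int(raw.rstrip("%")) / 100
    raise ValueError(f"unknown confidence format: {kind}")

examples = [
    ("ScoreText", "medium"),
    ("ScoreLetter", "A"),
    ("ScoreNumber", "9"),
    ("ScoreFloat", "0.3"),
    ("ScorePercent", "80%"),
]
for kind, raw in examples:
    print(kind, raw, "->", to_unit_interval(kind, raw))
```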

Evaluation Metrics

  • ECE (Expected Calibration Error): Average absolute difference between predicted confidence and actual accuracy
  • NCE (Net Calibration Error): Signed calibration error reflecting bias
  • BS (Brier Score): Mean squared error of probability predictions
  • AUROC: Confidence ranking ability
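The four metrics can be sketched in plain Python as follows. Several details are assumptions of this sketch rather than statements about the paper: ten equal-width bins, and an NCE sign convention of accuracy minus confidence, chosen so that overconfidence yields a negative value (consistent with the negative Default NCE reported for an overconfident model in the results). AUROC is computed by pairwise comparison, with ties counted as half.

```python
def calibration_errors(confs, correct, n_bins=10):
    """Return (ECE, NCE) using equal-width confidence bins.

    ECE sums the bin-weighted absolute gap between accuracy and confidence;
    NCE sums the signed gap (accuracy - confidence), an assumed convention
    under which overconfidence is negative.
    """
    n = len(confs)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confs, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # c == 1.0 goes to the last bin
        bins[idx].append((c, y))
    ece = nce = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(y for _, y in b) / len(b)
        gap = acc - avg_conf
        ece += len(b) / n * abs(gap)
        nce += len(b) / n * gap
    return ece, nce

def brier_score(confs, correct):
    """Mean squared error between confidence and the 0/1 correctness label."""
    return sum((c - y) ** 2 for c, y in zip(confs, correct)) / len(confs)

def auroc(confs, correct):
    """Probability that a correct answer gets higher confidence than an incorrect one."""
    pos = [c for c, y in zip(confs, correct) if y == 1]
    neg = [c for c, y in zip(confs, correct) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical overconfident model: always says 0.9, right half the time.
confs, correct = [0.9, 0.9, 0.9, 0.9], [1, 1, 0, 0]
ece, nce = calibration_errors(confs, correct)
print(f"ECE={ece:.2f} NCE={nce:.2f} "
      f"BS={brier_score(confs, correct):.2f} AUROC={auroc(confs, correct):.2f}")
```

The toy model illustrates why the metrics are reported together: a constant high confidence produces a large ECE, a negative NCE, and an AUROC at chance level, because the confidence carries no ranking information about correctness.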

Baseline Methods

  • Default: Basic prompting method
  • Self-Consistency: Sampling-based method
  • ConfTuner: Current state-of-the-art fine-tuning method

Experimental Results

Main Results

Performance comparison on TriviaQA (GEMMA-2-9B-IT):

  • ECE: Default (21.9%) → ADVICE (6.5%)
  • NCE: Default (-21.8%) → ADVICE (1.6%)
  • AUROC: Default (52.7%) → ADVICE (78.5%)

Cross-domain generalization results show ADVICE achieves significant improvements on MMLU, SciQ, and LogiQA, demonstrating the method's robustness.

Ablation Studies

Analysis of loss function contributions:

  • L_JSD alone: ECE reduced from 19.7% to 4.9%
  • L_Margin alone: ECE reduced from 19.7% to 3.9%
  • Complete ADVICE: Best cross-dataset generalization capability

Key Findings

  1. Answer-Independence Verification: JSD distributions exhibit power-law patterns with most values close to 0, confirming the answer-independence hypothesis
  2. Attention Patterns: Attention from confidence tokens to answer tokens is significantly weaker than attention in other directions
  3. Calibration Improvement: Reliability diagrams show ADVICE produces finer-grained and more accurate confidence distributions
  4. Answer Awareness Enhancement: Masking experiments show ADVICE appropriately expresses uncertainty when answers are absent

Hyperparameter Analysis

Increasing δ_JSD continuously reduces ECE, validating the effectiveness of the contrastive learning objective.

Related Work

Verbalized Confidence Research

  • Lin et al. (2022) first introduced verbalized confidence estimation
  • Subsequent research primarily divides into three categories: prompting methods, sampling methods, and fine-tuning methods
  • This research fills the gap in mechanism analysis

LLM Probing Methods

  • Attention mechanism analysis: Attention Rollout, Attention Flow, etc.
  • Gradient attribution methods: Integrated Gradients, etc.
  • This research innovatively applies these methods to confidence analysis

Conclusions and Discussion

Main Conclusions

  1. LLM overconfidence primarily stems from answer-independence issues
  2. ADVICE effectively improves confidence calibration by enhancing answer-dependency
  3. The method demonstrates good generalization capability and practical value

Limitations

  1. Primarily focuses on short-text QA tasks; applicability to long-text comprehension tasks requires further verification
  2. Requires additional data construction costs to generate contrastive answer pairs
  3. Effectiveness on complex reasoning tasks requires further exploration

Future Directions

  1. Extend to tasks requiring long-context understanding and complex reasoning
  2. Explore more efficient training data construction methods
  3. Investigate applications in other modalities (e.g., vision-language models)

In-Depth Evaluation

Strengths

  1. Outstanding Theoretical Contribution: Is the first to systematically analyze the root cause of overconfidence, providing important theoretical insights
  2. Rigorous Methodology: Employs multi-perspective verification (probabilistic analysis + attribution analysis) with high credibility
  3. Well-Designed Experiments: Comprehensive evaluation across models and datasets with thorough ablation studies
  4. Significant Practical Value: Substantially improves confidence calibration while maintaining task performance
  5. Strong Generalization: Performs well on out-of-distribution data, demonstrating method robustness

Weaknesses

  1. Limited Task Scope: Primarily validates QA tasks; applicability to other NLP tasks insufficiently explored
  2. Computational Overhead: Requires additional fine-tuning and contrastive data construction
  3. Insufficient Theoretical Depth: While identifying answer-independence, analysis of its deeper causes is incomplete
  4. Long-term Effects: Does not evaluate model stability in long-term use after fine-tuning

Impact

  1. Academic Value: Provides new research perspectives and analytical frameworks for the confidence estimation field
  2. Practical Significance: Important for improving LLM reliability in high-risk applications
  3. Reproducibility: Provides implementation details and open-source code, which eases reproduction and extension

Applicable Scenarios

  • Question-answering systems requiring reliable confidence estimation
  • High-risk decision support systems
  • Uncertainty expression in human-machine collaboration scenarios
  • Model calibration and trustworthy AI applications

References

The paper cites 68 relevant references covering multiple fields including verbalized confidence, LLM probing methods, and calibration theory, providing a solid theoretical foundation for the research.


Overall Assessment: This is a high-quality research paper with important contributions in both theoretical analysis and practical methodology. The authors not only identify the root cause of LLM overconfidence but also propose an effective solution. The method is simple yet effective, the experimental design is rigorous, and the results are convincing. It has significant importance for advancing trustworthy AI and improving LLM reliability in practical applications.