ADVICE: Answer-Dependent Verbalized Confidence Estimation

Seo, Lim, Kim

Recent progress in large language models (LLMs) has enabled them to express their confidence in natural language, enhancing transparency and reliability. However, their confidence often exhibits overconfidence, the cause of which remains poorly understood. In this work, we conduct a detailed analysis of the dynamics underlying verbalized confidence and identify answer-independence as a key factor, defined as the model's failure to condition confidence on its own answer. To address this, we propose ADVICE (Answer-Dependent Verbalized Confidence Estimation), a fine-tuning framework that facilitates answer-grounded confidence estimation. Extensive experiments show that ADVICE substantially improves confidence calibration while preserving task performance. Further analyses confirm that ADVICE strengthens answer-groundedness, leading to more balanced and well-calibrated confidence distributions. Our findings shed light on the origin of overconfidence and establish a framework for more trustworthy confidence verbalization.

Basic Information

Paper ID: 2510.10913
Title: ADVICE: Answer-Dependent Verbalized Confidence Estimation
Authors: Ki Jung Seo, Sehun Lim, Taeuk Kim (Hanyang University)
Category: cs.CL (Computation and Language)
Publication Date: October 13, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.10913

Abstract

Large language models (LLMs) have made significant progress in expressing confidence through natural language, enhancing transparency and reliability. However, their confidence estimates often exhibit overconfidence bias, whose underlying causes remain insufficiently understood. This study provides a detailed analysis of the intrinsic dynamics of verbalized confidence, identifying "answer-independence" as a key factor—the model's failure to modulate confidence based on its own generated answers. To address this issue, the authors propose ADVICE (Answer-Dependent Verbalized Confidence Estimation), a fine-tuning framework that promotes answer-dependent confidence estimation. Extensive experiments demonstrate that ADVICE significantly improves confidence calibration while maintaining task performance. Further analysis confirms that ADVICE enhances answer-dependency, producing more balanced and well-calibrated confidence distributions.

Research Background and Motivation

Problem Definition

Core Problem: Large language models exhibit severe overconfidence bias when generating verbalized confidence, tending to express high confidence regardless of answer correctness
Significance: When deploying LLMs in high-risk domains such as law and medicine, reliable confidence estimation is crucial for managing the model's inherent limitations
Limitations of Existing Approaches:
- Existing research primarily focuses on "how" to mitigate overconfidence rather than "why" it occurs
- Lack of deep understanding of the intrinsic mechanisms of verbalized confidence
- While prompting methods, sampling methods, and fine-tuning approaches show improvements, the underlying causes remain unclear

Research Motivation

Drawing on confidence estimation theories from neuroscience, the authors frame confidence estimation as a post-decision evidence accumulation process: confidence is formed after an answer has been chosen and should be conditioned on that answer. They find that LLMs often ignore their own generated answer when estimating confidence, which contradicts the definition of confidence as the probability that this particular answer is correct.

Core Contributions

Theoretical Finding: Provides the first systematic identification and analysis of "answer-independence" as a key factor behind LLM overconfidence
Analysis Method: Proposes a dual verification approach based on probability distribution comparison and attribution analysis to quantify answer-dependency
Solution: Designs the ADVICE fine-tuning framework that explicitly encourages the model to focus on its generated answers when reporting confidence
Empirical Validation: Validates the method's effectiveness across multiple datasets and models, demonstrating the importance of answer information in confidence estimation
Generalization Capability: Demonstrates strong generalization ability on out-of-distribution tasks and balanced confidence distribution characteristics

Methodology Details

Task Definition

Given a question q and corresponding answer a, verbalized confidence should approximate the probability that the answer is correct: P(correct|q,a). Ideal confidence estimation should:

Express high confidence when the answer is correct
Express low confidence when the answer is incorrect
Adjust confidence levels based on answer content

Answer-Independence Analysis

1. Probability Distribution Comparison Method

Verifies answer-independence by comparing the following two distributions:

P_M(C | q, a) ≈ P_M(C | q) ∀a ∈ A_q

where the right-hand side is expanded via the law of total probability:

P_M(C | q) = Σ_{a'∈A_q} P_M(C | q, a') P_M(a' | q)

Uses Jensen-Shannon divergence (JSD) to quantify the difference between the two distributions; JSD values close to 0 indicate the model is insensitive to answer information.
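
The comparison above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function names, the three-level confidence scale, the toy probabilities, and the choice of log base 2 (which bounds JSD to [0, 1]) are all assumptions made for the example.

```python
import math

def kl_bits(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence; with log base 2 it lies in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)

def marginal_confidence(cond_dists, answer_probs):
    """P(C | q) = sum over a' of P(C | q, a') * P(a' | q)."""
    n_levels = len(cond_dists[0])
    return [
        sum(w * dist[i] for dist, w in zip(cond_dists, answer_probs))
        for i in range(n_levels)
    ]

def answer_dependence(cond_dists, answer_probs):
    """JSD between each P(C | q, a) and the marginal P(C | q)."""
    marginal = marginal_confidence(cond_dists, answer_probs)
    return [jsd(dist, marginal) for dist in cond_dists]

# Hypothetical distributions over (low, medium, high) for two candidate answers.
answer_probs = [0.6, 0.4]
independent = [[0.05, 0.15, 0.80], [0.05, 0.15, 0.80]]  # same confidence for any answer
dependent = [[0.05, 0.15, 0.80], [0.70, 0.20, 0.10]]    # confidence shifts with the answer

print(answer_dependence(independent, answer_probs))  # values at (numerically) zero
print(answer_dependence(dependent, answer_probs))    # clearly positive values
```

A model whose confidence distribution does not change with the answer yields JSD values at zero for every candidate answer, which is the signature of answer-independence described above.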

2. Attribution Analysis Method

Attention Rollout: Analyzes attention weights of confidence generation toward answer tokens
Integrated Gradients: Computes the contribution of answer tokens to confidence prediction
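
As an illustration of the first attribution method, the sketch below implements the standard attention rollout recurrence (mix each head-averaged attention matrix with the identity to account for residual connections, then multiply across layers). The toy attention matrices, the token layout, and the helper names are hypothetical; the summary does not state how the authors configure rollout.

```python
def matmul(a, b):
    """Plain-Python matrix product for small square matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def attention_rollout(layers):
    """layers: head-averaged attention matrices, bottom layer first.
    Row i holds the attention that token i pays to every token."""
    n = len(layers[0])
    rollout = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn in layers:
        # Add the identity for the residual path, then renormalize each row.
        mixed = [
            [0.5 * attn[i][j] + (0.5 if i == j else 0.0) for j in range(n)]
            for i in range(n)
        ]
        mixed = [[v / sum(row) for v in row] for row in mixed]
        rollout = matmul(mixed, rollout)
    return rollout

def attention_share(rollout, query_idx, key_idxs):
    """Total rolled-out attention from one token to a group of tokens."""
    return sum(rollout[query_idx][j] for j in key_idxs)

# Hypothetical 4-token sequence: tokens 0-1 = question, 2 = answer, 3 = confidence.
layer = [
    [1.00, 0.00, 0.00, 0.00],
    [0.50, 0.50, 0.00, 0.00],
    [0.40, 0.40, 0.20, 0.00],
    [0.45, 0.45, 0.05, 0.05],
]
rollout = attention_rollout([layer, layer])
to_answer = attention_share(rollout, query_idx=3, key_idxs=[2])
to_question = attention_share(rollout, query_idx=3, key_idxs=[0, 1])
print(to_answer, to_question)
```

In this toy setup the confidence token directs far less rolled-out attention to the answer token than to the question tokens, which is the kind of pattern the attribution analysis is designed to expose.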

ADVICE Framework Design

Training Data Construction

Sample 2000 instances from TriviaQA
For each question q, construct triplets (q, a_correct, a_wrong)
Construct three linguistic format variants to enhance generalization

Training Objectives

Define three loss functions:

Language Modeling Loss:

L_LM = (1/|a_correct|) Σ_{x_t∈a_correct} -log P(x_t | x_<t)

Preserves the model's original QA capability

Contrastive Distribution Loss:

L_JSD = max(0, δ_JSD - D_JSD(P_correct || P_wrong))

Drives the model to learn to distinguish confidence distributions between correct and incorrect answers

Margin Loss:

L_Margin = max(0, δ_Margin - (μ_correct - μ_wrong))

Ensures correct answers receive higher expected confidence

Total loss function:

L = λ_LM L_LM + λ_JSD L_JSD + λ_Margin L_Margin
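
To make the interaction of the three terms concrete, here is a forward-only sketch of the combined objective in plain Python. It assumes the confidence distributions P_correct and P_wrong are already available as probability vectors over discrete confidence levels; the numeric level values, the thresholds δ_JSD and δ_Margin, the λ weights, and the log base are placeholder assumptions, not values reported in the paper.

```python
import math

def jsd(p, q):
    """Jensen-Shannon divergence in bits (log base 2 is an assumption)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def expected_confidence(dist, levels):
    """mu = sum over confidence levels of P(level) * numeric value of the level."""
    return sum(p * v for p, v in zip(dist, levels))

def advice_loss(l_lm, p_correct, p_wrong, levels,
                delta_jsd=0.5, delta_margin=0.3,          # hypothetical thresholds
                lam_lm=1.0, lam_jsd=1.0, lam_margin=1.0):  # hypothetical weights
    # Hinge on distribution separation: zero once JSD exceeds delta_jsd.
    l_jsd = max(0.0, delta_jsd - jsd(p_correct, p_wrong))
    # Hinge on expected-confidence gap: zero once the correct answer leads by delta_margin.
    gap = expected_confidence(p_correct, levels) - expected_confidence(p_wrong, levels)
    l_margin = max(0.0, delta_margin - gap)
    total = lam_lm * l_lm + lam_jsd * l_jsd + lam_margin * l_margin
    return total, l_jsd, l_margin

levels = [0.0, 0.5, 1.0]  # assumed numeric values for (low, medium, high)

# Answer-independent model: same confidence distribution for both answers.
same = [0.05, 0.15, 0.80]
print(advice_loss(1.2, same, same, levels))

# Well-separated model: high confidence when correct, low when wrong.
print(advice_loss(1.2, [0.0, 0.0, 1.0], [1.0, 0.0, 0.0], levels))
```

Both hinge terms are at their maximum when the model assigns identical confidence distributions to the correct and incorrect answers, and both vanish once the distributions are sufficiently separated, leaving only the language modeling term.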

Technical Innovations

Root Cause Analysis: First analyzes overconfidence from the perspective of answer-dependency
Dual Verification: Combines probabilistic analysis and neural network attribution methods to validate hypotheses
Contrastive Learning: Employs contrastive training using correct/incorrect answer pairs
Multi-objective Optimization: Balances task performance preservation and confidence calibration improvement

Experimental Setup

Datasets

Training: TriviaQA (2000 instances)
Evaluation: TriviaQA, MMLU, SciQ, LogiQA (testing cross-domain generalization)

Models

LLAMA-3.1-8B-INSTRUCT
MISTRAL-7B-INSTRUCT-V0.3
GEMMA-2-9B-IT

Confidence Expression Types

ScoreText: {low, medium, high}
ScoreLetter: {E, D, C, B, A}
ScoreNumber: {0, 1, ..., 9}
ScoreFloat: values from 0.0 to 1.0
ScorePercent: {0%, 1%, ..., 100%}

Evaluation Metrics

ECE (Expected Calibration Error): Bin-weighted average of the absolute gap between predicted confidence and observed accuracy
NCE (Net Calibration Error): Signed calibration error reflecting bias
BS (Brier Score): Mean squared error of probability predictions
AUROC: Confidence ranking ability
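
The four metrics can be sketched with the standard library alone. The equal-width binning, the bin count, and the function names are assumptions for illustration. The NCE sign convention (accuracy minus confidence, so negative values mean overconfidence) is inferred from the reported results, where the overconfident Default baseline has a negative NCE; the paper's exact definition may differ.

```python
def _nonempty_bins(confs, correct, n_bins):
    """Group (confidence, correctness) pairs into equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confs, correct):
        bins[min(int(c * n_bins), n_bins - 1)].append((c, y))
    return [b for b in bins if b]

def _bin_gaps(confs, correct, n_bins):
    """Yield (bin weight, accuracy - mean confidence) for each non-empty bin."""
    n = len(confs)
    for b in _nonempty_bins(confs, correct, n_bins):
        acc = sum(y for _, y in b) / len(b)
        avg_conf = sum(c for c, _ in b) / len(b)
        yield len(b) / n, acc - avg_conf

def ece(confs, correct, n_bins=10):
    return sum(w * abs(gap) for w, gap in _bin_gaps(confs, correct, n_bins))

def nce(confs, correct, n_bins=10):
    # Assumed sign: accuracy - confidence, so overconfidence is negative.
    return sum(w * gap for w, gap in _bin_gaps(confs, correct, n_bins))

def brier(confs, correct):
    return sum((c - y) ** 2 for c, y in zip(confs, correct)) / len(confs)

def auroc(confs, correct):
    """Probability that a correct answer gets higher confidence than an incorrect one."""
    pos = [c for c, y in zip(confs, correct) if y == 1]
    neg = [c for c, y in zip(confs, correct) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one correct and one incorrect answer")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical overconfident model: always 0.9 confidence, right half the time.
confs = [0.9, 0.9, 0.9, 0.9]
correct = [1, 0, 1, 0]
print(ece(confs, correct), nce(confs, correct), brier(confs, correct), auroc(confs, correct))
```

The toy model illustrates why ECE and AUROC are reported together: a constant high confidence produces a large calibration error and, because confidence carries no ranking information, an AUROC at chance level.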

Baseline Methods

Default: Basic prompting method
Self-Consistency: Sampling-based method
ConfTuner: Current state-of-the-art fine-tuning method

Experimental Results

Main Results

Performance comparison on TriviaQA (GEMMA-2-9B-IT):

ECE: Default (21.9%) → ADVICE (6.5%)
NCE: Default (-21.8%) → ADVICE (1.6%)
AUROC: Default (52.7%) → ADVICE (78.5%)

Cross-domain generalization results show ADVICE achieves significant improvements on MMLU, SciQ, and LogiQA, demonstrating the method's robustness.

Ablation Studies

Analysis of loss function contributions:

L_JSD alone: ECE reduced from 19.7% to 4.9%
L_Margin alone: ECE reduced from 19.7% to 3.9%
Complete ADVICE: Best cross-dataset generalization capability

Key Findings

Answer-Independence Verification: JSD distributions exhibit power-law patterns with most values close to 0, confirming the answer-independence hypothesis
Attention Patterns: Attention from confidence tokens to answer tokens is markedly weaker than attention along other token-to-token directions
Calibration Improvement: Reliability diagrams show ADVICE produces finer-grained and more accurate confidence distributions
Answer Awareness Enhancement: Masking experiments show ADVICE appropriately expresses uncertainty when answers are absent

Hyperparameter Analysis

ECE decreases steadily as δ_JSD increases, which supports the effectiveness of the contrastive distribution objective.

Related Work

Verbalized Confidence Research

Lin et al. (2022) first introduced verbalized confidence estimation
Subsequent research primarily divides into three categories: prompting methods, sampling methods, and fine-tuning methods
This research fills the gap in mechanism analysis

LLM Probing Methods

Attention mechanism analysis: Attention Rollout, Attention Flow, etc.
Gradient attribution methods: Integrated Gradients, etc.
This research innovatively applies these methods to confidence analysis

Conclusions and Discussion

Main Conclusions

LLM overconfidence primarily stems from answer-independence issues
ADVICE effectively improves confidence calibration by enhancing answer-dependency
The method demonstrates good generalization capability and practical value

Limitations

Primarily focuses on short-text QA tasks; applicability to long-text comprehension tasks requires further verification
Requires additional data construction costs to generate contrastive answer pairs
Effectiveness on complex reasoning tasks requires further exploration

Future Directions

Extend to tasks requiring long-context understanding and complex reasoning
Explore more efficient training data construction methods
Investigate applications in other modalities (e.g., vision-language models)

In-Depth Evaluation

Strengths

Outstanding Theoretical Contribution: First systematically analyzes the root cause of overconfidence, providing important theoretical insights
Rigorous Methodology: Employs multi-perspective verification (probabilistic analysis + attribution analysis) with high credibility
Well-Designed Experiments: Comprehensive evaluation across models and datasets with thorough ablation studies
Significant Practical Value: Improves confidence calibration substantially while maintaining task performance
Strong Generalization: Performs well on out-of-distribution data, demonstrating method robustness

Weaknesses

Limited Task Scope: Primarily validates QA tasks; applicability to other NLP tasks insufficiently explored
Computational Overhead: Requires additional fine-tuning and contrastive data construction
Insufficient Theoretical Depth: While identifying answer-independence, analysis of its deeper causes is incomplete
Long-term Effects: Does not evaluate model stability in long-term use after fine-tuning

Impact

Academic Value: Provides new research perspectives and analytical frameworks for the confidence estimation field
Practical Significance: Important for improving LLM reliability in high-risk applications
Reproducibility: Provides implementation details and open-source code for easy reproduction and extension

Applicable Scenarios

Question-answering systems requiring reliable confidence estimation
High-risk decision support systems
Uncertainty expression in human-machine collaboration scenarios
Model calibration and trustworthy AI applications

References

The paper cites 68 relevant references covering multiple fields including verbalized confidence, LLM probing methods, and calibration theory, providing a solid theoretical foundation for the research.

Overall Assessment: This is a high-quality paper that contributes both a mechanistic analysis and a practical method. The authors identify answer-independence as a key factor behind LLM overconfidence and propose a fine-tuning framework that addresses it. The method is simple, the experimental design is rigorous, and the results are convincing. The work matters for advancing trustworthy AI and for improving LLM reliability in practical applications.