Confidence-Based Response Abstinence: Improving LLM Trustworthiness via Activation-Based Uncertainty Estimation
Huang, Datla, Zhu et al.
We propose a method for confidence estimation in retrieval-augmented generation (RAG) systems that aligns closely with the correctness of large language model (LLM) outputs. Confidence estimation is especially critical in high-stakes domains such as finance and healthcare, where the cost of an incorrect answer outweighs that of not answering the question. Our approach extends prior uncertainty quantification methods by leveraging raw feed-forward network (FFN) activations as auto-regressive signals, avoiding the information loss inherent in token logits and probabilities after projection and softmax normalization. We model confidence prediction as a sequence classification task, and regularize training with a Huber loss term to improve robustness against noisy supervision. Applied in a real-world financial industry customer-support setting with complex knowledge bases, our method outperforms strong baselines and maintains high accuracy under strict latency constraints. Experiments on the Llama 3.1 8B model show that using activations from only the 16th layer preserves accuracy while reducing response latency. Our results demonstrate that activation-based confidence modeling offers a scalable, architecture-aware path toward trustworthy RAG deployment.
This paper proposes a confidence estimation method for Retrieval-Augmented Generation (RAG) systems that is closely correlated with the correctness of Large Language Model (LLM) outputs. Confidence estimation is particularly important in high-risk domains such as finance and healthcare, where the cost of providing incorrect answers far exceeds the cost of not answering questions. The method extends existing uncertainty quantification approaches by leveraging raw feedforward network (FFN) activations as autoregressive signals, avoiding the inherent information loss in token logits and probabilities after projection and softmax normalization. The authors model confidence prediction as a sequence classification task and employ Huber loss regularization during training to improve robustness to noisy supervision. In real-world financial industry customer support scenarios with complex knowledge bases, the method outperforms strong baselines while maintaining high accuracy under strict latency constraints.
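One part of the information-loss argument can be checked directly: softmax is invariant to adding a constant to every logit, so the overall scale of the logits cannot be recovered from the probabilities. The snippet below is a minimal sketch of that property only. The logit values and the helper name `softmax` are illustrative assumptions, not taken from the paper, and it does not model the additional loss from projecting FFN activations onto the vocabulary.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Two different logit vectors (illustrative values): b is a shifted by +5.
a = [2.0, 1.0, 0.0]
b = [z + 5.0 for z in a]

pa = softmax(a)
pb = softmax(b)

# The probabilities coincide even though the logits differ, so the shift
# (one piece of information present before normalization) is gone.
same_probs = all(math.isclose(p, q, rel_tol=1e-12) for p, q in zip(pa, pb))
print("logits differ:", a != b)
print("probabilities match:", same_probs)
```

A confidence estimator that reads only token probabilities therefore sees identical inputs for these two cases, whereas one that reads earlier signals could in principle distinguish them.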
In high-risk application scenarios, RAG systems should refuse to answer rather than provide incorrect responses. This requires a confidence measure that is strongly correlated with response correctness, allowing the system to mask responses when the confidence score falls below a threshold.
High-risk domain requirements: In strictly regulated domains such as finance and healthcare, the reputational and financial costs of providing incorrect answers far exceed the costs of not providing answers
Real-time deployment challenges: Existing methods perform poorly on lengthy narrative responses and under the latency requirements of production environments
Sources of uncertainty: Primarily epistemic uncertainty (insufficient model knowledge) rather than aleatoric uncertainty (inherent data randomness)
Proposes an activation-based confidence estimation method: Leverages raw FFN activations as autoregressive signals, avoiding information loss from token logits
Sequence classification framework: Models confidence prediction as a sequence classification task, using an LSTM to process activation sequences
Huber loss regularization: Introduces Huber loss to improve robustness to noisy supervision from the retrieval stage
Production environment validation: Verifies method effectiveness and scalability in real financial customer support scenarios
Efficiency optimization: Demonstrates that using only layer 16 activations significantly reduces latency while maintaining accuracy
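To make the Huber loss item in the list above concrete, the following is a minimal sketch of the standard Huber loss next to a squared-error loss. The threshold `delta=1.0` and the function names are assumptions chosen for illustration. How the paper combines this term with its classification objective is not specified in this summary and is not shown here.

```python
def huber(residual, delta=1.0):
    """Standard Huber loss: quadratic near zero, linear in the tails.

    delta is an illustrative default, not a value from the paper.
    """
    r = abs(residual)
    if r <= delta:
        return 0.5 * r * r
    return delta * (r - 0.5 * delta)

def squared(residual):
    """Half squared error, for comparison."""
    return 0.5 * residual * residual

# Small residuals are treated identically; large ones (for example, a
# mislabeled training example) are penalized far less by Huber.
for r in (0.5, 3.0, 10.0):
    print(f"residual={r:5.1f}  squared={squared(r):6.2f}  huber={huber(r):5.2f}")
```

Because the penalty grows only linearly beyond `delta`, a few badly labeled examples pull the fit less than they would under squared error, which is the usual motivation for using it under noisy supervision.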
Given input x and generated sequence s, the objective is to estimate a confidence score c that is strongly correlated with response correctness. When c falls below a threshold, the system refuses to display the response.
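The abstention rule just described can be sketched as a simple gate. The function name `respond_or_abstain` and the default threshold of 0.8 are hypothetical; the summary does not state the threshold used in production, and the confidence score c is assumed to come from a separately trained estimator.

```python
from typing import Optional

def respond_or_abstain(response: str, confidence: float,
                       threshold: float = 0.8) -> Optional[str]:
    """Return the response only if confidence is at or above the threshold.

    Returns None to signal that the system withholds the answer.
    The default threshold is an illustrative assumption.
    """
    if confidence < threshold:
        return None
    return response

# Usage: a confident answer is shown, a low-confidence one is withheld.
shown = respond_or_abstain("Your wire transfer limit is listed in the app.", 0.93)
hidden = respond_or_abstain("Your wire transfer limit is listed in the app.", 0.41)
print(shown)
print(hidden)
```

In practice the threshold would be tuned on held-out data to trade off how often the system answers against how often its displayed answers are wrong.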
Technical innovation: First systematic exploitation of FFN activations for RAG confidence estimation, avoiding information loss from token probabilities
Practical value: Validated in real financial scenarios with strong practical orientation
The paper cites key works from several related fields, among them uncertainty quantification, RAG systems, and activation analysis. Examples include:
Azaria and Mitchell (2023): LLM internal states and "lying" detection
Bakman et al. (2024): Semantic-based response scoring
Bao et al. (2024): HHEM entailment model
Dai et al. (2022): Knowledge neurons in pre-trained Transformers
Overall Assessment: This is a technically sound and highly practical paper that proposes an innovative solution to the important problem of confidence estimation in RAG systems. While it has certain limitations in generalizability and theoretical depth, its successful application in real-world scenarios and comprehensive experimental validation make it valuable for both academic and industrial purposes.