Confidence-Based Response Abstinence: Improving LLM Trustworthiness via Activation-Based Uncertainty Estimation

Huang, Datla, Zhu et al.

We propose a method for confidence estimation in retrieval-augmented generation (RAG) systems that aligns closely with the correctness of large language model (LLM) outputs. Confidence estimation is especially critical in high-stakes domains such as finance and healthcare, where the cost of an incorrect answer outweighs that of not answering the question. Our approach extends prior uncertainty quantification methods by leveraging raw feed-forward network (FFN) activations as auto-regressive signals, avoiding the information loss inherent in token logits and probabilities after projection and softmax normalization. We model confidence prediction as a sequence classification task, and regularize training with a Huber loss term to improve robustness against noisy supervision. Applied in a real-world financial industry customer-support setting with complex knowledge bases, our method outperforms strong baselines and maintains high accuracy under strict latency constraints. Experiments on the Llama 3.1 8B model show that using activations from only the 16th layer preserves accuracy while reducing response latency. Our results demonstrate that activation-based confidence modeling offers a scalable, architecture-aware path toward trustworthy RAG deployment.


Basic Information

Paper ID: 2510.13750
Title: Confidence-Based Response Abstinence: Improving LLM Trustworthiness via Activation-Based Uncertainty Estimation
Authors: Zhiqi Huang, Vivek Datla, Chenyang Zhu, Alfy Samuel, Daben Liu, Anoop Kumar, Ritesh Soni (Capital One)
Category: cs.CL (Computation and Language)
Publication Date: October 16, 2025 (arXiv v2)
Paper Link: https://arxiv.org/abs/2510.13750v2

Abstract

This paper proposes a confidence estimation method for Retrieval-Augmented Generation (RAG) systems that is closely correlated with the correctness of Large Language Model (LLM) outputs. Confidence estimation is particularly important in high-risk domains such as finance and healthcare, where the cost of providing incorrect answers far exceeds the cost of not answering questions. The method extends existing uncertainty quantification approaches by leveraging raw feedforward network (FFN) activations as autoregressive signals, avoiding the inherent information loss in token logits and probabilities after projection and softmax normalization. The authors model confidence prediction as a sequence classification task and employ Huber loss regularization during training to improve robustness to noisy supervision. In real-world financial industry customer support scenarios with complex knowledge bases, the method outperforms strong baselines while maintaining high accuracy under strict latency constraints.

Research Background and Motivation

Problem Definition

In high-risk application scenarios, RAG systems should refuse to answer rather than provide incorrect responses. This requires a confidence measure that is strongly correlated with response correctness, allowing the system to mask responses when the confidence score falls below a threshold.

Problem Importance

High-risk domain requirements: In strictly regulated domains such as finance and healthcare, the reputational and financial costs of providing incorrect answers far exceed the costs of not providing answers
Real-time deployment challenges: Existing methods perform poorly on lengthy narrative responses and under the latency requirements of production environments
Sources of uncertainty: Primarily epistemic uncertainty (insufficient model knowledge) rather than aleatoric uncertainty (inherent data randomness)

Limitations of Existing Methods

Sampling-based methods: Require multiple generations, introducing excessive computational costs and latency in production environments
Token probability methods: Perform poorly on long responses; a single low-probability word may disproportionately reduce the overall sequence score
Information loss: Token probabilities lose rich internal representation information after linear projection and softmax transformation
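
The weakness of token probability methods on long responses can be illustrated with a toy calculation. The sketch below is not the paper's logits-based baseline; the two aggregation functions and the probability values are hypothetical, chosen only to show how one rare word dominates a sequence-level score.

```python
import math

def sequence_log_prob(token_probs):
    """Log of the joint sequence probability (sum of token log-probabilities)."""
    return sum(math.log(p) for p in token_probs)

def min_token_prob(token_probs):
    """Score a sequence by its least likely token."""
    return min(token_probs)

# Hypothetical 50-token answers: one uniformly confident, one with a single rare word.
confident = [0.95] * 50
one_rare_word = [0.95] * 49 + [0.01]

lp_confident = sequence_log_prob(confident)
lp_rare = sequence_log_prob(one_rare_word)

# Ratio of the two joint probabilities; it equals 0.01 / 0.95 regardless of length.
ratio = math.exp(lp_rare - lp_confident)
print(f"joint probability shrinks by a factor of {1 / ratio:.0f}")
print(f"min-token score drops from {min_token_prob(confident)} to {min_token_prob(one_rare_word)}")
```

Both scores collapse because of a single token, even though the other 49 tokens are identical, which is the disproportionate effect described above.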

Core Contributions

Proposes activation-based confidence estimation method: Leverages raw FFN activations as autoregressive signals, avoiding information loss from token logits
Sequence classification framework: Models confidence prediction as a sequence classification task using LSTM to process activation sequences
Huber loss regularization: Introduces Huber loss to improve robustness to noisy supervision from the retrieval stage
Production environment validation: Verifies method effectiveness and scalability in real financial customer support scenarios
Efficiency optimization: Demonstrates that using only layer 16 activations significantly reduces latency while maintaining accuracy

Methodology Details

Task Definition

Given input x and generated sequence s, the objective is to estimate a confidence score c that is strongly correlated with response correctness. When c falls below a threshold, the system refuses to display the response.
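
The abstention rule in the task definition can be sketched as a small gate. This is a minimal illustration, assuming a hypothetical `gate_response` helper, a hypothetical fallback message, and a default threshold of 0.5; none of these names come from the paper.

```python
from dataclasses import dataclass

# Hypothetical fallback text shown when a response is masked.
FALLBACK = "I can't answer that reliably. Please contact an agent."

@dataclass
class GatedResponse:
    text: str
    confidence: float
    shown: bool

def gate_response(answer: str, confidence: float, threshold: float = 0.5) -> GatedResponse:
    """Show the answer only if the confidence score c is not below the threshold."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    shown = confidence >= threshold
    return GatedResponse(text=answer if shown else FALLBACK, confidence=confidence, shown=shown)

print(gate_response("Your card ships in 7 days.", 0.82).shown)   # shown
print(gate_response("Your card ships in 7 days.", 0.31).text)    # masked, fallback text
```

Raising the threshold trades recall for precision, which is the trade-off reported in the confidence threshold analysis.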

Model Architecture

Overall Framework

Input sequence construction:

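$x = x_I \oplus x_Q \oplus x_C \oplus s \oplus x_{EOS}$

where $x_I$ is the instruction, $x_Q$ the question, $x_C$ the context, $s$ the generated answer, and $x_{EOS}$ the end-of-sequence token; $\oplus$ denotes concatenation.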

Activation Extraction

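Extract hidden state activations from Transformer layer $\ell$:

$H_\ell = (h_\ell^{1}, \ldots, h_\ell^{T+L+1})$

Retain only activations corresponding to the answer portion:

$S_{in} = (h_\ell^{T+1}, h_\ell^{T+2}, \ldots, h_\ell^{T+L+1})$

Here T is presumably the number of prompt tokens (instruction, question, and context) and L the number of answer tokens, so the retained span covers the answer and the final EOS position.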
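
The slicing step can be sketched with plain Python lists standing in for one layer's per-token activations. The helper name `answer_activations` and the toy sizes are hypothetical, and the reading of T as the prompt length and L as the answer length is an assumption inferred from the index ranges above.

```python
from typing import List, Sequence

Vector = List[float]

def answer_activations(layer_states: Sequence[Vector], prompt_len: int) -> List[Vector]:
    """Return the states at 1-indexed positions T+1 .. T+L+1 (answer tokens plus EOS)."""
    if not 0 <= prompt_len < len(layer_states):
        raise ValueError("prompt_len must leave at least one answer position")
    # 1-indexed position T+1 corresponds to 0-indexed position T.
    return [list(h) for h in layer_states[prompt_len:]]

# Toy layer output: T = 4 prompt tokens, L = 3 answer tokens, 1 EOS token, hidden size 2.
T, L = 4, 3
layer_states = [[float(t), float(t) + 0.5] for t in range(1, T + L + 2)]

s_in = answer_activations(layer_states, prompt_len=T)
print(len(s_in), s_in[0])
```

In Hugging Face Transformers, calling a model with `output_hidden_states=True` returns per-layer hidden states that could feed such a slice. Those are block outputs, though, and this summary does not specify exactly which tensor the paper taps.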

Sequence Classifier

Use an LSTM as the sequence classifier $g(S_{in})$, outputting a 2-dimensional logit vector $z = (z_0, z_1)$, with confidence score:

$c = \mathrm{softmax}(z)_1 = \frac{e^{z_1}}{e^{z_0} + e^{z_1}}$
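
To make the classifier concrete, here is a minimal pure-Python sketch of a single-layer LSTM forward pass followed by a 2-way linear head and the softmax confidence. It uses untrained random weights and toy dimensions; the class name `TinyLSTMClassifier`, the hidden size, and the initialization are all assumptions, since this summary does not give the paper's classifier configuration.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

class TinyLSTMClassifier:
    """Single-layer LSTM over activation vectors, then a linear head producing 2 logits."""

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        # One weight matrix per gate, applied to the concatenation [x_t, h_{t-1}].
        self.W = {k: rand_matrix(hidden_dim, input_dim + hidden_dim, rng) for k in "ifog"}
        self.b = {k: [0.0] * hidden_dim for k in "ifog"}
        self.W_out = rand_matrix(2, hidden_dim, rng)
        self.b_out = [0.0, 0.0]

    def logits(self, s_in):
        h = [0.0] * self.hidden_dim
        cell = [0.0] * self.hidden_dim
        for x_t in s_in:
            xh = list(x_t) + h
            pre = {k: [a + b for a, b in zip(matvec(self.W[k], xh), self.b[k])] for k in "ifog"}
            i_gate = [sigmoid(v) for v in pre["i"]]    # input gate
            f_gate = [sigmoid(v) for v in pre["f"]]    # forget gate
            o_gate = [sigmoid(v) for v in pre["o"]]    # output gate
            cand = [math.tanh(v) for v in pre["g"]]    # candidate cell update
            cell = [f * c + i * g for f, c, i, g in zip(f_gate, cell, i_gate, cand)]
            h = [o * math.tanh(c) for o, c in zip(o_gate, cell)]
        # Classify from the final hidden state.
        return [a + b for a, b in zip(matvec(self.W_out, h), self.b_out)]

def confidence(z):
    """c = softmax(z)_1, computed with max-subtraction for numerical stability."""
    m = max(z)
    e0, e1 = math.exp(z[0] - m), math.exp(z[1] - m)
    return e1 / (e0 + e1)

model = TinyLSTMClassifier(input_dim=3, hidden_dim=4, seed=0)
z = model.logits([[0.1, 0.2, 0.3], [0.0, -0.1, 0.2]])
c = confidence(z)
print(f"confidence = {c:.3f}")
```

With two classes, the softmax confidence equals `sigmoid(z[1] - z[0])`, so only the logit difference matters.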

Training Strategy

Loss Function

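Combines cross-entropy loss with Huber loss regularization:

$L_{\mathrm{Total}} = L_{\mathrm{CE}} + \lambda L_{\mathrm{Huber}}$

Huber loss defined as:

$H_\delta(x) = \begin{cases} \frac{1}{2}x^2 & \text{for } |x| \le \delta \\ \delta\left(|x| - \frac{1}{2}\delta\right) & \text{otherwise} \end{cases}$

Batch-level Huber loss:

$L_{\mathrm{Huber}} = H_\delta\left(\frac{1}{|B|}\sum_{i \in B} c_i - \frac{1}{|B|}\sum_{i \in B} \mathbb{I}(\hat{y}_i = y_i)\right)$

where $B$ is a training batch and $\mathbb{I}(\cdot)$ is the indicator function.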
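
The combined objective can be written out in a few lines of plain Python. This is a sketch under stated assumptions: the cross-entropy is taken as binary cross-entropy on c, the predictions ŷ are passed in explicitly because this summary does not say how they are obtained, and the values of `lam` and `delta` are placeholders rather than the paper's settings.

```python
import math

def huber(x, delta=1.0):
    """H_delta(x): quadratic near zero, linear in the tails."""
    ax = abs(x)
    if ax <= delta:
        return 0.5 * x * x
    return delta * (ax - 0.5 * delta)

def cross_entropy(confidences, labels, eps=1e-12):
    """Mean binary cross-entropy, treating c as the probability of label 1."""
    total = 0.0
    for c, y in zip(confidences, labels):
        p = c if y == 1 else 1.0 - c
        total -= math.log(max(p, eps))
    return total / len(labels)

def batch_huber(confidences, predictions, labels, delta=1.0):
    """Huber penalty on the gap between mean confidence and mean agreement I(y_hat = y)."""
    mean_conf = sum(confidences) / len(confidences)
    agreement = sum(1.0 for p, y in zip(predictions, labels) if p == y) / len(labels)
    return huber(mean_conf - agreement, delta)

def total_loss(confidences, predictions, labels, lam=0.1, delta=1.0):
    """L_Total = L_CE + lambda * L_Huber for one batch."""
    return cross_entropy(confidences, labels) + lam * batch_huber(confidences, predictions, labels, delta)

# Toy batch of four examples.
conf = [0.9, 0.8, 0.7, 0.6]
pred = [1, 1, 0, 1]
gold = [1, 1, 1, 1]
print(f"L_Huber = {batch_huber(conf, pred, gold):.6f}, L_Total = {total_loss(conf, pred, gold):.4f}")
```

Because the Huber term acts on batch averages, it penalizes systematic over- or under-confidence rather than individual examples, which is consistent with the "calib." label used in the results table.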

Technical Innovations

Raw activations vs. token probabilities: Avoids information compression and distortion caused by linear projection and softmax
Autoregressive sequence modeling: Uses LSTM to capture temporal dependencies in the generation process
Robustness regularization: Huber loss is more robust to noisy labels introduced by retrieval errors
Layer-wise optimization: Determines optimal activation extraction layer through experimentation

Experimental Setup

Dataset

Source: Capital One internal financial customer support knowledge base
Scale: 8.5k documents, approximately 45k chunks
Characteristics: Semi-structured documents with complex hierarchical structures, tables, lists, etc.
Annotation: Two-layer validation mechanism combining real-time feedback with subject-matter expert (SME) assessment

Evaluation Metrics

AUROC: Discriminative ability of confidence scores
Precision (P): Fraction of displayed responses that are correct
Recall (R): Fraction of correct responses that are displayed
ROUGE-L: Response quality assessment
Mask Rate: Proportion of masked responses
Latency: Average and P99 response times
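
The gating metrics above can be computed directly from confidence scores and correctness labels. The sketch below is a generic reference implementation, not the paper's evaluation code; the function names and the six-example toy data are hypothetical.

```python
def auroc(scores, labels):
    """Probability that a random correct response outscores a random incorrect one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect response")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def gate_metrics(scores, labels, threshold):
    """Precision and recall of displayed responses, plus the mask rate, at one threshold."""
    shown = [y for s, y in zip(scores, labels) if s >= threshold]
    shown_correct = sum(shown)
    total_correct = sum(labels)
    precision = shown_correct / len(shown) if shown else float("nan")
    recall = shown_correct / total_correct if total_correct else float("nan")
    mask_rate = 1.0 - len(shown) / len(scores)
    return precision, recall, mask_rate

# Toy data: label 1 means the response was judged correct.
scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30]
labels = [1, 1, 0, 1, 1, 0]

print(f"AUROC = {auroc(scores, labels):.2f}")
precision, recall, mask_rate = gate_metrics(scores, labels, threshold=0.5)
print(f"P = {precision:.2f}, R = {recall:.2f}, mask rate = {mask_rate:.1%}")
```

AUROC is threshold-free, while precision, recall, and mask rate all depend on the chosen threshold, which is why the results report them separately.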

Baseline Methods

Vectara (HHEM2.1): Semantic consistency model based on entailment
VectaraFT: Fine-tuned version of Vectara
Logits-based: Uncertainty model based on token logits

Implementation Details

Model: Llama 3.1 8B
Activation layers: Layer 16 and Layer 32
Context size: Top-1, Top-3, Top-5, Full (Top-7)
Inference framework: Hugging Face, vLLM

Experimental Results

Main Results

Method	AUROC
Vectara	0.590
VectaraFT	0.634
Logits-based	0.663
Our Model (no calib.)	0.741
Our Model (with calib.)	0.772

Confidence Threshold Analysis

Threshold	Precision	Recall	ROUGE-L (Shown/Masked)	Mask Rate
0.5	0.95	0.73	0.65/0.57	29.9%
0.7	0.96	0.65	0.66/0.57	38.6%
0.9	0.97	0.52	0.67/0.58	52.0%
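
As a back-of-the-envelope consistency check on this table, precision, recall, and mask rate together imply the fraction of all responses that are correct before gating: correct-and-shown responses make up P × (1 − mask rate) of the total, and dividing by R gives the overall share of correct responses. The sketch assumes precision and recall are defined over displayed and correct responses as in the evaluation metrics, and that all three rows use the same test set. The derived value is not a figure reported in the paper.

```python
def implied_base_accuracy(precision, recall, mask_rate):
    """Share of all responses that are correct, implied by P, R and the mask rate."""
    return precision * (1.0 - mask_rate) / recall

# (threshold, precision, recall, mask rate) copied from the table above.
rows = [
    (0.5, 0.95, 0.73, 0.299),
    (0.7, 0.96, 0.65, 0.386),
    (0.9, 0.97, 0.52, 0.520),
]

implied = [implied_base_accuracy(p, r, m) for _, p, r, m in rows]
for (threshold, _, _, _), value in zip(rows, implied):
    print(f"threshold {threshold}: implied base accuracy = {value:.3f}")
```

All three rows give a similar value, roughly 0.90 to 0.91, with the spread attributable to rounding in the reported figures. This suggests the rows are mutually consistent.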

Layer and Context Optimization

Layer 16 vs. Layer 32:

Layer 16 significantly reduces latency (approximately 42.5%) while maintaining similar performance
Under Full context setting, Layer 16 achieves 0.97 precision with 31.3% mask rate

Latency Analysis:

Framework	Layer	Context	Avg Latency (ms)	P99 Latency (ms)
vLLM	16	Full	127	267
vLLM	32	Full	206	354
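
The relative latency savings implied by these two rows can be worked out directly. The helper below is a plain arithmetic sketch with a hypothetical function name; it uses only the four numbers in the table.

```python
def relative_reduction(baseline_ms, new_ms):
    """Fractional latency reduction of new_ms relative to baseline_ms."""
    return (baseline_ms - new_ms) / baseline_ms

# vLLM, Full context: Layer 32 is the baseline, Layer 16 is the faster variant.
avg_reduction = relative_reduction(206, 127)
p99_reduction = relative_reduction(354, 267)

print(f"average latency reduction: {avg_reduction:.1%}")
print(f"P99 latency reduction:     {p99_reduction:.1%}")
```

These rows give reductions of about 38% (average) and 25% (P99). The approximately 42.5% figure quoted above is not reproduced by them, so it presumably refers to a configuration not listed in this table.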

Ablation Studies

Effect of Huber loss: Improves AUROC from 0.741 to 0.772 (the "no calib." and "with calib." rows of the main results)
Activation layer selection: Layer 16 performance approaches Layer 32 with lower latency
Context size impact: Larger context improves accuracy but increases latency

Classification of Uncertainty Quantification Methods

Sampling-based methods: Measure consistency through multiple generations, but computationally expensive
Probability-based methods: Utilize token probabilities and semantic entropy, but limited effectiveness on long texts
Classification-based methods: Such as HHEM, avoid multiple generations but rely only on black-box access to the model's inputs and outputs
Activation-based methods: Leverage internal representations, the main contribution direction of this work

Advantages of This Work

Compared to sampling methods: Needs a single additional forward pass instead of multiple generations, giving lower latency
Compared to probability methods: Preserves complete internal representations with minimal information loss
Compared to black-box methods: Leverages white-box access to obtain richer signals

Conclusions and Discussion

Main Conclusions

Effectiveness: Activation-based method significantly outperforms existing baselines with AUROC of 0.772
Practicality: At a confidence threshold of 0.5, achieves 0.95 precision with a 29.9% mask rate in a production setting
Efficiency: Layer 16 activations substantially reduce latency while maintaining performance
Robustness: Huber loss effectively improves robustness to noisy supervision

Limitations

White-box dependency: Requires access to model internal activations, limiting generalizability
Architecture-specific: Method is customized for specific model architectures; transfer requires reconfiguration
Two-stage processing: Requires additional forward pass to compute confidence scores
Data limitations: Experimental data cannot be made public, affecting reproducibility

Future Directions

End-to-end integration: Directly integrate confidence estimation into the generation process
Architecture-agnostic: Develop universal methods applicable to multiple LLM architectures
Computational optimization: Further reduce computational overhead of confidence estimation
Theoretical analysis: Deepen understanding of theoretical relationships between activation patterns and confidence

In-Depth Evaluation

Strengths

Technical innovation: First systematic exploitation of FFN activations for RAG confidence estimation, avoiding information loss from token probabilities
Practical value: Validated in real financial scenarios with strong practical orientation
Comprehensive experiments: Thorough ablation studies across multiple dimensions (layers, context, latency)
Engineering considerations: Adequately addresses production environment latency constraints and scalability requirements

Weaknesses

Generalizability limitations: Method depends on white-box access and specific architecture, limiting transferability
Theoretical foundation: Lacks in-depth theoretical analysis of why FFN activations can predict confidence
Data transparency: Proprietary dataset cannot be made public, affecting result verifiability
Limited comparisons: Insufficient comparison with more recent uncertainty quantification methods

Impact

Academic contribution: Provides new technical pathway for trustworthiness research in RAG systems
Industrial value: Offers practical solution for LLM deployment in high-risk domains
Methodological inspiration: Activation-based approach may inspire more research on internal representation utilization

Applicable Scenarios

High-risk domains: Finance, healthcare, law and other scenarios with extremely high accuracy requirements
White-box deployment: Enterprise applications with model internal access
Real-time systems: Scenarios requiring trustworthy responses under strict latency constraints
Professional knowledge bases: RAG applications with structured, specialized knowledge bases

References

This paper cites key works on uncertainty quantification, RAG systems, and activation analysis, including:

Azaria and Mitchell (2023): LLM internal states and "lying" detection
Bakman et al. (2024): Semantic-based response scoring
Bao et al. (2024): HHEM entailment model
Dai et al. (2022): Knowledge neurons in pre-trained Transformers

Overall Assessment: This is a technically sound and highly practical paper that proposes an innovative solution to the important problem of confidence estimation in RAG systems. While it has certain limitations in generalizability and theoretical depth, its successful application in real-world scenarios and comprehensive experimental validation make it valuable for both academic and industrial purposes.