Confidence-Based Response Abstinence: Improving LLM Trustworthiness via Activation-Based Uncertainty Estimation

Huang, Datla, Zhu et al.
We propose a method for confidence estimation in retrieval-augmented generation (RAG) systems that aligns closely with the correctness of large language model (LLM) outputs. Confidence estimation is especially critical in high-stakes domains such as finance and healthcare, where the cost of an incorrect answer outweighs that of not answering the question. Our approach extends prior uncertainty quantification methods by leveraging raw feed-forward network (FFN) activations as auto-regressive signals, avoiding the information loss inherent in token logits and probabilities after projection and softmax normalization. We model confidence prediction as a sequence classification task and regularize training with a Huber loss term to improve robustness against noisy supervision. Applied in a real-world financial industry customer-support setting with complex knowledge bases, our method outperforms strong baselines and maintains high accuracy under strict latency constraints. Experiments on the Llama 3.1 8B model show that using activations from only the 16th layer preserves accuracy while reducing response latency. Our results demonstrate that activation-based confidence modeling offers a scalable, architecture-aware path toward trustworthy RAG deployment.

Basic Information

  • Paper ID: 2510.13750
  • Title: Confidence-Based Response Abstinence: Improving LLM Trustworthiness via Activation-Based Uncertainty Estimation
  • Authors: Zhiqi Huang, Vivek Datla, Chenyang Zhu, Alfy Samuel, Daben Liu, Anoop Kumar, Ritesh Soni (Capital One)
  • Category: cs.CL (Computation and Language)
  • Publication Date: October 16, 2025 (arXiv v2)
  • Paper Link: https://arxiv.org/abs/2510.13750v2

Abstract

This paper proposes a confidence estimation method for Retrieval-Augmented Generation (RAG) systems that is closely correlated with the correctness of Large Language Model (LLM) outputs. Confidence estimation is particularly important in high-risk domains such as finance and healthcare, where the cost of providing incorrect answers far exceeds the cost of not answering questions. The method extends existing uncertainty quantification approaches by leveraging raw feedforward network (FFN) activations as autoregressive signals, avoiding the inherent information loss in token logits and probabilities after projection and softmax normalization. The authors model confidence prediction as a sequence classification task and employ Huber loss regularization during training to improve robustness to noisy supervision. In real-world financial industry customer support scenarios with complex knowledge bases, the method outperforms strong baselines while maintaining high accuracy under strict latency constraints.

Research Background and Motivation

Problem Definition

In high-risk application scenarios, RAG systems should refuse to answer rather than provide incorrect responses. This requires a confidence measure that is strongly correlated with response correctness, allowing the system to mask responses when the confidence score falls below a threshold.

Problem Importance

  1. High-risk domain requirements: In strictly regulated domains such as finance and healthcare, the reputational and financial costs of providing incorrect answers far exceed the costs of not providing answers
  2. Real-time deployment challenges: Existing methods perform poorly on lengthy narrative responses and under the latency requirements of production environments
  3. Sources of uncertainty: The uncertainty at issue is primarily epistemic (insufficient model knowledge) rather than aleatoric (inherent data randomness)

Limitations of Existing Methods

  1. Sampling-based methods: Require multiple generations, introducing excessive computational costs and latency in production environments
  2. Token probability methods: Perform poorly on long responses; a single low-probability word may disproportionately reduce the overall sequence score
  3. Information loss: Token probabilities lose rich internal representation information after linear projection and softmax transformation
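
The second limitation can be illustrated with a small numeric sketch. The per-token probabilities below are invented for illustration and do not come from the paper; the point is only that a single rare token drags down both the joint probability and the length-normalized score of an otherwise confident answer.

```python
import math

def sequence_scores(token_probs):
    """Return (joint probability, geometric-mean probability) of a token sequence."""
    log_sum = sum(math.log(p) for p in token_probs)
    joint = math.exp(log_sum)
    geo_mean = math.exp(log_sum / len(token_probs))
    return joint, geo_mean

# Hypothetical per-token probabilities for a 20-token answer.
confident = [0.95] * 20
one_rare_word = [0.95] * 19 + [0.01]  # identical except for one low-probability token

joint_a, mean_a = sequence_scores(confident)
joint_b, mean_b = sequence_scores(one_rare_word)

print(f"joint: {joint_a:.4f} vs {joint_b:.4f}")
print(f"geometric mean: {mean_a:.3f} vs {mean_b:.3f}")
```

In this toy case one token out of twenty cuts the joint probability by roughly two orders of magnitude, which is the sensitivity the summary attributes to token-probability methods on long responses.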

Core Contributions

  1. Proposes activation-based confidence estimation method: Leverages raw FFN activations as autoregressive signals, avoiding information loss from token logits
  2. Sequence classification framework: Models confidence prediction as a sequence classification task, using an LSTM to process activation sequences
  3. Huber loss regularization: Introduces Huber loss to improve robustness to noisy supervision from the retrieval stage
  4. Production environment validation: Verifies method effectiveness and scalability in real financial customer support scenarios
  5. Efficiency optimization: Demonstrates that using only layer 16 activations significantly reduces latency while maintaining accuracy

Methodology Details

Task Definition

Given input x and generated sequence s, the objective is to estimate a confidence score c that is strongly correlated with response correctness. When c falls below a threshold, the system refuses to display the response.
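
The abstention rule described above can be sketched as a simple gate. This is a minimal illustration, not the authors' implementation: the function name, the default threshold of 0.5, and the fallback message are all assumptions.

```python
def gate_response(response, confidence, threshold=0.5,
                  fallback="I'm not able to answer that reliably."):
    """Show the response only if its confidence reaches the threshold.

    `threshold` and `fallback` are hypothetical defaults chosen for illustration.
    """
    if confidence < threshold:
        return fallback  # mask the response (abstain)
    return response

print(gate_response("Your card ships in 5 days.", confidence=0.91))
print(gate_response("Your card ships in 5 days.", confidence=0.32))
```

Raising the threshold masks more responses in exchange for higher precision on the responses that are shown.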

Model Architecture

Overall Framework

Input sequence construction:

$$x = x_I \oplus x_Q \oplus x_C \oplus s \oplus x_{EOS}$$

where $x_I$ is the instruction, $x_Q$ the question, $x_C$ the context, $s$ the answer, and $x_{EOS}$ the end-of-sequence token.
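
The concatenation can be sketched on token-ID lists. The helper name and the integer IDs are hypothetical; the sketch also assumes that T counts the prompt tokens (instruction, question, context) and L the answer tokens, which is what the indices in the activation equations suggest.

```python
def build_input(instruction_ids, question_ids, context_ids, answer_ids, eos_id):
    """Concatenate the segments in the order instruction, question, context, answer, EOS.

    Returns the full sequence plus the prompt length T and the answer length L.
    """
    prompt = instruction_ids + question_ids + context_ids
    full = prompt + answer_ids + [eos_id]
    return full, len(prompt), len(answer_ids)

# Hypothetical token IDs, for illustration only.
full, T, L = build_input([1, 2], [3], [4, 5, 6], [7, 8], eos_id=0)
print(full, T, L)
```

Keeping T and L alongside the sequence makes it straightforward to locate the answer positions later.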

Activation Extraction

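Extract hidden state activations from Transformer layer ℓ:

$$H_\ell = (h_\ell^{1}, \ldots, h_\ell^{T+L+1})$$

where T is the number of prompt tokens (instruction, question, and context) and L the number of answer tokens, so position T+L+1 is the end-of-sequence token.

Retain only the activations corresponding to the answer portion:

$$S_{in} = (h_\ell^{T+1}, h_\ell^{T+2}, \ldots, h_\ell^{T+L+1})$$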
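
The selection of answer positions amounts to a slice. The sketch below assumes the per-position activation vectors for the chosen layer are already available as a list. How they are obtained from the model is outside the sketch, and the function name is hypothetical.

```python
def answer_activations(layer_activations, prompt_len):
    """Keep positions T+1 .. T+L+1 (1-indexed), i.e. the answer tokens plus EOS.

    `layer_activations` holds one vector per position for a single layer;
    `prompt_len` is T. With 0-indexed lists this is the slice [T:].
    """
    return layer_activations[prompt_len:]

# Toy example: T = 3 prompt positions, L = 2 answer positions, 1 EOS position.
toy = [[float(t)] * 4 for t in range(6)]  # 6 positions, 4-dimensional vectors
s_in = answer_activations(toy, prompt_len=3)
print(len(s_in))  # L + 1 positions
```

Dropping the prompt positions keeps the classifier input proportional to the answer length rather than to the (often much longer) retrieved context.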

Sequence Classifier

Use an LSTM as the sequence classifier g(Sin), which outputs a 2-dimensional logit vector z. The confidence score is the softmax probability of class 1:

$$c = \mathrm{softmax}(z)_1 = \frac{e^{z_1}}{e^{z_0} + e^{z_1}}$$
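
A dependency-free sketch of such a classifier follows. It is an untrained, single-layer LSTM with random weights and no bias terms, followed by a linear head; the class name, dimensions, and initialization are assumptions made for illustration and say nothing about the authors' actual configuration.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

class TinyLSTMClassifier:
    """Single-layer LSTM plus a linear head producing 2 logits (illustrative only)."""

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        def mat(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
        # One weight matrix per gate (i, f, o, g), acting on [x_t, h_{t-1}].
        self.gates = {name: mat(hidden_dim, input_dim + hidden_dim) for name in "ifog"}
        self.head = mat(2, hidden_dim)
        self.hidden_dim = hidden_dim

    def logits(self, sequence):
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        for x in sequence:
            xh = list(x) + h
            i = [sigmoid(v) for v in matvec(self.gates["i"], xh)]
            f = [sigmoid(v) for v in matvec(self.gates["f"], xh)]
            o = [sigmoid(v) for v in matvec(self.gates["o"], xh)]
            g = [math.tanh(v) for v in matvec(self.gates["g"], xh)]
            c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
            h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        return matvec(self.head, h)  # z = (z_0, z_1) from the final hidden state

def confidence_from_logits(z):
    """c = softmax(z)_1 for a 2-dimensional logit vector, computed stably."""
    m = max(z)
    e0, e1 = math.exp(z[0] - m), math.exp(z[1] - m)
    return e1 / (e0 + e1)

model = TinyLSTMClassifier(input_dim=4, hidden_dim=3)
z = model.logits([[0.1, -0.2, 0.3, 0.0]] * 5)
print(confidence_from_logits(z))
```

In practice the weights would be learned from labeled (answer activations, correctness) pairs; the sketch only shows how a variable-length activation sequence is reduced to a single confidence value.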

Training Strategy

Loss Function

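Combines cross-entropy loss with Huber loss regularization:

$$L_{Total} = L_{CE} + \lambda L_{Huber}$$

The Huber loss is defined as:

$$H_\delta(x) = \begin{cases} \frac{1}{2}x^2 & \text{for } |x| \le \delta \\ \delta\left(|x| - \frac{1}{2}\delta\right) & \text{otherwise} \end{cases}$$

Batch-level Huber loss:

$$L_{Huber} = H_\delta\left(\frac{1}{|B|}\sum_{i \in B} c_i - \frac{1}{|B|}\sum_{i \in B} \mathbb{I}(\hat{y}_i = y_i)\right)$$

That is, the Huber penalty is applied to the gap between the mean predicted confidence and the empirical accuracy over batch B.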
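
The combined objective can be sketched in plain Python. The values of λ and δ are not given in this summary, so the defaults below are placeholders; the sketch also assumes the predicted label ŷ is the classifier's argmax, which for a two-way softmax is equivalent to thresholding c at 0.5.

```python
import math

def huber(x, delta=1.0):
    """Huber function: quadratic near zero, linear beyond delta."""
    ax = abs(x)
    if ax <= delta:
        return 0.5 * x * x
    return delta * (ax - 0.5 * delta)

def cross_entropy(confidences, labels, eps=1e-12):
    """Mean binary cross-entropy; label 1 means the response was correct."""
    total = 0.0
    for c, y in zip(confidences, labels):
        c = min(max(c, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(c) + (1 - y) * math.log(1.0 - c)
    return total / len(labels)

def batch_huber(confidences, labels, delta=1.0):
    """Huber penalty on (mean confidence - batch accuracy of the argmax prediction)."""
    mean_conf = sum(confidences) / len(confidences)
    preds = [1 if c >= 0.5 else 0 for c in confidences]  # assumed argmax rule
    accuracy = sum(1 for p, y in zip(preds, labels) if p == y) / len(labels)
    return huber(mean_conf - accuracy, delta)

def total_loss(confidences, labels, lam=0.1, delta=1.0):
    """L_Total = L_CE + lambda * L_Huber (lam and delta are hypothetical values)."""
    return cross_entropy(confidences, labels) + lam * batch_huber(confidences, labels, delta)

confs, labels = [0.9, 0.8, 0.7, 0.6], [1, 1, 0, 1]
print(total_loss(confs, labels))
```

In this toy batch the mean confidence (0.75) matches the batch accuracy (0.75), so the regularizer contributes essentially nothing and the total reduces to the cross-entropy term.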

Technical Innovations

  1. Raw activations vs. token probabilities: Avoids information compression and distortion caused by linear projection and softmax
  2. Autoregressive sequence modeling: Uses an LSTM to capture temporal dependencies in the generation process
  3. Robustness regularization: Huber loss is more robust to noisy labels introduced by retrieval errors
  4. Layer-wise optimization: Determines optimal activation extraction layer through experimentation

Experimental Setup

Dataset

  • Source: Capital One internal financial customer support knowledge base
  • Scale: 8.5k documents, approximately 45k chunks
  • Characteristics: Semi-structured documents with complex hierarchical structures, tables, lists, etc.
  • Annotation: Two-layer validation mechanism through real-time feedback and SME expert assessment

Evaluation Metrics

  • AUROC: Discriminative ability of confidence scores
  • Precision (P): Fraction of displayed (unmasked) responses that are correct
  • Recall (R): Fraction of correct responses that are displayed rather than masked
  • ROUGE-L: Response quality assessment
  • Mask Rate: Proportion of masked responses
  • Latency: Average and P99 response times
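
AUROC has a direct probabilistic reading: it is the probability that a randomly chosen correct response receives a higher confidence score than a randomly chosen incorrect one, with ties counted as one half. A minimal pairwise sketch, using made-up scores and labels, is:

```python
def auroc(confidences, labels):
    """Pairwise AUROC: P(score of a correct response > score of an incorrect one)."""
    pos = [c for c, y in zip(confidences, labels) if y == 1]
    neg = [c for c, y in zip(confidences, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one correct and one incorrect response")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical scores: label 1 = correct response, 0 = incorrect response.
print(auroc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))  # 3 of 4 pairs ordered correctly
```

The quadratic pairwise loop is fine for illustration; rank-based implementations are preferable for large evaluation sets.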

Baseline Methods

  • Vectara (HHEM2.1): Semantic consistency model based on entailment
  • VectaraFT: Fine-tuned version of Vectara
  • Logits-based: Uncertainty model based on token logits

Implementation Details

  • Model: Llama 3.1 8B
  • Activation layers: Layer 16 and Layer 32
  • Context size: Top-1, Top-3, Top-5, Full (Top-7)
  • Inference framework: Hugging Face, vLLM

Experimental Results

Main Results

Method                     AUROC
Vectara                    0.590
VectaraFT                  0.634
Logits-based               0.663
Our Model (no calib.)      0.741
Our Model (with calib.)    0.772

Confidence Threshold Analysis

Threshold    Precision    Recall    ROUGE-L (Shown/Masked)    Mask Rate
0.5          0.95         0.73      0.65/0.57                 29.9%
0.7          0.96         0.65      0.66/0.57                 38.6%
0.9          0.97         0.52      0.67/0.58                 52.0%
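
The trade-off in this table can be reproduced in miniature. The sketch below computes precision, recall, and mask rate at a given threshold, assuming a response is shown when its confidence is at or above the threshold; the scores and labels are invented and unrelated to the paper's data.

```python
def selective_metrics(confidences, labels, threshold):
    """Precision and recall over shown responses, plus the fraction masked."""
    shown = [y for c, y in zip(confidences, labels) if c >= threshold]
    total_correct = sum(labels)
    shown_correct = sum(shown)
    precision = shown_correct / len(shown) if shown else float("nan")
    recall = shown_correct / total_correct if total_correct else float("nan")
    mask_rate = 1.0 - len(shown) / len(labels)
    return precision, recall, mask_rate

# Hypothetical confidences and correctness labels (1 = correct).
confs = [0.95, 0.85, 0.75, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 1, 0]
for t in (0.5, 0.8):
    p, r, m = selective_metrics(confs, labels, t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f} mask_rate={m:.2f}")
```

As in the reported results, a stricter threshold raises precision while lowering recall and increasing the share of masked responses.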

Layer and Context Optimization

Layer 16 vs. Layer 32:

  • Layer 16 significantly reduces latency (approximately 42.5%) while maintaining similar performance
  • Under Full context setting, Layer 16 achieves 0.97 precision with 31.3% mask rate

Latency Analysis:

Framework    Layer    Context    Avg Latency (ms)    P99 Latency (ms)
vLLM         16       Full       127                 267
vLLM         32       Full       206                 354
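
As a worked check on the two vLLM rows (average 127 ms vs. 206 ms, P99 267 ms vs. 354 ms), the relative latency reduction from Layer 32 to Layer 16 can be computed directly. The helper name is hypothetical.

```python
def relative_reduction(baseline_ms, improved_ms):
    """Fractional latency reduction relative to the baseline."""
    return (baseline_ms - improved_ms) / baseline_ms

avg_cut = relative_reduction(206, 127)  # Layer 32 -> Layer 16, average latency
p99_cut = relative_reduction(354, 267)  # Layer 32 -> Layer 16, P99 latency
print(f"average: {avg_cut:.1%}, P99: {p99_cut:.1%}")
```

For this pair of rows the reduction is roughly 38% on average latency and 25% at P99. The approximately 42.5% figure quoted in the layer comparison cannot be derived from these two rows alone and presumably refers to a different configuration or aggregation.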

Ablation Studies

  1. Effect of Huber loss: Improves from 0.741 to 0.772 AUROC
  2. Activation layer selection: Layer 16 performance approaches Layer 32 with lower latency
  3. Context size impact: Larger context improves accuracy but increases latency

Classification of Uncertainty Quantification Methods

  1. Sampling-based methods: Measure consistency through multiple generations, but computationally expensive
  2. Probability-based methods: Utilize token probabilities and semantic entropy, but limited effectiveness on long texts
  3. Classification-based methods: Such as HHEM, avoid multiple generations but are limited to black-box signals (the input and output text)
  4. Activation-based methods: Leverage internal representations, the main contribution direction of this work

Advantages of This Work

  • Compared to sampling methods: Single forward pass with lower latency
  • Compared to probability methods: Preserves complete internal representations with minimal information loss
  • Compared to black-box methods: Leverages white-box access to obtain richer signals

Conclusions and Discussion

Main Conclusions

  1. Effectiveness: Activation-based method significantly outperforms existing baselines with AUROC of 0.772
  2. Practicality: Achieves good balance of 0.95 precision and 29.9% mask rate in production environments
  3. Efficiency: Layer 16 activations substantially reduce latency while maintaining performance
  4. Robustness: Huber loss effectively improves robustness to noisy supervision

Limitations

  1. White-box dependency: Requires access to model internal activations, limiting generalizability
  2. Architecture-specific: Method is customized for specific model architectures; transfer requires reconfiguration
  3. Two-stage processing: Requires additional forward pass to compute confidence scores
  4. Data limitations: Experimental data cannot be made public, affecting reproducibility

Future Directions

  1. End-to-end integration: Directly integrate confidence estimation into the generation process
  2. Architecture-agnostic: Develop universal methods applicable to multiple LLM architectures
  3. Computational optimization: Further reduce computational overhead of confidence estimation
  4. Theoretical analysis: Deepen understanding of theoretical relationships between activation patterns and confidence

In-Depth Evaluation

Strengths

  1. Technical innovation: Systematically exploits raw FFN activations for RAG confidence estimation, avoiding the information loss of token probabilities
  2. Practical value: Validated in real financial scenarios with strong practical orientation
  3. Comprehensive experiments: Thorough ablation studies across multiple dimensions (layers, context, latency)
  4. Engineering considerations: Adequately addresses production environment latency constraints and scalability requirements

Weaknesses

  1. Generalizability limitations: Method depends on white-box access and specific architecture, limiting transferability
  2. Theoretical foundation: Lacks in-depth theoretical analysis of why FFN activations can predict confidence
  3. Data transparency: Proprietary dataset cannot be made public, affecting result verifiability
  4. Limited comparisons: Insufficient comparison with more recent uncertainty quantification methods

Impact

  1. Academic contribution: Provides new technical pathway for trustworthiness research in RAG systems
  2. Industrial value: Offers practical solution for LLM deployment in high-risk domains
  3. Methodological inspiration: Activation-based approach may inspire more research on internal representation utilization

Applicable Scenarios

  1. High-risk domains: Finance, healthcare, law and other scenarios with extremely high accuracy requirements
  2. White-box deployment: Enterprise applications with model internal access
  3. Real-time systems: Scenarios requiring trustworthy responses under strict latency constraints
  4. Professional knowledge bases: RAG applications with structured, specialized knowledge bases

References

This paper cites work from several related fields, among them uncertainty quantification, RAG systems, and activation analysis. Key references include:

  • Azaria and Mitchell (2023): LLM internal states and "lying" detection
  • Bakman et al. (2024): Semantic-based response scoring
  • Bao et al. (2024): HHEM entailment model
  • Dai et al. (2022): Knowledge neurons in pre-trained Transformers

Overall Assessment: This is a technically sound and highly practical paper that proposes an innovative solution to the important problem of confidence estimation in RAG systems. While it has certain limitations in generalizability and theoretical depth, its successful application in real-world scenarios and comprehensive experimental validation make it valuable for both academic and industrial purposes.