Augmenting Compliance-Guaranteed Customer Service Chatbots: Context-Aware Knowledge Expansion with Large Language Models

Hong, Zhang, Jiang et al.
Retrieval-based chatbots leverage human-verified Q&A knowledge to deliver accurate, verifiable responses, making them ideal for customer-centric applications where compliance with regulatory and operational standards is critical. To effectively handle diverse customer inquiries, augmenting the knowledge base with "similar questions" that retain semantic meaning while incorporating varied expressions is a cost-effective strategy. In this paper, we introduce the Similar Question Generation (SQG) task for LLM training and inference, proposing context-aware approaches to enable comprehensive semantic exploration and enhanced alignment with source question-answer relationships. We formulate optimization techniques for constructing in-context prompts and selecting an optimal subset of similar questions to expand chatbot knowledge under budget constraints. Both quantitative and human evaluations validate the effectiveness of these methods, achieving a 92% user satisfaction rate in a deployed chatbot system, reflecting an 18% improvement over the unaugmented baseline. These findings highlight the practical benefits of SQG and emphasize the potential of LLMs, not as direct chatbot interfaces, but in supporting non-generative systems for hallucination-free, compliance-guaranteed applications.

Basic Information

  • Paper ID: 2410.12444
  • Title: Augmenting Compliance-Guaranteed Customer Service Chatbots: Context-Aware Knowledge Expansion with Large Language Models
  • Authors: Mengze Hong, Chen Jason Zhang, Di Jiang, Yuanqin He
  • Classification: cs.CL (Computation and Language)
  • Publication Date: October 2024
  • Institutions: The Hong Kong Polytechnic University, WeBank AI Team
  • Paper Link: https://arxiv.org/abs/2410.12444v3

Abstract

Retrieval-based chatbots leverage human-verified question-answer knowledge bases to provide accurate and verifiable responses, making them well-suited for customer service applications requiring compliance with regulatory and operational standards. To effectively handle diverse customer queries, augmenting the knowledge base by generating "similar questions" that maintain semantic consistency while exhibiting expression diversity is a cost-effective strategy. This paper introduces the Similar Question Generation (SQG) task for large language model training and inference, proposing context-aware approaches to achieve comprehensive semantic exploration and enhanced alignment with source question-answer relationships. The research establishes optimization techniques for constructing context prompts and selecting optimal similar question subsets under budget constraints. Quantitative and human evaluations validate the effectiveness of these approaches, achieving 92% user satisfaction in a deployed chatbot system, representing an 18% improvement over the unaugmented baseline.

Research Background and Motivation

Problem Definition

  1. Core Issue: Traditional retrieval-based customer service chatbots suffer from matching failures when handling customer queries with diverse expressions, resulting in poor user experience
  2. Application Scenario Importance: In highly regulated industries such as finance and healthcare, generative large language models are prone to hallucinations and cannot meet compliance requirements
  3. Limitations of Existing Methods:
    • Manual crowdsourcing is costly and offers limited diversity
    • Pre-trained paraphrase models (e.g., SimBERT, RoFormer-Sim) lack context-awareness capabilities
    • Standard sequence-to-sequence approaches struggle to produce diverse questions

Research Motivation

This research aims to leverage the generative capabilities of large language models to augment knowledge bases for retrieval-based chatbots rather than using them as direct dialogue interfaces, thereby improving query matching performance while ensuring compliance.

Core Contributions

  1. First Definition of SQG Task: Formulates the Similar Question Generation task for retrieval-based service chatbot enhancement, proposing a context-aware one-to-many generation paradigm
  2. Optimization Framework: Proposes optimization techniques under budget constraints for selecting prompt examples and similar question subsets to facilitate knowledge base expansion
  3. Significant Performance Improvements: Experiments demonstrate over 120% relative improvement in qualitative evaluation, 4.74% overall diversity improvement, and 18% user satisfaction increase
  4. Real-World Deployment Verification: Validates method effectiveness through deployment and verification in an actual banking customer service system

Methodology Details

Task Definition

Similar Question Generation (SQG) aims to create diverse yet semantically consistent question sets for specific answers in the knowledge base. Key requirements include:

  • Semantic Consistency: Preserving original intent and meaning
  • Syntactic Diversity: Variations in wording and structure

Model Architecture

1. Context-Aware Batch Generation

Traditional one-to-one paradigm → one-to-many paradigm
Input: Source question
Output: K similar questions

Training objective extends from single question pairs to batch generation:

L_ft = -∑_j ∑_i log(P_Φ(q_j|q_i))
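
The shift from one-to-one to one-to-many training data can be sketched as follows. This is a minimal illustration, not the authors' data pipeline: the function names, the newline separator, and the third example question are assumptions introduced here.

```python
from typing import List, Tuple

SEPARATOR = "\n"  # assumption: one target question per line


def one_to_one_pairs(cluster: List[str]) -> List[Tuple[str, str]]:
    """Traditional paradigm: each ordered pair (q_i, q_j), i != j, is one example."""
    return [
        (q_i, q_j)
        for i, q_i in enumerate(cluster)
        for j, q_j in enumerate(cluster)
        if i != j
    ]


def one_to_many_instances(cluster: List[str], k: int) -> List[Tuple[str, str]]:
    """Batch paradigm: each source question maps to up to k targets in one sequence."""
    instances = []
    for i, source in enumerate(cluster):
        targets = (cluster[:i] + cluster[i + 1:])[:k]
        instances.append((source, SEPARATOR.join(targets)))
    return instances


# A cluster of questions that share one answer (last entry is hypothetical).
cluster = [
    "How long does it take to issue a certificate?",
    "How long does certificate issuance take?",
    "When will my certificate be ready?",
]
pairs = one_to_one_pairs(cluster)
batches = one_to_many_instances(cluster, k=2)
```

With three questions, the one-to-one view yields six short examples, while the one-to-many view yields three examples whose targets each contain two questions, so later targets are generated with earlier ones in context.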

2. Intent-Enhanced Batch Generation

By introducing source answers as contextual prior knowledge:

Input: (source question, source answer)
Output: {similar question 1, ..., similar question K}

Refined training objective:

L_Intention = ∑_i ∑_j ∑_{l=1}^L L_{j+l}(q_i, a)

where generation of each target question is based on the original question-answer pair and previously generated similar questions.
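
A rough sketch of what each target question is conditioned on under this scheme is given below. The prompt template, the function names, and the sample answer text are hypothetical; the summary does not specify the actual prompt format.

```python
from typing import List, Tuple


def build_prompt(question: str, answer: str, k: int) -> str:
    """Hypothetical prompt template that exposes the source answer as context."""
    return (
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        f"Write {k} similar questions, one per line:\n"
    )


def stepwise_contexts(prompt: str, targets: List[str]) -> List[Tuple[str, str]]:
    """Pair each target with the context an autoregressive model conditions on:
    the prompt plus all previously generated similar questions."""
    steps = []
    context = prompt
    for target in targets:
        steps.append((context, target))
        context += target + "\n"
    return steps


# The answer text below is invented purely for illustration.
prompt = build_prompt(
    "How long does it take to issue a certificate?",
    "Electronic certificates are issued within one business day.",
    k=2,
)
steps = stepwise_contexts(
    prompt,
    ["How long does certificate issuance take?",
     "Can I issue an electronic certificate today?"],
)
```

The second step's context already contains the first generated question, which is how earlier outputs can steer later ones away from repetition.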

Optimization Framework

1. Dynamic Example Selection Algorithm (QSM)

Objective function:

arg max_{P⊆D,|P|=K} [∑_{i=1}^K S(q_s, q_{p_i}) + α/K ∑_{i≠j} dist(q_{p_i}, q_{p_j})]

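The objective balances relevance and diversity: q_s is the source question, D is the pool of candidate examples, S is cosine similarity, dist is Euclidean distance, and α weights the diversity term.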
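
One simple way to approximately maximize an objective of this shape is greedy forward selection over embedding vectors. The sketch below is an assumption-laden illustration: the summary does not say how the authors solve the selection problem, and the toy 2-D vectors stand in for real sentence embeddings.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def euclidean(u: Vector, v: Vector) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def objective(source: Vector, pool: List[Vector], chosen: List[int],
              k: int, alpha: float) -> float:
    """Relevance to the source plus (alpha / k) times the sum of distances
    over ordered pairs i != j, mirroring the formula above."""
    relevance = sum(cosine(source, pool[i]) for i in chosen)
    diversity = sum(
        euclidean(pool[i], pool[j]) for i in chosen for j in chosen if i != j
    )
    return relevance + alpha / k * diversity


def greedy_select(source: Vector, pool: List[Vector], k: int, alpha: float) -> List[int]:
    """Greedy heuristic: repeatedly add the example that most increases the objective."""
    chosen: List[int] = []
    while len(chosen) < min(k, len(pool)):
        candidates = [i for i in range(len(pool)) if i not in chosen]
        best = max(
            candidates,
            key=lambda i: objective(source, pool, chosen + [i], k, alpha),
        )
        chosen.append(best)
    return chosen


source = [1.0, 0.0]
pool = [[1.0, 0.1], [1.0, 0.12], [0.5, 0.5], [0.0, 1.0]]
relevance_only = greedy_select(source, pool, k=2, alpha=0.0)
diversity_heavy = greedy_select(source, pool, k=2, alpha=10.0)
```

With α = 0 the two near-duplicate examples closest to the source are chosen; with a large α the second pick moves to the example farthest from the first.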

2. Similar Question Subset Selection

Constrained optimization problem:

max_{S⊆Q*} ∑_{q_a,q_b∈S, q_a≠q_b} dist(q_a, q_b)
s.t. ∑_{q∈S} cost(q) ≤ B

The authors prove that this problem is NP-hard and that the objective function is submodular, and on that basis propose a greedy algorithm with a (1 - 1/e) approximation guarantee.
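
A cost-aware greedy selection for this kind of problem might look like the sketch below. It is a generic gain-per-cost heuristic written for illustration; it is not claimed to reproduce the paper's algorithm or its approximation guarantee, and the distance matrix, costs, and function names are assumptions.

```python
from typing import List


def pairwise_diversity(dist: List[List[float]], selected: List[int]) -> float:
    """Sum of distances over unordered pairs of selected questions."""
    return sum(
        dist[a][b]
        for idx, a in enumerate(selected)
        for b in selected[idx + 1:]
    )


def greedy_budgeted_subset(dist: List[List[float]], cost: List[float],
                           budget: float) -> List[int]:
    """Repeatedly add the affordable question with the best marginal
    diversity gain per unit cost. Costs are assumed to be positive.
    The first pick is arbitrary (lowest index) because a single
    question has no pairwise diversity yet."""
    selected: List[int] = []
    spent = 0.0
    remaining = set(range(len(cost)))
    while remaining:
        best, best_score = None, -1.0
        for c in sorted(remaining):
            if spent + cost[c] > budget:
                continue  # cannot afford this question
            gain = sum(dist[c][s] for s in selected)
            score = gain / cost[c]
            if score > best_score:
                best, best_score = c, score
        if best is None:
            break  # nothing affordable is left
        selected.append(best)
        spent += cost[best]
        remaining.remove(best)
    return selected


# Toy example: questions embedded as points on a line, unit cost each.
points = [0.0, 1.0, 10.0, 10.5]
dist = [[abs(a - b) for b in points] for a in points]
cost = [1.0, 1.0, 1.0, 1.0]
chosen = greedy_budgeted_subset(dist, cost, budget=2.0)
```

Here the budget allows two questions, and the heuristic pairs the first point with the one farthest from it.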

Technical Innovations

  1. Autoregressive Context Guidance: Leveraging the LLM's autoregressive nature by using previously generated questions as context for subsequent generation
  2. Intent-Aware Generation: Extending semantic exploration space by incorporating source answers
  3. Budget-Constrained Optimization: Providing flexible resource management mechanisms adaptable to different deployment scenarios

Experimental Setup

Datasets

  • Primary Dataset: 3000+ Chinese question-answer pairs from financial industry customer service chatbots
  • Training Set: 90,000 instances
  • Test Set: 90 unseen question-answer pairs with average 45 reference questions
  • Human Evaluation: 15 new questions for real-world use case assessment

Evaluation Metrics

Semantic Relevance

  • Precision: Maximum BERTScore between generated and reference questions
  • Recall: Maximum BERTScore between reference and generated questions
  • F1 Score: Harmonic mean of precision and recall
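
The max-matching structure of these metrics can be sketched with any pairwise similarity function. In the example below a character-set Jaccard score stands in for BERTScore (which needs a pretrained model), and averaging the per-question maxima is an assumption; the summary states only that the maximum score is used.

```python
from statistics import mean
from typing import Callable, List, Tuple


def char_jaccard(a: str, b: str) -> float:
    """Stand-in similarity (NOT BERTScore): Jaccard overlap of character sets."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def precision_recall_f1(generated: List[str], references: List[str],
                        sim: Callable[[str, str], float]) -> Tuple[float, float, float]:
    # Precision: how well each generated question is covered by some reference.
    precision = mean(max(sim(g, r) for r in references) for g in generated)
    # Recall: how well each reference is covered by some generated question.
    recall = mean(max(sim(r, g) for g in generated) for r in references)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


p, r, f1 = precision_recall_f1(["abc"], ["abc", "xyz"], char_jaccard)
```

In this toy case the single generated string matches one reference exactly (precision 1.0) but leaves the other uncovered (recall 0.5).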

Character-Level Diversity

  • Distinct-N: Proportion of unique N-grams in generated questions
  • Distinct-Avg: Average of Distinct-1 and Distinct-2
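
A character-level Distinct-N computation can be sketched as below. Pooling n-grams across all generated questions for one source is an assumption; the summary does not state whether scores are pooled or averaged per question.

```python
from typing import List


def distinct_n(questions: List[str], n: int) -> float:
    """Share of unique character n-grams among all n-grams in the questions."""
    ngrams = [q[i:i + n] for q in questions for i in range(len(q) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def distinct_avg(questions: List[str]) -> float:
    """Average of Distinct-1 and Distinct-2."""
    return (distinct_n(questions, 1) + distinct_n(questions, 2)) / 2


d1 = distinct_n(["abab"], 1)  # unigrams a, b, a, b: 2 unique out of 4
d2 = distinct_n(["abab"], 2)  # bigrams ab, ba, ab: 2 unique out of 3
```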

Qualitative Evaluation

Acceptance rate evaluated by 5 industry experts based on semantic consistency and syntactic diversity criteria.

Comparison Methods

  • SimBERT, RoFormer-Sim (pre-trained similar-sentence generation models)
  • ChatGLM2 zero-shot and few-shot learning
  • ChatGLM2 fine-tuning (one-to-one objective)

Implementation Details

  • Base Model: ChatGLM2-6B
  • Hardware: NVIDIA A100 GPU
  • Training Approach: Full-parameter fine-tuning
  • Generation Quantity: L=20

Experimental Results

Main Results

Method                        Precision  Recall  F1 Score  Distinct-Avg  Acceptance Rate
SimBERT                       0.8622     0.7744  0.8160    0.1562        18.3%
RoFormer-Sim                  0.8574     0.7704  0.8115    0.2073        20.0%
ChatGLM2-FT                   0.8576     0.8141  0.8352    0.2910        37.9%
Context-Aware                 0.8628     0.8377  0.8505    0.2800        45.0%
Intention-Enhanced            0.8622     0.8390  0.8504    0.2718        84.0%
+ Dynamic Example Selection   0.8612     0.8527  0.8569    0.2866        82.0%

Key Findings

  1. Significant Intent Enhancement Effect: The intent-enhanced method achieves an 84% acceptance rate in human evaluation, a 121.64% relative improvement over the strongest baseline, fine-tuned ChatGLM2 (37.9%)
  2. Scale Effects: The proposed method maintains stable precision as the number of generated questions increases, while baseline methods show significant degradation
  3. Real Deployment Performance: Achieves 92% user satisfaction in actual banking applications, representing an 18% improvement over unaugmented baseline
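
The 121.64% figure is consistent with a relative improvement of the 84.0% acceptance rate over the 37.9% reported for ChatGLM2-FT in the results table. Which baseline the authors used is inferred here from the arithmetic rather than stated in the summary.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Relative improvement in percent: (new - baseline) / baseline * 100."""
    return (new - baseline) / baseline * 100


# Acceptance rates taken from the results table.
gain = round(relative_improvement(84.0, 37.9), 2)  # 121.64
```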

Ablation Studies

Impact of Generation Quantity on Performance

  • Intent-enhanced method maintains high precision when generating 100 questions
  • Recall improves from 0.82 to 0.89
  • Generating just 10 questions surpasses baseline performance with 100 generated questions

Selection Algorithm Effectiveness

Greedy selection algorithm shows significant diversity improvements over random selection:

  • Selecting 5 from 20 questions: diversity improves from 4.37 to 5.15
  • Selecting 10 from 20 questions: diversity improves from 20.14 to 22.31

Case Analysis

Using a query about certificate processing time as an example:

Source Question: How long does it take to issue a certificate?

SimBERT Generation:

  • High precision: How long does it take to issue a certificate?
  • Low precision: How do I issue a company certificate? (off-topic)

Intent-Enhanced Generation:

  • High precision: How long does certificate issuance take?
  • Low precision: Can I issue an electronic certificate today? (shows that the model picked up the "electronic certificate" concept from the source answer)

Related Work

Data Augmentation Methods

  1. Traditional Methods: Manual crowdsourcing, rule-based automation
  2. Deep Learning Methods: SimBERT, RoFormer-Sim and other pre-trained models
  3. Large Language Models: Data augmentation through prompting and fine-tuning

Retrieval-Based Chatbots

  1. Matching-Response Framework: Using human-verified question-answer pairs to ensure accuracy
  2. Query Matching Optimization: Improving matching performance through knowledge base expansion
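
To make the matching-response idea concrete, the sketch below shows how adding a similar question to the knowledge base can turn a failed match into a successful one. It is a toy illustration: `difflib.SequenceMatcher` stands in for whatever matching model a production system uses, and the threshold, answer ID, and added question are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple


def similarity(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1]; a real system would use a learned matcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match(query: str, knowledge_base: List[Tuple[str, str]],
          threshold: float) -> Optional[str]:
    """Return the answer ID of the best-matching stored question,
    or None when no question is similar enough (fallback case)."""
    best_answer, best_score = None, 0.0
    for question, answer_id in knowledge_base:
        score = similarity(query, question)
        if score > best_score:
            best_answer, best_score = answer_id, score
    return best_answer if best_score >= threshold else None


base = [("How long does it take to issue a certificate?", "A1")]
# Augmented knowledge base: a similar question mapped to the same verified answer.
augmented = base + [("When will my certificate be ready?", "A1")]

query = "When will my certificate be ready?"
before = match(query, base, threshold=0.9)
after = match(query, augmented, threshold=0.9)
```

The response itself never changes: both entries point to the same human-verified answer, which is what keeps the system compliant while coverage of phrasings grows.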

Contributions of This Work

Compared to existing work, this paper is the first to systematically apply large language models to knowledge base augmentation for retrieval-based chatbots, proposing specialized training objectives and optimization frameworks.

Conclusions and Discussion

Main Conclusions

  1. Method Effectiveness: The context-aware one-to-many generation paradigm significantly outperforms traditional methods
  2. Importance of Intent Guidance: Incorporating source answers as context substantially improves generation quality and diversity
  3. Practical Value: Validates commercial value through real-world deployment
  4. New Role for LLMs: Demonstrates the potential of LLMs as auxiliary tools rather than direct interfaces

Limitations

  1. Monolingual Assumption: The current method assumes customer queries are monolingual and does not consider multilingual or code-switching scenarios
  2. Evaluation Cost: Human evaluation is costly and scales poorly
  3. Domain Dependency: The method is validated in a single domain (finance); its generalization capability requires further verification

Future Directions

  1. Multilingual Support: Extension to multilingual and cross-lingual scenarios
  2. LLM-Based Evaluation: Using LLM-as-a-judge to replace human evaluation
  3. Larger-Scale Validation: Verifying method effectiveness across more domains and scenarios

In-Depth Evaluation

Strengths

  1. Clear Problem Definition: First systematic definition of the SQG task, filling a research gap
  2. Strong Method Innovation:
    • One-to-many generation paradigm effectively leverages the LLM's autoregressive nature
    • Intent-enhancement design is ingenious, significantly improving generation quality
    • Optimization framework considers practical deployment constraints
  3. Comprehensive Experiments:
    • Multi-dimensional evaluation metrics
    • Real dataset validation
    • Real-world deployment performance verification
  4. High Practical Value: Addresses pain points in highly regulated industries

Limitations

  1. Insufficient Theoretical Analysis: Lacks a deeper theoretical explanation of why the one-to-many paradigm is more effective
  2. Dataset Limitations: Primarily validated in the Chinese financial domain; cross-lingual and cross-domain generalization is not sufficiently verified
  3. Computational Cost Analysis: Lacks a detailed analysis of training and inference costs
  4. Long-Term Effects Unknown: Does not track the effects of long-term deployment

Impact

  1. Academic Contribution: Provides new insights for LLM application in retrieval-based systems
  2. Industrial Value: Offers practical solutions for customer service scenarios with high compliance requirements
  3. Method Reproducibility: Provides detailed implementation details and algorithm descriptions

Applicable Scenarios

  1. High-Compliance Industries: Finance, healthcare, legal and other domains requiring accuracy guarantees
  2. Multilingual Customer Service: Potentially extendable to multilingual customer support systems, although the paper itself assumes monolingual queries
  3. Knowledge Base Maintenance: Scenarios requiring efficient expansion and maintenance of question-answer knowledge bases
  4. Retrieval-Augmented Systems: Various retrieval systems needing improved query matching performance

Overall Assessment: This is a high-quality applied research paper that achieves a good balance between theoretical innovation and practical value. The method design is sound, the experimental validation is comprehensive, and deployment in a real business environment strengthens the paper's persuasiveness. It holds significant reference value for AI applications requiring compliance guarantees.