Augmenting Compliance-Guaranteed Customer Service Chatbots: Context-Aware Knowledge Expansion with Large Language Models

Hong, Zhang, Jiang et al.

Retrieval-based chatbots leverage human-verified Q&A knowledge to deliver accurate, verifiable responses, making them ideal for customer-centric applications where compliance with regulatory and operational standards is critical. To effectively handle diverse customer inquiries, augmenting the knowledge base with "similar questions" that retain semantic meaning while incorporating varied expressions is a cost-effective strategy. In this paper, we introduce the Similar Question Generation (SQG) task for LLM training and inference, proposing context-aware approaches to enable comprehensive semantic exploration and enhanced alignment with source question-answer relationships. We formulate optimization techniques for constructing in-context prompts and selecting an optimal subset of similar questions to expand chatbot knowledge under budget constraints. Both quantitative and human evaluations validate the effectiveness of these methods, achieving a 92% user satisfaction rate in a deployed chatbot system, reflecting an 18% improvement over the unaugmented baseline. These findings highlight the practical benefits of SQG and emphasize the potential of LLMs, not as direct chatbot interfaces, but in supporting non-generative systems for hallucination-free, compliance-guaranteed applications.

Basic Information

Paper ID: 2410.12444
Title: Augmenting Compliance-Guaranteed Customer Service Chatbots: Context-Aware Knowledge Expansion with Large Language Models
Authors: Mengze Hong, Chen Jason Zhang, Di Jiang, Yuanqin He
Classification: cs.CL (Computation and Language)
Publication Date: October 2024
Institutions: The Hong Kong Polytechnic University, WeBank AI Team
Paper Link: https://arxiv.org/abs/2410.12444v3

Abstract

Retrieval-based chatbots leverage human-verified question-answer knowledge bases to provide accurate and verifiable responses, making them well-suited for customer service applications requiring compliance with regulatory and operational standards. To effectively handle diverse customer queries, augmenting the knowledge base by generating "similar questions" that maintain semantic consistency while exhibiting expression diversity is a cost-effective strategy. This paper introduces the Similar Question Generation (SQG) task for large language model training and inference, proposing context-aware approaches to achieve comprehensive semantic exploration and enhanced alignment with source question-answer relationships. The research establishes optimization techniques for constructing context prompts and selecting optimal similar question subsets under budget constraints. Quantitative and human evaluations validate the effectiveness of these approaches, achieving 92% user satisfaction in a deployed chatbot system, representing an 18% improvement over the unaugmented baseline.

Research Background and Motivation

Problem Definition

Core Issue: Traditional retrieval-based customer service chatbots suffer from matching failures when handling customer queries with diverse expressions, resulting in poor user experience
Application Scenario Importance: In highly regulated industries such as finance and healthcare, generative large language models are prone to hallucinations and cannot meet compliance requirements
Limitations of Existing Methods:
- Manual crowdsourcing is costly and offers limited diversity
- Pre-trained paraphrase models (e.g., SimBERT, RoFormer-Sim) lack context awareness
- Standard sequence-to-sequence approaches struggle to produce diverse questions

Research Motivation

This research aims to leverage the generative capabilities of large language models to augment knowledge bases for retrieval-based chatbots rather than using them as direct dialogue interfaces, thereby improving query matching performance while ensuring compliance.

Core Contributions

First Definition of SQG Task: Formulates the Similar Question Generation task for retrieval-based service chatbot enhancement, proposing a context-aware one-to-many generation paradigm
Optimization Framework: Proposes optimization techniques under budget constraints for selecting prompt examples and similar question subsets to facilitate knowledge base expansion
Significant Performance Improvements: Experiments demonstrate a relative improvement of over 120% in expert acceptance rate (the qualitative evaluation), a 4.74% overall diversity improvement, and an 18% increase in user satisfaction
Real-World Deployment Verification: Validates method effectiveness through deployment and verification in an actual banking customer service system

Methodology Details

Task Definition

Similar Question Generation (SQG) aims to create diverse yet semantically consistent question sets for specific answers in the knowledge base. Key requirements include:

Semantic Consistency: Preserving original intent and meaning
Syntactic Diversity: Variations in wording and structure

Model Architecture

1. Context-Aware Batch Generation

Traditional one-to-one paradigm → one-to-many paradigm
Input: Source question
Output: K similar questions

Training objective extends from single question pairs to batch generation:

L_ft = -∑_j ∑_i log(P_Φ(q_j|q_i))

2. Intent-Enhanced Batch Generation

By introducing source answers as contextual prior knowledge:

Input: (source question, source answer)
Output: {similar question 1, ..., similar question K}

Refined training objective:

L_Intention = ∑_i ∑_j ∑_{l=1}^L L_{j+l}(q_i, a)

where generation of each target question is based on the original question-answer pair and previously generated similar questions.
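
To make the input and output format concrete, the following minimal sketch shows how one intent-enhanced training instance might be serialized. The prompt wording, the function name build_sqg_example, the numbered-list target format, and the sample answer text are all illustrative assumptions; the paper's actual template is not reproduced here.

```python
from typing import Dict, List


def build_sqg_example(question: str, answer: str, similar: List[str]) -> Dict[str, str]:
    """Serialize one (question, answer) -> K similar questions instance.

    The template is hypothetical and only illustrates the one-to-many format.
    """
    prompt = (
        "Source question: " + question + "\n"
        "Source answer: " + answer + "\n"
        f"Generate {len(similar)} similar questions:\n"
    )
    # All K targets go into a single sequence, so that under autoregressive
    # decoding each question is conditioned on the source pair and on the
    # similar questions generated before it.
    target = "\n".join(f"{i}. {q}" for i, q in enumerate(similar, start=1))
    return {"prompt": prompt, "target": target}


example = build_sqg_example(
    "How long does it take to issue a certificate?",
    "Electronic certificates are issued immediately.",  # made-up answer text
    [
        "How long does certificate issuance take?",
        "Can I issue an electronic certificate today?",
    ],
)
print(example["prompt"] + example["target"])
```

Placing every target in one sequence is what distinguishes this format from the one-to-one setup, where each (source, target) pair would be a separate training instance.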

Optimization Framework

1. Dynamic Example Selection Algorithm (QSM)

Objective function:

arg max_{P⊆D,|P|=K} [∑_{i=1}^K S(q_s, q_{p_i}) + α/K ∑_{i≠j} dist(q_{p_i}, q_{p_j})]

The objective balances relevance to the source question against diversity among the selected examples, where S is cosine similarity, dist is Euclidean distance, and α weights the diversity term.
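
As a hedged illustration of this objective, the sketch below scores every size-K subset of a small example pool and returns the best one. Exhaustive search is used only because it is the simplest correct way to evaluate the formula; how the paper actually solves the selection is not stated here. The names qsm_score and select_examples and the toy 2-D embeddings are assumptions.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def euclidean(u: Vector, v: Vector) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def qsm_score(source: Vector, chosen: List[Vector], alpha: float) -> float:
    """Relevance to the source plus (alpha / K) times pairwise diversity."""
    k = len(chosen)
    relevance = sum(cosine(source, p) for p in chosen)
    # The sum over i != j counts each unordered pair twice.
    diversity = 2.0 * sum(euclidean(a, b) for a, b in combinations(chosen, 2))
    return relevance + (alpha / k) * diversity


def select_examples(source: Vector, pool: List[Vector], k: int, alpha: float) -> Tuple[int, ...]:
    """Score every size-k subset of the pool; feasible only for small pools."""
    return max(
        combinations(range(len(pool)), k),
        key=lambda idx: qsm_score(source, [pool[i] for i in idx], alpha),
    )


source = [1.0, 0.0]
pool = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(select_examples(source, pool, k=2, alpha=0.0))  # (0, 1): relevance only
print(select_examples(source, pool, k=2, alpha=1.0))  # (0, 2): diversity matters
```

With α = 0 the two examples closest to the source win; raising α trades one of them for a more distant example, which is the relevance-diversity balance the objective encodes.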

2. Similar Question Subset Selection

Constrained optimization problem:

max_{S⊆Q*} ∑_{q_a,q_b∈S, q_a≠q_b} dist(q_a, q_b)
s.t. ∑_{q∈S} cost(q) ≤ B

The paper proves that this problem is NP-hard and that the objective function is submodular, and on that basis proposes a greedy algorithm with a (1 - 1/e) approximation guarantee.
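
The sketch below shows one generic way a budget-constrained greedy selection could look: at each step it adds the affordable candidate with the largest marginal diversity gain per unit cost. It is an assumption-laden illustration and makes no claim to reproduce the paper's algorithm or its approximation analysis. The function name greedy_select, the strictly positive costs, and the tie-breaking rule (lowest index, which also decides the first pick when all gains are zero) are choices made for this example.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def euclidean(u: Vector, v: Vector) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def greedy_select(embeddings: List[Vector], costs: List[float], budget: float) -> List[int]:
    """Greedily pick indices that maximize pairwise distance within the budget.

    Costs are assumed to be strictly positive.
    """
    selected: List[int] = []
    spent = 0.0
    remaining = set(range(len(embeddings)))
    while True:
        best, best_ratio = None, -1.0
        for i in sorted(remaining):
            if spent + costs[i] > budget:
                continue  # candidate does not fit in the remaining budget
            # Marginal gain: total distance from the candidate to the chosen set.
            gain = sum(euclidean(embeddings[i], embeddings[j]) for j in selected)
            ratio = gain / costs[i]
            if ratio > best_ratio:
                best, best_ratio = i, ratio
        if best is None:
            return selected  # nothing affordable is left
        selected.append(best)
        spent += costs[best]
        remaining.remove(best)


points = [[0.0], [0.1], [1.0], [0.9]]
print(greedy_select(points, costs=[1.0, 1.0, 1.0, 1.0], budget=2.0))  # [0, 2]
```

In the toy run, the first pick is arbitrary (index 0), and the second pick is the point farthest from it, so the near-duplicate at index 1 is skipped.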

Technical Innovations

Autoregressive Context Guidance: Leveraging LLM's autoregressive nature by using previously generated questions as context for subsequent generation
Intent-Aware Generation: Extending semantic exploration space by incorporating source answers
Budget-Constrained Optimization: Providing flexible resource management mechanisms adaptable to different deployment scenarios

Experimental Setup

Datasets

Primary Dataset: 3000+ Chinese question-answer pairs from financial industry customer service chatbots
Training Set: 90,000 instances
Test Set: 90 unseen question-answer pairs, each with an average of 45 reference questions
Human Evaluation: 15 new questions for real-world use case assessment

Evaluation Metrics

Semantic Relevance

Precision: Maximum BERTScore between generated and reference questions
Recall: Maximum BERTScore between reference and generated questions
F1 Score: Harmonic mean of precision and recall
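
A minimal sketch of these max-match metrics follows, assuming the pairwise similarity scores (for example BERTScore values) have already been computed and that per-question maxima are averaged over the set. The averaging step, the function name max_match_scores, and the toy matrix are assumptions, since the summary only specifies the maximum.

```python
from typing import List, Tuple


def max_match_scores(sim: List[List[float]]) -> Tuple[float, float, float]:
    """sim[i][j] is the similarity between generated question i and reference j."""
    # Precision: each generated question is matched to its best reference.
    precision = sum(max(row) for row in sim) / len(sim)
    # Recall: each reference is matched to its best generated question.
    n_ref = len(sim[0])
    recall = sum(max(row[j] for row in sim) for j in range(n_ref)) / n_ref
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Two generated questions that both resemble only the first of three references.
sim = [
    [0.9, 0.2, 0.1],
    [0.8, 0.3, 0.2],
]
precision, recall, f1 = max_match_scores(sim)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```

The toy matrix shows why both directions are reported: generated questions that cluster around one reference score high on precision but low on recall, because the other references are left uncovered.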

Character-Level Diversity

Distinct-N: Proportion of unique N-grams in generated questions
Distinct-Avg: Average of Distinct-1 and Distinct-2
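
The character-level diversity metrics can be sketched as follows, assuming n-grams are pooled over the whole set of generated questions before counting unique ones. The pooling choice and the function names are assumptions for illustration.

```python
from typing import List


def distinct_n(questions: List[str], n: int) -> float:
    """Share of unique character n-grams among all n-grams in the set."""
    ngrams = [q[i:i + n] for q in questions for i in range(len(q) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def distinct_avg(questions: List[str]) -> float:
    """Average of Distinct-1 and Distinct-2."""
    return (distinct_n(questions, 1) + distinct_n(questions, 2)) / 2


toy = ["abab", "abcd"]
print(distinct_n(toy, 1))  # 4 unique characters out of 8
print(distinct_n(toy, 2))  # 4 unique bigrams out of 6
print(distinct_avg(toy))
```

Operating on characters rather than whitespace-separated tokens suits the Chinese dataset described above, where words are not delimited by spaces.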

Qualitative Evaluation

Acceptance rate evaluated by 5 industry experts based on semantic consistency and syntactic diversity criteria.

Comparison Methods

SimBERT, RoFormer-Sim (pre-trained paraphrase models)
ChatGLM2 zero-shot and few-shot learning
ChatGLM2 fine-tuning (one-to-one objective)

Implementation Details

Base Model: ChatGLM2-6B
Hardware: NVIDIA A100 GPU
Training Approach: Full-parameter fine-tuning
Generation Quantity: L=20

Experimental Results

Main Results

Method	Precision	Recall	F1 Score	Distinct-Avg	Acceptance Rate
SimBERT	0.8622	0.7744	0.8160	0.1562	18.3%
RoFormer-Sim	0.8574	0.7704	0.8115	0.2073	20.0%
ChatGLM2-FT	0.8576	0.8141	0.8352	0.2910	37.9%
Context-Aware	0.8628	0.8377	0.8505	0.2800	45.0%
Intention-Enhanced	0.8622	0.8390	0.8504	0.2718	84.0%
+ Dynamic Example Selection	0.8612	0.8527	0.8569	0.2866	82.0%

Key Findings

Significant Intent Enhancement Effect: The intent-enhanced method achieves an 84% acceptance rate in human evaluation, a 121.64% relative improvement over the 37.9% of the fine-tuned ChatGLM2 baseline
Scale Effects: The proposed method maintains stable precision as the number of generated questions increases, while baseline methods show significant degradation
Real Deployment Performance: Achieves 92% user satisfaction in actual banking applications, representing an 18% improvement over unaugmented baseline

Ablation Studies

Impact of Generation Quantity on Performance

Intent-enhanced method maintains high precision when generating 100 questions
Recall improves from 0.82 to 0.89
Generating just 10 questions surpasses baseline performance with 100 generated questions

Selection Algorithm Effectiveness

Greedy selection algorithm shows significant diversity improvements over random selection:

Selecting 5 from 20 questions: diversity improves from 4.37 to 5.15
Selecting 10 from 20 questions: diversity improves from 20.14 to 22.31

Case Analysis

Using a query about certificate processing time as an example:

Source Question: How long does it take to issue a certificate?

SimBERT Generation:

High precision: How long does it take to issue a certificate?
Low precision: How do I issue a company certificate? (off-topic)

Intent-Enhanced Generation:

High precision: How long does certificate issuance take?
Low precision: Can I issue an electronic certificate today? (demonstrates learning "electronic certificate" concept from the answer)

Related Work

Data Augmentation Methods

Traditional Methods: Manual crowdsourcing, rule-based automation
Deep Learning Methods: SimBERT, RoFormer-Sim and other pre-trained models
Large Language Models: Data augmentation through prompting and fine-tuning

Retrieval-Based Chatbots

Matching-Response Framework: Using human-verified question-answer pairs to ensure accuracy
Query Matching Optimization: Improving matching performance through knowledge base expansion

Contributions of This Work

Compared to existing work, this paper is the first to systematically apply large language models to knowledge base augmentation for retrieval-based chatbots, proposing specialized training objectives and optimization frameworks.

Conclusions and Discussion

Main Conclusions

Method Effectiveness: The context-aware one-to-many generation paradigm significantly outperforms traditional methods
Importance of Intent Guidance: Incorporating source answers as context substantially improves generation quality and diversity
Practical Value: Validates commercial value through real-world deployment
New Role for LLMs: Demonstrates the potential of LLMs as auxiliary tools rather than direct interfaces

Limitations

Monolingual Assumption: The current method assumes customer queries are monolingual and does not consider multilingual or code-switching scenarios
Evaluation Cost: Human evaluation is costly and scales poorly
Domain Dependency: The method is validated in a single domain (finance); its generalization capability requires further verification

Future Directions

Multilingual Support: Extension to multilingual and cross-lingual scenarios
LLM-Based Evaluation: Using LLM-as-a-judge to replace human evaluation
Larger-Scale Validation: Verifying method effectiveness across more domains and scenarios

In-Depth Evaluation

Strengths

Clear Problem Definition: First systematic definition of the SQG task, filling a research gap
Strong Method Innovation:
- One-to-many generation paradigm effectively leverages LLM's autoregressive nature
- Intent-enhancement design is ingenious, significantly improving generation quality
- Optimization framework considers practical deployment constraints
Comprehensive Experiments:
- Multi-dimensional evaluation metrics
- Real dataset validation
- Real-world deployment performance verification
High Practical Value: Addresses pain points in highly regulated industries

Limitations

Insufficient Theoretical Analysis: Lacks a deeper theoretical explanation of why the one-to-many paradigm is more effective
Dataset Limitations: Primarily validated in the Chinese financial domain; cross-lingual and cross-domain generalization is not sufficiently verified
Computational Cost Analysis: Lacks a detailed analysis of training and inference costs
Long-Term Effects Unknown: Lacks a follow-up analysis of long-term deployment effects

Impact

Academic Contribution: Provides new insights for LLM application in retrieval-based systems
Industrial Value: Offers practical solutions for customer service scenarios with high compliance requirements
Method Reproducibility: Provides detailed implementation details and algorithm descriptions

Applicable Scenarios

High-Compliance Industries: Finance, healthcare, legal and other domains requiring accuracy guarantees
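Multilingual Customer Service: Potentially extendable to multilingual customer support systems, although the current method assumes monolingual queries and this extension is left to future work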
Knowledge Base Maintenance: Scenarios requiring efficient expansion and maintenance of question-answer knowledge bases
Retrieval-Augmented Systems: Various retrieval systems needing improved query matching performance

References

The paper cites multiple important related works, including:

Data augmentation methods: Wei et al. (2022), Liu et al. (2023)
Retrieval-based chatbots: Wu et al. (2018), Singh et al. (2018)
Large language model applications: Vaswani et al. (2017), Cheng et al. (2023)
Evaluation methods: Zhang et al. (2020), Li et al. (2016)

Overall Assessment: This is a high-quality applied research paper that achieves good balance between theoretical innovation and practical value. The method design is sound, experimental validation is comprehensive, and deployment verification in real business environments strengthens the paper's persuasiveness. It holds significant reference value for AI applications requiring compliance guarantees.