Multi-stage Prompt Refinement for Mitigating Hallucinations in Large Language Models

Shim, Ju, Park et al.

Recent advancements in large language models (LLMs) have shown strong performance in natural language understanding and generation tasks. However, LLMs continue to encounter challenges with hallucinations, where models generate plausible but incorrect information. While several factors contribute to hallucinations, the impact of ill-formed prompts (prompts with ambiguous wording, incorrect grammar, or incomplete information) remains relatively underexplored. To address this, we introduce Multi-stage Prompt Refinement (MPR), a framework designed to systematically improve these ill-formed prompts across multiple stages. Each stage addresses specific errors such as punctuation, typographical mistakes, and misuse of key terms, using small language models (SLMs) fine-tuned for these tasks. MPR iteratively enhances the clarity of prompts with additional context and employs a self-reflection mechanism with ranking to prioritize the most relevant input. Experimental results on hallucination benchmarks show that prompts refined by MPR achieve a win rate of over 85% compared to their original forms, demonstrating its effectiveness in reducing hallucinations and improving LLM output accuracy. Interestingly, we reveal that MPR can be combined with existing post-hoc hallucination mitigation frameworks, further enhancing its versatility. MPR provides a lightweight and adaptable solution for enhancing LLM reliability across various domains.

Basic Information

Paper ID: 2510.12032
Title: Multi-stage Prompt Refinement for Mitigating Hallucinations in Large Language Models
Authors: Jung-Woo Shim, Yeong-Joon Ju, Ji-Hoon Park, Seong-Whan Lee
Institution: Korea University, Department of Artificial Intelligence
Classification: cs.CL cs.AI cs.LG
Publication Date: October 14, 2025 (arXiv)
Paper Link: https://arxiv.org/abs/2510.12032

Abstract

Large language models perform strongly on natural language understanding and generation tasks, yet they still hallucinate: they generate information that appears plausible but is factually incorrect. While multiple factors contribute to hallucinations, the impact of ill-formed prompts (containing ambiguous phrasing, grammatical errors, or incomplete information) remains relatively underexplored. This paper proposes Multi-stage Prompt Refinement (MPR), a framework that systematically improves such ill-formed prompts across multiple stages. Each stage employs a small language model (SLM) fine-tuned for a specific task to address concrete issues such as punctuation, spelling errors, and keyword misuse. MPR iteratively enhances prompt clarity with additional context and uses a self-reflection mechanism with ranking to prioritize the most relevant inputs. Experimental results show that prompts refined by MPR achieve a win rate above 85% over their original forms, reducing hallucinations and improving LLM output accuracy.

Research Background and Motivation

Problem Definition

Although large language models excel in multiple NLP tasks, they face a critical challenge: the hallucination problem, wherein models generate information that appears reasonable but is factually incorrect. This is particularly dangerous in critical domains such as healthcare and education, where accurate information transmission is paramount.

Limitations of Existing Methods

Current approaches to mitigating hallucinations primarily focus on:

Model Architecture Modification: Altering LLM internal mechanisms, but at high computational cost
Post-processing Techniques: Verifying content after generation, adding system complexity and latency
Reinforcement Learning Fine-tuning: Requiring substantial computational resources, difficult for real-time applications

These methods typically overlook an important factor: the quality of user prompts. Ill-formed prompts can lead directly to inaccurate outputs, yet existing solutions often rely on large models or computationally intensive techniques.

Research Motivation

This paper posits that systematically optimizing input prompt quality can reduce hallucination problems at their source. Compared to modifying model architectures or post-processing outputs, prompt optimization represents a more lightweight and scalable solution.

Core Contributions

Proposes MPR Framework: The first systematic multi-stage refinement framework addressing hallucinations caused by ill-formed prompts
Lightweight Design: Employs small language models (SLMs) rather than large models, significantly reducing computational costs
Model Agnosticism: Seamlessly integrates with any LLM architecture, demonstrating high adaptability
Comprehensive Evaluation: Validates effectiveness across multiple datasets with win rates exceeding 85%
Compatibility Verification: Demonstrates compatibility with existing post-processing hallucination mitigation methods for further performance enhancement

Methodology Details

Task Definition

Input: Ill-formed user prompts (containing punctuation errors, spelling mistakes, grammatical issues, terminology misuse, etc.)
Output: High-quality prompts optimized through multi-stage refinement
Objective: Reduce hallucinations in LLM-generated content and improve output accuracy and relevance

Model Architecture

The MPR framework comprises three main stages:

Stage 1: Error Detection and Classification

Employs a specialized fine-tuned SLM to identify error types in prompts, classifying them as:

Stage 1 Errors: Basic punctuation and capitalization errors
Stage 2 Errors: Spelling and grammatical errors
Stage 3 Errors: Semantic ambiguity and terminology misuse

Stage 2: Multi-stage Prompt Cleaning

Applies the specialized SLM that corresponds to each detected error level; the cleaning sub-stages below mirror the Stage 1 to Stage 3 error levels listed above:

Stage 1: Punctuation and Capitalization Correction

Input: "what is the caPital of fRAnce?"
Output: "What is the capital of France?"

Stage 2: Spelling and Grammar Correction

Input: "See from spaiin moroco?"
Output: "Can you see Spain from Morocco?"

Stage 3: Semantic Alignment and Rewriting

Input: "Tell me about transformers"
Output: "Can you explain how Transformer-based neural networks work?"
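
The staged cleaning above can be sketched as a chain of correctors applied in order. This is only an illustrative sketch: the paper uses fine-tuned SLMs, whereas the functions below (`fix_casing_and_punctuation`, `fix_spelling`, `align_semantics`) and their lookup tables are hypothetical rule-based stand-ins, so they reproduce the casing and rewriting examples but not the full grammatical repair of the Stage 2 example.

```python
import re
from typing import Callable, Dict, List

# Hypothetical lookup tables standing in for what the fine-tuned SLMs learn.
SPELLING_FIXES: Dict[str, str] = {
    "spaiin": "Spain",
    "moroco": "Morocco",
    "france": "France",
}
REWRITES: Dict[str, str] = {
    "Tell me about transformers.":
        "Can you explain how Transformer-based neural networks work?",
}


def fix_casing_and_punctuation(prompt: str) -> str:
    """Stage 1 stand-in: normalize case and ensure terminal punctuation."""
    text = prompt.strip().lower()
    text = text[:1].upper() + text[1:]
    if text and text[-1] not in ".?!":
        text += "."
    return text


def fix_spelling(prompt: str) -> str:
    """Stage 2 stand-in: replace known misspellings word by word."""
    def repl(match: "re.Match[str]") -> str:
        word = match.group(0)
        return SPELLING_FIXES.get(word.lower(), word)
    return re.sub(r"[A-Za-z]+", repl, prompt)


def align_semantics(prompt: str) -> str:
    """Stage 3 stand-in: rewrite a known ambiguous prompt, else pass through."""
    return REWRITES.get(prompt, prompt)


STAGES: List[Callable[[str], str]] = [
    fix_casing_and_punctuation,
    fix_spelling,
    align_semantics,
]


def refine(prompt: str, stages: List[Callable[[str], str]] = STAGES) -> str:
    """Run the prompt through each cleaning stage in order."""
    for stage in stages:
        prompt = stage(prompt)
    return prompt


print(refine("what is the caPital of fRAnce?"))
print(refine("tell me about transformers"))
```

Keeping each stage as an independent callable reflects the modular design described above: a stage can be swapped for a real model call without touching the others.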

Stage 3: Iterative Description Generation

Description Generation: Adds contextual information for ambiguous terms
Self-reflection Verification: Evaluates description adequacy and conciseness
Perplexity-based Ranking: Selects the most coherent and relevant descriptions
Intelligent Integration: Adds descriptions only when necessary, improving efficiency
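
The filter-then-rank step can be illustrated with a small sketch. Everything here is an assumption made for illustration: the paper computes perplexity with a language model, while this sketch substitutes a toy add-one-smoothed unigram model, and `passes_reflection` is a hypothetical stand-in for the self-reflection check (it only requires that the description mentions the term and stays short).

```python
import math
from collections import Counter
from typing import List


def train_unigram(corpus: List[str]) -> Counter:
    """Count word frequencies in a small reference corpus."""
    return Counter(word for line in corpus for word in line.lower().split())


def perplexity(text: str, counts: Counter) -> float:
    """Perplexity exp(-(1/N) * sum(log p(w))) under an add-one unigram model."""
    words = text.lower().split()
    if not words:
        return math.inf
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen words
    log_prob = sum(math.log((counts[w] + 1) / (total + vocab)) for w in words)
    return math.exp(-log_prob / len(words))


def passes_reflection(description: str, term: str, max_words: int = 20) -> bool:
    """Hypothetical adequacy/conciseness check: on-topic and not too long."""
    return term.lower() in description.lower() and len(description.split()) <= max_words


def rank_descriptions(candidates: List[str], term: str, counts: Counter) -> List[str]:
    """Drop candidates that fail the check, then sort by ascending perplexity."""
    kept = [c for c in candidates if passes_reflection(c, term)]
    return sorted(kept, key=lambda c: perplexity(c, counts))


corpus = [
    "a transformer is a neural network architecture based on attention",
    "attention lets a neural network weigh input tokens",
]
counts = train_unigram(corpus)
candidates = [
    "transformer zebra quantum banana",
    "a transformer is a neural network based on attention",
    "an electrical device",
]
ranked = rank_descriptions(candidates, "transformer", counts)
print(ranked[0])
```

Lower perplexity means the scoring model finds the description more predictable, which is used here as a proxy for coherence.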

Technical Innovations

Staged Processing Strategy: Different error types require different handling methods; staged processing is more precise and effective
SLM Specialization: Each SLM is fine-tuned for specific tasks, ensuring quality while maintaining efficiency
QLoRA Fine-tuning Technique: Employs 4-bit quantized low-rank adaptation, reducing memory requirements while preserving performance
Adaptive Description Generation: Dynamically generates descriptions as needed, avoiding unnecessary computational overhead
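
To see why 4-bit quantized low-rank adaptation saves memory, a rough back-of-the-envelope calculation helps. Low-rank adaptation freezes a d x k weight matrix and trains two small matrices of shapes d x r and r x k instead. The layer size and rank below are illustrative assumptions, not values from the paper, and the memory estimate ignores quantization constants and other overhead.

```python
def full_params(d: int, k: int) -> int:
    """Parameters in a dense d x k weight matrix."""
    return d * k


def lora_trainable_params(d: int, k: int, r: int) -> int:
    """Trainable parameters in the low-rank pair B (d x r) and A (r x k)."""
    return r * (d + k)


def weight_memory_bytes(num_params: int, bits: int) -> int:
    """Storage for the weights alone at the given bit width."""
    return num_params * bits // 8


d, k, r = 4096, 4096, 16  # assumed layer shape and rank, for illustration only
dense = full_params(d, k)
adapter = lora_trainable_params(d, k, r)

print(f"dense weights:      {dense:,}")
print(f"trainable (rank {r}): {adapter:,} ({100 * adapter / dense:.2f}% of dense)")
print(f"frozen weights at 16-bit: {weight_memory_bytes(dense, 16):,} bytes")
print(f"frozen weights at 4-bit:  {weight_memory_bytes(dense, 4):,} bytes")
```

Under these assumptions the adapter trains under 1% of the layer's parameters, and storing the frozen weights in 4 bits takes a quarter of the 16-bit footprint.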

Experimental Setup

Datasets

Training Data Construction:

OLM Wikipedia Dataset: 10,000 grammatically perfect entries for punctuation and grammar optimization
CoEdIT Dataset: Focuses on fluency, coherence, and style-preserving non-semantic edits
MQR Dataset: 2,114 question rewriting pairs for semantic equivalence transformation training
Magpie Dataset: 300,000 keyword-description pairs for terminology explanation generation

Evaluation Datasets:

Well-formed Query Dataset: 8,000 user queries with well-formedness scores below 0.5
GSM8K: Mathematical problem dataset
SQuAD: Reading comprehension dataset
Natural Questions: Natural question dataset

Corruption Strategy: To thoroughly test the framework, three levels of errors are artificially introduced:

Stage 1: Basic punctuation errors
Stage 2: Spelling and grammatical errors
Stage 3: Technical terminology and abbreviation errors
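
A minimal sketch of how such three-level corruption might be generated is shown below. The specific rules (dropping punctuation and case, swapping adjacent letters, collapsing terms into bare abbreviations) and the `ABBREVIATIONS` table are assumptions chosen for illustration; the summary does not specify the paper's exact corruption procedure.

```python
import random
import re
import string

# Hypothetical term-to-abbreviation map, for illustration only.
ABBREVIATIONS = {
    "large language model": "LLM",
    "natural language processing": "NLP",
}


def corrupt_stage1(prompt: str) -> str:
    """Level 1: remove punctuation and capitalization."""
    return "".join(ch for ch in prompt.lower() if ch not in string.punctuation)


def corrupt_stage2(prompt: str, rng: random.Random) -> str:
    """Level 2: swap two adjacent letters in each alphabetic word over 3 chars."""
    words = []
    for word in prompt.split():
        if len(word) > 3 and word.isalpha():
            i = rng.randrange(len(word) - 1)
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        words.append(word)
    return " ".join(words)


def corrupt_stage3(prompt: str) -> str:
    """Level 3: replace full technical terms with bare abbreviations."""
    for term, abbr in ABBREVIATIONS.items():
        prompt = re.sub(re.escape(term), abbr, prompt, flags=re.IGNORECASE)
    return prompt


rng = random.Random(0)  # fixed seed so the corruption is reproducible
print(corrupt_stage1("What is the capital of France?"))
print(corrupt_stage2("Explain gradient descent", rng))
print(corrupt_stage3("How does a large language model work?"))
```

Passing an explicit `random.Random` instance keeps the spelling corruption reproducible across evaluation runs.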

Evaluation Metrics

Hallucination Index (HI): Quantifies the degree of hallucination in generated content (0-1, lower is better)
Content Quality Score (CQS): Measures relevance, coherence, and overall quality (0-1, higher is better)
Win Rate (WR): Percentage of comparisons in which the response to the MPR-refined prompt is judged better than the response to the original prompt
Processing Time (T): Framework efficiency assessment
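
As an illustration of the win-rate metric, the sketch below computes it from a list of pairwise judgments. The label names (`"refined"`, `"original"`, `"tie"`) and the choice to count ties as non-wins are assumptions for this example, not details taken from the paper.

```python
from typing import Iterable


def win_rate(judgments: Iterable[str]) -> float:
    """Percentage of pairwise comparisons won by the refined prompt.

    Each judgment is "refined", "original", or "tie"; ties count as non-wins
    (an assumption made for this sketch).
    """
    judgments = list(judgments)
    if not judgments:
        raise ValueError("win_rate needs at least one judgment")
    wins = sum(1 for j in judgments if j == "refined")
    return 100 * wins / len(judgments)


sample = ["refined"] * 17 + ["original"] * 2 + ["tie"]
print(f"{win_rate(sample):.1f}%")
```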

Baseline Methods

SelfCheckGPT: Zero-resource black-box hallucination detection method
CoVe: Chain-of-Verification method
DRESS: Natural language feedback-based alignment method
MixAlign: Knowledge alignment method

Implementation Details

Hardware: NVIDIA RTX A6000 GPU for training, NVIDIA TITAN V GPU for inference
Fine-tuning Method: QLoRA (4-bit quantized low-rank adaptation)
Evaluator: GPT-3.5-turbo API as the primary evaluator

Experimental Results

Main Results

Performance on the Well-formed Query dataset:

Model	Corruption Level	HI ↓	CQS ↑	WR ↑
Baseline	-	0.81	0.52	-
LLaMA-2 (7B)	Stage 1	0.26 (-0.55)	0.80 (+0.28)	91%
LLaMA-2 (7B)	Stage 3	0.48 (-0.33)	0.60 (+0.08)	86%
Average Performance	-	0.37 (-0.44)	0.68 (+0.16)	86%

Key Findings

Consistent Improvement: MPR demonstrates significant improvements across all tested models and datasets
Corruption Level Correlation: Higher corruption levels show more pronounced MPR improvements
Model Scale Effect: Larger models (e.g., LLaMA-3.2) benefit more from MPR's description generation step
Cross-domain Effectiveness: Effective across diverse tasks including mathematics (GSM8K), reading comprehension (SQuAD), and question answering (NQ)

Ablation Study

Configuration	HI ↓	CQS ↑	WR ↑
Complete MPR	0.14	0.83	93%
Without Description Generation	0.20	0.78	89%
Without Multi-stage Cleaning	0.24	0.74	86%
Without Iterative Ranking	0.21	0.75	87%

Removing any single component degrades all three metrics. Removing multi-stage cleaning hurts the most (HI rises from 0.14 to 0.24 and WR falls from 93% to 86%), making it the most critical component.
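
The per-component impact can be read off the ablation table with a short calculation. The sketch below uses only the numbers reported in the table; ranking components by the increase in HI is one reasonable choice made here for illustration, not a procedure stated in the paper.

```python
FULL = {"HI": 0.14, "CQS": 0.83, "WR": 93}
ABLATIONS = {
    "Without Description Generation": {"HI": 0.20, "CQS": 0.78, "WR": 89},
    "Without Multi-stage Cleaning": {"HI": 0.24, "CQS": 0.74, "WR": 86},
    "Without Iterative Ranking": {"HI": 0.21, "CQS": 0.75, "WR": 87},
}


def degradation(full: dict, ablated: dict) -> dict:
    """How much worse each metric gets when a component is removed."""
    return {
        "HI": round(ablated["HI"] - full["HI"], 2),    # HI: higher is worse
        "CQS": round(full["CQS"] - ablated["CQS"], 2),  # CQS: lower is worse
        "WR": full["WR"] - ablated["WR"],               # WR: percentage points lost
    }


impact = {name: degradation(FULL, row) for name, row in ABLATIONS.items()}
most_critical = max(impact, key=lambda name: impact[name]["HI"])

for name, delta in impact.items():
    print(f"{name}: HI +{delta['HI']}, CQS -{delta['CQS']}, WR -{delta['WR']} pts")
print("Largest HI increase:", most_critical)
```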

Comparison with Existing Methods

Framework	HI ↓	CQS ↑	WR ↑	Processing Time (ms)
MPR	0.18	0.81	91%	1215
SelfCheckGPT	0.22	0.76	85%	1541
SelfCheckGPT + MPR	0.14	0.85	94%	1478

MPR performs well on its own (HI 0.18, WR 91%) and improves further when combined with SelfCheckGPT (HI 0.14, WR 94%).

Related Work

Hallucination Mitigation Methods

Existing approaches fall into three main categories:

Architecture Modification: Adjusting model internal mechanisms, high computational cost
Post-processing Verification: Verifying content after generation, adding latency
Reinforcement Learning: Rewarding factual responses, requiring substantial computational resources

Small Language Model Applications

SLMs can achieve excellent performance on specific tasks through fine-tuning and are particularly suitable for:

Resource-constrained environments
Real-time applications
Domain-specific tasks

Prompt Optimization Techniques

Traditional methods include:

LLM-based prompt rewriting (high computational cost)
Reinforcement learning-based iterative improvement
Manual intervention optimization

MPR achieves lightweight prompt optimization through the use of small models.

Conclusions and Discussion

Main Conclusions

Effectiveness Validation: MPR demonstrates excellent performance in reducing hallucinations and improving output quality
Lightweight Design: Significantly reduces computational costs compared to existing methods
Broad Applicability: Compatible with multiple LLM architectures and existing mitigation methods
Practical Value: Provides a scalable solution for real-world applications

Limitations

Domain Specificity: May underperform in specialized domains such as law and medicine
Evaluation Metric Limitations: Existing metrics do not fully capture user satisfaction and fluency
Automation Level: While fully automated, may benefit from human-in-the-loop systems

Future Directions

Domain Specialization: Develop fine-tuning strategies tailored to specific domains
Multimodal Extension: Extend the framework to multimodal environments such as image-text
Human-Machine Collaboration: Integrate human feedback mechanisms
Evaluation Framework: Develop more comprehensive user-centric evaluation methods

In-depth Evaluation

Strengths

Strong Innovation: First systematic approach to addressing hallucinations from the perspective of prompt quality
Reasonable Design: Multi-stage processing strategy precisely targets different error types
High Practicality: Lightweight design makes it feasible in resource-constrained environments
Comprehensive Experiments: Thorough evaluation across multiple datasets and models
Good Compatibility: Combines with existing methods for further performance enhancement

Weaknesses

Domain Limitations: Performance in specialized domains requires further validation
Language Constraints: Primarily targets English; multilingual support is unclear
Complexity Assessment: While claimed to be lightweight, multi-stage processing still involves certain complexity
Long-term Effects: Performance in extended dialogues or complex tasks remains unevaluated

Impact

Academic Value: Provides a new research direction for hallucination mitigation
Practical Value: Offers a viable optimization solution for real-world LLM deployment
Reproducibility: Detailed method description facilitates reproduction and improvement
Extensibility: Framework design demonstrates good extension potential

Applicable Scenarios

Resource-constrained Environments: Edge devices, mobile applications
Real-time Systems: Interactive systems requiring rapid response
Quality-sensitive Applications: Education, customer service, and other scenarios with high accuracy requirements
Existing System Upgrades: Integration as a plugin into existing LLM systems

References

This paper cites 27 important references covering recent research in large language models, hallucination detection, prompt engineering, small model applications, and related fields, providing a solid theoretical foundation for the research.

Overall Assessment: This is a high-quality research paper proposing an innovative solution to address LLM hallucination problems. The MPR framework is elegantly designed with comprehensive experiments and convincing results. Despite certain limitations, its lightweight and modular design provides high practical value and extension potential.