Domain-Specific Data Generation Framework for RAG Adaptation

Tian, Xie, Chen et al.

Retrieval-Augmented Generation (RAG) combines the language understanding and reasoning power of large language models (LLMs) with external retrieval to enable domain-grounded responses. Effectively adapting RAG systems to domain-specific settings requires specialized, context-rich training data beyond general-purpose question-answering. Here, we propose RAGen, a scalable and modular framework for generating domain-grounded question-answer-context (QAC) triples tailored to diverse RAG adaptation approaches. RAGen produces these QAC triples by identifying key concepts in documents, generating diverse questions guided by Bloom's Taxonomy-inspired principles, and pairing them with precise answers extracted from relevant contexts. RAGen supports multiple RAG adaptation strategies, including the optimization of key components such as the LLM, retriever, and embedding model. Its modular pipeline features semantic chunking, hierarchical concept extraction, and multi-chunk retrieval, along with the introduction of curated distractor contexts to promote robust reasoning. Designed for scalability, RAGen efficiently handles large and evolving document corpora without redundant processing, making it especially suitable for dynamically evolving domains such as scientific research and enterprise knowledge bases.

Basic Information

Paper ID: 2510.11217
Title: Domain-Specific Data Generation Framework for RAG Adaptation
Authors: Chris Xing Tian, Weihao Xie, Zhen Chen, Zhengyuan Yi, Hui Liu, Haoliang Li, Shiqi Wang, Siwei Ma
Classification: cs.CL cs.AI
Publication Date: October 13, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.11217

Abstract

Retrieval-Augmented Generation (RAG) combines the language understanding and reasoning capabilities of large language models with external retrieval to enable domain-grounded responses. Effectively adapting RAG systems to domain-specific settings requires specialized, context-rich training data that goes beyond generic question-answering. This paper proposes RAGen, a scalable modular framework for generating domain-grounded question-answer-context (QAC) triples tailored to different RAG adaptation methods. RAGen generates these QAC triples by identifying key concepts in documents, generating diverse questions under principles inspired by Bloom's taxonomy, and pairing them with precise answers extracted from relevant contexts.

Research Background and Motivation

Problem Definition

Core Problem: Generic RAG systems perform poorly when applied to specific domains, so adapting them requires specialized, domain-specific training data.
Key Challenges:
- Organizations prefer locally-deployed small-to-medium-scale LLMs due to data privacy concerns, regulatory compliance, and high costs
- Smaller models have limited language understanding and reasoning capabilities compared to frontier LLMs
- Existing RAG adaptation methods have narrow scope, typically targeting only single components of the RAG pipeline
- Lack of flexibility to support multi-component adaptation strategies

Research Motivation

Practical Demand: Growing need for domain-specific RAG systems in enterprise and organizational environments
Technical Gap: Existing methods rely on fixed, tightly-coupled training procedures that assume availability of high-quality domain-specific data
Scalability Requirements: Need for capability to handle large and continuously evolving document corpora

Core Contributions

Proposes RAGen Framework: A scalable modular framework for generating high-quality domain-specific QAC training data
Multi-component Adaptation Support: Enables simultaneous optimization of multiple RAG components including LLMs, retrievers, and embedding models
Cognitive-level Question Generation: Question generation strategy based on Bloom's taxonomy ensuring diversity in cognitive complexity
Cross-chunk Cross-concept Reasoning: Enables global question generation through multi-chunk retrieval and concept fusion
Distractor Context Strategy: Introduces carefully curated distractor contexts to enhance model robustness

Methodology Details

Task Definition

RAG adaptation is defined as the systematic process of optimizing individual components of retrieval-augmented generation systems (LLM, retriever, embedding model) to improve accuracy and robustness in dynamic domain-specific settings.

Model Architecture

The RAGen framework comprises three main modules:

1. Document Concepts Extraction

Semantic Chunking:

Uses the LlamaIndex chunker to partition domain documents D into a set of coherent chunks {d₁, d₂, ...}

Chunk-level Concept Extraction:

For each chunk dᵢ, uses ChatGPT-4o to extract chunk-level concept set Cᵢ = {cᵢ₁, cᵢ₂, ...}
These concepts capture the central themes of chunk dᵢ

Concept Fusion:

Fuses all chunk-level concepts based on semantic similarity
Embeds each concept with the OpenAI Ada embedding model
Applies the K-means clustering algorithm to group the embeddings into K semantically coherent clusters
Produces a deduplicated, representative document-level concept set O = {o₁, o₂, ..., oₖ}
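
The fusion step can be illustrated with a minimal sketch. It assumes concept embeddings are already available as plain lists of floats (in the paper they come from the OpenAI Ada model, which is not called here), uses a simple Lloyd's K-means written with the standard library only, and picks the concept nearest each centroid as the cluster representative. That selection rule, and the function names kmeans and fuse_concepts, are assumptions made for illustration and are not taken from the paper.

```python
import math
import random


def kmeans(vectors, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(v, centroids[j]))
            for v in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def fuse_concepts(concepts, embeddings, k):
    """Return one representative concept per non-empty cluster.

    Assumption: the representative is the member closest to the centroid.
    """
    labels, centroids = kmeans(embeddings, k)
    fused = []
    for j in range(k):
        members = [i for i, lab in enumerate(labels) if lab == j]
        if members:
            best = min(members, key=lambda i: math.dist(embeddings[i], centroids[j]))
            fused.append(concepts[best])
    return fused


# Toy 2-D "embeddings" standing in for real embedding vectors.
toy_concepts = ["food safety", "food security", "tariff schedule", "import duty"]
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
document_level = fuse_concepts(toy_concepts, toy_embeddings, k=2)
```

With the toy data, the two food-related concepts collapse into one document-level concept and the two trade-related concepts into another, which mirrors the deduplication role that fusion plays in the pipeline.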

2. Concept-centered Evidence Assembly

Cross-chunk Retrieval:

For each document-level concept, uses retriever-reranker pipeline to retrieve top-N relevant chunks
Employs dense retriever and BGE-Reranker-Base for retrieval and reranking

Evidence Extraction:

Performs sentence-level filtering within retrieved chunks
Extracts concept-focused text subsets, termed evidence e
For each concept oᵢ, represented as d^{oᵢ} → {e^{oᵢ}_0, e^{oᵢ}_1, ..., e^{oᵢ}_N}
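
A rough sketch of this module follows. It substitutes a bag-of-words cosine score for both the dense retriever and BGE-Reranker-Base, so it only illustrates the control flow (rank chunks for a concept, keep the top N, then filter sentences), not the actual models. The function names and the threshold parameter are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a stand-in for a real tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words token lists."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def retrieve_top_n(concept, chunks, n):
    """Rank chunks by similarity to the concept and keep the best n."""
    query = tokenize(concept)
    ranked = sorted(chunks, key=lambda c: cosine(query, tokenize(c)), reverse=True)
    return ranked[:n]


def extract_evidence(concept, chunk, threshold=0.0):
    """Sentence-level filter: keep sentences scoring above the threshold."""
    sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
    query = tokenize(concept)
    return [s for s in sentences if cosine(query, tokenize(s)) > threshold]


chunks = [
    "Tariff rates apply to imported rice. Inspections happen at the port.",
    "The committee met in May.",
    "Rice tariff quotas are reviewed yearly.",
]
top_chunks = retrieve_top_n("rice tariff", chunks, n=2)
evidence = [s for c in top_chunks for s in extract_evidence("rice tariff", c)]
```

Here the unrelated committee chunk is dropped at retrieval time, and the port-inspection sentence is dropped at the sentence-filtering stage even though its chunk was retrieved.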

3. QAC Generation

Bloom's Question Types: Based on six cognitive levels from the revised Bloom's taxonomy:

Remembering: Identifying or recalling information
Understanding: Constructing meaning from information
Applying: Using knowledge in new situations
Analyzing: Breaking down information and seeking evidence
Evaluating: Making judgments based on criteria
Creating: Combining elements to form coherent wholes

Question Generation:

Supports multi-stem combinations, with combination level ℓ controlling the number of concepts used simultaneously
When ℓ=1, generation traverses each stem individually; when ℓ≥2, stems are combined to support cross-concept reasoning
Uses ChatGPT-4o to generate questions, reference answers, reasoning traces, and supporting evidence
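
The role of the combination level ℓ can be sketched as an enumeration over concept combinations. The example assumes that each combination is paired with every Bloom level before being handed to the LLM; that pairing, the question_specs name, and the dictionary layout are illustrative assumptions, and the LLM call itself is omitted.

```python
from itertools import combinations

BLOOM_LEVELS = [
    "Remembering",
    "Understanding",
    "Applying",
    "Analyzing",
    "Evaluating",
    "Creating",
]


def question_specs(concepts, level):
    """Enumerate (concept combination, Bloom level) pairs for a given level.

    level=1 yields one spec per single concept; level>=2 yields
    cross-concept specs that combine several concepts at once.
    """
    return [
        {"concepts": combo, "bloom_level": bloom}
        for combo in combinations(concepts, level)
        for bloom in BLOOM_LEVELS
    ]


concepts = ["document drafting agents", "profit and loss", "data governance"]
single = question_specs(concepts, level=1)  # 3 concepts x 6 Bloom levels
paired = question_specs(concepts, level=2)  # 3 concept pairs x 6 Bloom levels
```

Because the number of combinations grows quickly with ℓ and with the number of document-level concepts, a real pipeline would likely sample or cap these specifications rather than generate all of them.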

Context Variant Construction: Associates four curated context variants with each question-answer instance:

Full Support: Evidence sentences directly answering the question
Partial Support: Evidence subset containing incomplete information
Irrelevant: Domain-relevant but question-irrelevant content
Misleading: Topic-related but semantically insufficient content
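
One way to hold a QAC instance together with its four context variants is a small record type, sketched below. The class name, field names, and the partial_support helper are hypothetical; the helper simply keeps a prefix of the full evidence, which is one possible reading of "evidence subset containing incomplete information".

```python
from dataclasses import dataclass, field

VARIANTS = ("full_support", "partial_support", "irrelevant", "misleading")


@dataclass
class QACInstance:
    question: str
    answer: str
    # Maps a variant name to the list of context sentences for that variant.
    contexts: dict = field(default_factory=dict)

    def add_context(self, variant, sentences):
        """Attach one context variant, rejecting unknown variant names."""
        if variant not in VARIANTS:
            raise ValueError(f"unknown context variant: {variant}")
        self.contexts[variant] = list(sentences)

    def missing_variants(self):
        """Variants that still need to be curated for this instance."""
        return [v for v in VARIANTS if v not in self.contexts]


def partial_support(full_sentences, keep=1):
    """Assumed rule: partial support is a strict prefix of the full evidence."""
    if not 0 < keep < len(full_sentences):
        raise ValueError("keep must leave out at least one evidence sentence")
    return list(full_sentences[:keep])


full = ["Agents cut drafting time.", "Shorter drafting lowers operating cost."]
qac = QACInstance(
    question="How do drafting agents affect profit and loss?",
    answer="They lower operating cost by cutting drafting time.",
)
qac.add_context("full_support", full)
qac.add_context("partial_support", partial_support(full))
```

Keeping the variants on the same record makes it straightforward to build distractor-aware training examples by mixing supporting and non-supporting contexts for a single question.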

Technical Innovations

Global Concept Fusion: Transcends single-chunk limitations through document-level concept extraction, supporting global question generation
Multi-level Cognitive Modeling: Ensures systematic distribution of question cognitive complexity based on Bloom's taxonomy
Fine-grained Distractor Strategy: Designs four types of context variants, surpassing random sampling distraction methods
Cross-chunk Cross-concept Reasoning: Supports multi-stem combinations enabling complex logical chain reasoning

Experimental Setup

Datasets

Constructs three domain-specific datasets:

Domain	Corpus Scale (Train/Eval)	Question Count (RAGen/LlamaIndex/AutoRAG)
PPFS	15/3	2726/2502/2084
TradePolicy	20/5	1977/1820/1500
BusinessAI	17/3	2228/2118/2072

PPFS: Documents from the APEC Policy Partnership on Food Security (PPFS)
TradePolicy: Import-export regulations from 8 APEC economies
BusinessAI: AI adoption technical reports from various business departments

Evaluation Metrics

Retrieval Tasks: Recall@K (K=1,5,10), MRR@10
Generation Tasks: ROUGE-L, BERT-F1
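
For reference, the retrieval metrics and ROUGE-L can be computed per query as in the sketch below. It assumes whitespace tokenization and a balanced F-measure for ROUGE-L; the paper's exact evaluation scripts and settings are not given in this summary, so these are standard textbook definitions rather than the authors' code. BERT-F1 is omitted because it requires a pretrained model.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant items that appear in the top-k ranked results."""
    if not relevant_ids:
        return 0.0
    hits = set(ranked_ids[:k]) & set(relevant_ids)
    return len(hits) / len(set(relevant_ids))


def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant item within the top k, else 0."""
    relevant = set(relevant_ids)
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 over whitespace tokens (balanced precision and recall)."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


ranked = ["d3", "d1", "d7"]
relevant = ["d1"]
r1 = recall_at_k(ranked, relevant, k=1)
r5 = recall_at_k(ranked, relevant, k=5)
mrr = mrr_at_k(ranked, relevant, k=10)
rl = rouge_l_f1("the cat sat on the mat", "the cat is on the mat")
```

Dataset-level scores are the mean of these per-query values over all evaluation questions.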

Baseline Methods

AutoRAG: Automatic RAG pipeline configuration framework
LlamaIndex Dataset Generator: Open-source QA data generator

Implementation Details

Document chunking: 1024 token chunks with 200 token overlap
Embedding model fine-tuning: Learning rate 1e-5, 3 epochs, temperature parameter τ=0.02
LLM fine-tuning: LoRA method, learning rate 1e-5, 5 epochs
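
The chunk size and overlap settings can be illustrated with a fixed-size sliding window. This sketch operates on an already tokenized sequence and is not the semantic chunker used in the framework; it only shows how a 1024-token window with a 200-token overlap advances by 824 tokens per step. The function name chunk_tokens is hypothetical.

```python
def chunk_tokens(tokens, chunk_size=1024, overlap=200):
    """Split a token list into windows of chunk_size sharing `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window already reaches the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


tokens = list(range(2000))  # integers standing in for token ids
windows = chunk_tokens(tokens)
```

For a 2000-token input this yields three windows starting at positions 0, 824, and 1648, with the final window shorter than 1024 tokens.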

Experimental Results

Main Results

Embedding Model Customization Results

For every embedding model tested, fine-tuning on RAGen data yields the best retrieval performance in all three domains:

BGE-large Model Performance on PPFS Domain:

Recall@1: RAGen(0.3095) > LlamaIndex(0.2024) > AutoRAG(0.1877)
MRR@10: RAGen(0.4626) > LlamaIndex(0.3548) > AutoRAG(0.3342)

LLM Supervised Fine-tuning Results

RAGen consistently outperforms baselines across all domains and model scales:

Qwen2.5-3B on PPFS Domain:

ROUGE-L: RAGen(0.3815) > AutoRAG(0.3436) > LlamaIndex(0.3253)
BERT-F1: RAGen(0.9079) > AutoRAG(0.8979) > LlamaIndex(0.8952)

Ablation Studies

Distractor Supervision Effects

Evaluation in realistic RAG inference settings (k=3):

Without distractor training: ROUGE-L(0.3143), BERT-F1(0.8957)
With distractor training: ROUGE-L(0.4074), BERT-F1(0.9121)

Distractor-aware training raises ROUGE-L by 0.0931 (roughly 30% relative) and BERT-F1 by 0.0164, supporting its effectiveness under noisy retrieval.

Case Analysis

Cross-concept Question Example

Question: "How does the integration of document drafting agents affect the incremental profit and loss of life sciences companies?"

Concepts: Document drafting agents & Profit and loss
Evidence Sources: Evidence from 3 non-adjacent chunks
Reasoning Depth: Requires comprehensive analysis across multiple evidence sources

Experimental Findings

Cognitive Level Distribution: RAGen generates more higher-order cognitive questions (analyzing, evaluating, creating), significantly reducing lower-level questions
Cross-concept Capability: Multi-stem combinations achieve global reasoning impossible with traditional single-chunk methods
Robustness Enhancement: Distractor context training significantly improves model performance in noisy retrieval environments

Related Work

Question Generation Research

CliniQG4QA: Controlled QA pair generation in clinical domain, but relies on template-driven methods
E2EQR: Multi-hop QA generation, but lacks semantic evidence selection mechanisms
RAGEval: QA dataset evaluation in RAG contexts, but depends on scenario-specific patterns

Retrieval-Augmented Generation

DPR: Improves retrieval through dense representation learning
GraphRAG: Graph-based retrieval and decoding, but depends on predefined graph patterns
RAFT: Introduces distractor-aware supervision to improve LLM robustness
Self-RAG/OpenRAG: Inference-time retrieval control methods

Conclusions and Discussion

Main Conclusions

RAGen framework successfully generates high-quality domain-specific QAC datasets
Multi-component RAG adaptation strategies significantly outperform single-component optimization methods
Question generation based on Bloom's taxonomy ensures systematic distribution of cognitive complexity
Cross-chunk cross-concept reasoning capability enables more comprehensive domain understanding

Limitations

Document Format Constraints: Currently supports only text format documents, not PDF or multimodal inputs
Seed Document Quality Dependency: Generated data quality is significantly affected by source document quality
Manual Hyperparameter Setting: Document-level concept count K requires manual specification
Computational Cost: Reliance on ChatGPT-4o may incur substantial computational costs

Future Directions

Extend to multimodal document processing capabilities
Automate hyperparameter selection mechanisms
Reduce dependency on commercial APIs
Support larger-scale enterprise applications

In-depth Evaluation

Strengths

Methodological Innovation: First to propose a unified data generation framework supporting multi-component RAG adaptation
Solid Theoretical Foundation: Grounding question generation in Bloom's taxonomy gives the framework an established pedagogical basis
Comprehensive Experiments: Validates method effectiveness across three different domains with well-designed comparative experiments
High Practical Value: Addresses practical needs of enterprise-level RAG system adaptation

Weaknesses

Evaluation Limitations: Validation on only three domains, generalization capability requires broader verification
Missing Computational Cost Analysis: Lacks detailed analysis of framework computational overhead and time complexity
Absence of Human Evaluation: Primarily relies on automatic evaluation metrics, lacks human quality assessment
Unknown Long-term Effects: Does not evaluate long-term adaptation capability in dynamically evolving domains

Impact

Academic Contribution: Provides new research paradigm for domain adaptation of RAG systems
Practical Value: Offers practical solutions for enterprise knowledge bases and research domains
Reproducibility: Detailed method description and clear experimental setup provide good reproducibility

Applicable Scenarios

Enterprise Knowledge Bases: Suitable for enterprise internal knowledge management systems requiring frequent updates
Research Literature: Appropriate for handling rapidly evolving research domain literature
Professional Consulting: Applicable to intelligent question-answering systems in legal, medical and other professional domains
Educational Training: Bloom's taxonomy characteristics make it suitable for educational scenario applications

References

The paper cites key related work, including the seminal RAG paper by Lewis et al. (2020), the RAFT method by Zhang et al. (2024c), and inference-time retrieval control methods such as Self-RAG by Asai et al. (2023).