Retrieval-Augmented Generation (RAG) combines the language understanding and reasoning power of large language models (LLMs) with external retrieval to enable domain-grounded responses. Effectively adapting RAG systems to domain-specific settings requires specialized, context-rich training data beyond general-purpose question-answering. Here, we propose RAGen, a scalable and modular framework for generating domain-grounded question-answer-context (QAC) triples tailored to diverse RAG adaptation approaches. RAGen produces these QAC triples by identifying key concepts in documents, generating diverse questions guided by Bloom's Taxonomy-inspired principles, and pairing them with precise answers extracted from relevant contexts. RAGen supports multiple RAG adaptation strategies, including the optimization of key components such as the LLM, the retriever, and the embedding model. Its modular pipeline features semantic chunking, hierarchical concept extraction, and multi-chunk retrieval, along with curated distractor contexts that promote robust reasoning. Designed for scalability, RAGen efficiently handles large and evolving document corpora without redundant processing, making it especially suitable for dynamically evolving domains such as scientific research and enterprise knowledge bases.
Domain-Specific Data Generation Framework for RAG Adaptation
- Paper ID: 2510.11217
- Title: Domain-Specific Data Generation Framework for RAG Adaptation
- Authors: Chris Xing Tian, Weihao Xie, Zhen Chen, Zhengyuan Yi, Hui Liu, Haoliang Li, Shiqi Wang, Siwei Ma
- Classification: cs.CL cs.AI
- Publication Date: October 13, 2025 (arXiv preprint)
- Paper Link: https://arxiv.org/abs/2510.11217
Retrieval-Augmented Generation (RAG) combines the language understanding and reasoning capabilities of large language models with external retrieval to enable domain-grounded responses. Effectively adapting RAG systems to domain-specific settings requires specialized, context-rich training data that goes beyond generic question-answering. This paper proposes RAGen, a scalable modular framework for generating domain-grounded question-answer-context (QAC) triples tailored to different RAG adaptation methods. RAGen generates these QAC triples by identifying key concepts in documents, generating diverse questions under principles inspired by Bloom's taxonomy, and pairing them with precise answers extracted from relevant contexts.
- Core Problem: Existing generic RAG systems perform poorly when applied to specific domains, requiring specialized domain-adaptive training data.
- Key Challenges:
- Organizations prefer locally-deployed small-to-medium-scale LLMs due to data privacy concerns, regulatory compliance, and high costs
- Smaller models have limited language understanding and reasoning capabilities compared to frontier LLMs
- Existing RAG adaptation methods have narrow scope, typically targeting only single components of the RAG pipeline
- Lack of flexibility to support multi-component adaptation strategies
- Practical Demand: Growing need for domain-specific RAG systems in enterprise and organizational environments
- Technical Gap: Existing methods rely on fixed, tightly-coupled training procedures that assume availability of high-quality domain-specific data
- Scalability Requirements: Need for capability to handle large and continuously evolving document corpora
- Proposes RAGen Framework: A scalable modular framework for generating high-quality domain-specific QAC training data
- Multi-component Adaptation Support: Enables simultaneous optimization of multiple RAG components including LLMs, retrievers, and embedding models
- Cognitive-level Question Generation: Question generation strategy based on Bloom's taxonomy ensuring diversity in cognitive complexity
- Cross-chunk Cross-concept Reasoning: Enables global question generation through multi-chunk retrieval and concept fusion
- Distractor Context Strategy: Introduces carefully curated distractor contexts to enhance model robustness
RAG adaptation is defined as the systematic process of optimizing individual components of retrieval-augmented generation systems (LLM, retriever, embedding model) to improve accuracy and robustness in dynamic domain-specific settings.
The RAGen framework comprises three main modules:
Semantic Chunking:
- Uses the LlamaIndex chunker to partition the domain documents D into a set of coherent chunks {d₁, d₂, ...}
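A minimal sketch of the fixed-window side of chunking, using the 1024-token chunk size and 200-token overlap reported in the experimental settings. It is not the LlamaIndex semantic chunker: `chunk_tokens` is a hypothetical helper, and the toy tokens stand in for the output of a real tokenizer.

```python
def chunk_tokens(tokens, chunk_size=1024, overlap=200):
    """Split a token list into windows of `chunk_size` that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Toy document of 2000 placeholder tokens (illustrative only).
tokens = [f"t{i}" for i in range(2000)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])  # [1024, 1024, 352]
```

Consecutive chunks share their boundary tokens, so a sentence that straddles a window edge still appears whole in at least one chunk when it is shorter than the overlap.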
Chunk-level Concept Extraction:
- For each chunk dᵢ, uses ChatGPT-4o to extract chunk-level concept set Cᵢ = {cᵢ₁, cᵢ₂, ...}
- These concepts capture the central themes of chunk dᵢ
Concept Fusion:
- Fuses all chunk-level concepts based on semantic similarity
- Generates deduplicated representative document-level concept set O = {o₁, o₂, ..., oₖ}
- Uses OpenAI Ada embedding model for concept embedding
- Applies K-means clustering algorithm to group into K semantically coherent clusters
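The fusion step can be sketched as clustering concept embeddings and keeping one representative per cluster. This is an illustration under stated assumptions: the 2-D vectors are toy stand-ins for Ada embeddings, `kmeans` is a plain pure-Python Lloyd's iteration seeded with the first K points, and picking the concept nearest to each centroid as the representative is an assumed rule that the summary does not specify.

```python
import math


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means, seeded deterministically with the first k points."""
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        assign = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centroids[j] = [sum(xs) / len(members) for xs in zip(*members)]
    return assign, centroids


def fuse_concepts(concepts, embeddings, k):
    """Return one representative concept per cluster (nearest to the centroid)."""
    assign, centroids = kmeans(embeddings, k)
    fused = []
    for j in range(k):
        members = [i for i, a in enumerate(assign) if a == j]
        if members:
            best = min(members, key=lambda i: math.dist(embeddings[i], centroids[j]))
            fused.append(concepts[best])
    return fused


# Hypothetical chunk-level concepts with toy embeddings.
concepts = ["food safety", "import tariff", "food hygiene", "customs duty"]
embeddings = [(1.0, 0.1), (0.1, 1.0), (0.9, 0.2), (0.2, 0.9)]
assign, _ = kmeans(embeddings, k=2)
document_concepts = fuse_concepts(concepts, embeddings, k=2)
```

Here the four chunk-level concepts collapse into two document-level concepts, one per semantic group. In practice K must be chosen by hand, which the limitations below also note.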
Cross-chunk Retrieval:
- For each document-level concept, uses retriever-reranker pipeline to retrieve top-N relevant chunks
- Employs dense retriever and BGE-Reranker-Base for retrieval and reranking
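A retriever-reranker pipeline of this kind can be sketched in two stages: a cheap similarity search builds a candidate pool, then a second scorer reorders only that pool. Everything below is illustrative: cosine similarity over toy vectors stands in for the dense retriever, and the hypothetical `keyword_overlap` function stands in for a cross-encoder such as BGE-Reranker-Base.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(query_vec, chunks, rerank_score, pool_size=3, top_n=2):
    """chunks: list of (chunk_id, text, vector). Returns the top_n chunk ids."""
    # Stage 1: dense retrieval keeps the pool_size most similar chunks.
    pool = sorted(chunks, key=lambda c: cosine(query_vec, c[2]), reverse=True)[:pool_size]
    # Stage 2: the reranker rescores only the pool and decides the final order.
    reranked = sorted(pool, key=lambda c: rerank_score(c[1]), reverse=True)
    return [c[0] for c in reranked[:top_n]]


def keyword_overlap(text, keywords=frozenset({"food", "safety"})):
    """Toy reranker: count words of the chunk that belong to the concept."""
    return sum(1 for word in text.split() if word in keywords)


# Hypothetical chunks for the concept "food safety".
chunks = [
    ("d1", "food safety standards for exporters", (0.9, 0.1)),
    ("d2", "tariff schedules for steel", (0.1, 0.9)),
    ("d3", "safety audits of food safety labs", (0.8, 0.3)),
    ("d4", "food packaging", (0.7, 0.2)),
]
top_chunks = retrieve_then_rerank((1.0, 0.0), chunks, keyword_overlap)
print(top_chunks)  # ['d3', 'd1']
```

The first stage ranks `d4` above `d3`, but the reranker promotes `d3`, which shows why the second pass can change which chunks reach the evidence step.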
Evidence Extraction:
- Performs sentence-level filtering within retrieved chunks
- Extracts concept-focused text subsets, termed evidence e
- Represented as $d^{o_i} \rightarrow \{e^{o_i}_0, e^{o_i}_1, \ldots, e^{o_i}_N\}$
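Sentence-level filtering can be pictured as splitting each retrieved chunk into sentences and keeping only those that bear on the concept. The sketch below uses a simple keyword match as the relevance test, which is an assumption made for illustration; the summary does not say how relevance is decided. `extract_evidence` and the sample chunk are hypothetical.

```python
import re


def extract_evidence(chunk_text, concept):
    """Return the sentences of a chunk that mention any word of the concept."""
    keywords = {w.lower() for w in concept.split()}
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", chunk_text) if s.strip()]
    return [s for s in sentences
            if keywords & set(re.findall(r"[a-z0-9]+", s.lower()))]


chunk = ("Drafting agents cut review time. "
         "The office moved in 2021. "
         "Agents also reduce drafting errors.")
evidence = extract_evidence(chunk, "drafting agents")
print(evidence)
# ['Drafting agents cut review time.', 'Agents also reduce drafting errors.']
```

The unrelated middle sentence is dropped, so the evidence passed to question generation is shorter and more focused than the raw chunk.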
Bloom's Question Types:
Based on six cognitive levels from the revised Bloom's taxonomy:
- Remembering: Identifying or recalling information
- Understanding: Constructing meaning from information
- Applying: Using knowledge in new situations
- Analyzing: Breaking down information and seeking evidence
- Evaluating: Making judgments based on criteria
- Creating: Combining elements to form coherent wholes
Question Generation:
- Supports multi-stem combinations, with combination level ℓ controlling the number of concepts used simultaneously
- When ℓ=1, generation traverses each individual stem; when ℓ≥2, stems are combined to support cross-concept reasoning
- Uses ChatGPT-4o to generate questions, reference answers, reasoning traces, and supporting evidence
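The role of the combination level ℓ can be illustrated with plain combinatorics. This sketch assumes that each stem is anchored to one document-level concept and that every concept combination is paired with every Bloom level; both are simplifications for illustration, and the third concept is a hypothetical addition to the two that appear in the case study. The call to the LLM itself is omitted.

```python
from itertools import combinations

BLOOM_LEVELS = ["remembering", "understanding", "applying",
                "analyzing", "evaluating", "creating"]


def generation_requests(concepts, level):
    """Enumerate (concept combination, Bloom level) pairs for combination level `level`."""
    return [
        {"concepts": combo, "bloom_level": bloom}
        for combo in combinations(concepts, level)
        for bloom in BLOOM_LEVELS
    ]


concepts = ["document drafting agents", "profit and loss", "life sciences"]

single = generation_requests(concepts, level=1)  # one concept per question
paired = generation_requests(concepts, level=2)  # cross-concept questions
print(len(single), len(paired))  # 18 18
print(paired[0]["concepts"])     # ('document drafting agents', 'profit and loss')
```

With K document-level concepts, level ℓ yields C(K, ℓ) combinations, so the number of candidate questions grows quickly with ℓ and some filtering or sampling would likely be needed for large K.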
Context Variant Construction:
Associates four curated context variants with each question-answer instance:
- Full Support: Evidence sentences directly answering the question
- Partial Support: Evidence subset containing incomplete information
- Irrelevant: Domain-relevant but question-irrelevant content
- Misleading: Topic-related but semantically insufficient content
- Global Concept Fusion: Transcends single-chunk limitations through document-level concept extraction, supporting global question generation
- Multi-level Cognitive Modeling: Ensures systematic distribution of question cognitive complexity based on Bloom's taxonomy
- Fine-grained Distractor Strategy: Designs four types of context variants, surpassing random sampling distraction methods
- Cross-chunk Cross-concept Reasoning: Supports multi-stem combinations enabling complex logical chain reasoning
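The four context variants attached to each question-answer instance can be pictured as a small record. The field names, the `build_variants` helper, the rule of taking the first `keep` evidence sentences as partial support, and the sample sentences are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class QACInstance:
    question: str
    answer: str
    contexts: dict = field(default_factory=dict)


def build_variants(question, answer, evidence, off_topic, near_miss, keep=1):
    """Attach the four context variants to one question-answer pair."""
    return QACInstance(question, answer, {
        "full_support": list(evidence),           # sentences that answer the question
        "partial_support": list(evidence[:keep]), # incomplete subset of the evidence
        "irrelevant": list(off_topic),            # same domain, unrelated to the question
        "misleading": list(near_miss),            # on topic, but insufficient to answer
    })


# Invented toy sentences, used only to show the shape of the record.
evidence = ["Drafting agents cut review time.", "Shorter reviews lower operating cost."]
instance = build_variants(
    question="How do drafting agents affect operating cost?",
    answer="They lower it by shortening reviews.",
    evidence=evidence,
    off_topic=["Tariff schedules are revised annually."],
    near_miss=["Drafting agents are popular in large firms."],
)
print(sorted(instance.contexts))
# ['full_support', 'irrelevant', 'misleading', 'partial_support']
```

Keeping the variants side by side on one instance makes it straightforward to sample a mix of supporting and distracting contexts when assembling fine-tuning examples.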
Constructs three domain-specific datasets:
| Domain | Corpus Scale (Train/Eval) | Question Count (RAGen/LlamaIndex/AutoRAG) |
|---|---|---|
| PPFS | 15/3 | 2726/2502/2084 |
| TradePolicy | 20/5 | 1977/1820/1500 |
| BusinessAI | 17/3 | 2228/2118/2072 |
- PPFS: APEC Food Safety Partnership documents
- TradePolicy: Import-export regulations from 8 APEC economies
- BusinessAI: AI adoption technical reports from various business departments
- Retrieval Tasks: Recall@K (K=1,5,10), MRR@10
- Generation Tasks: ROUGE-L, BERT-F1
- AutoRAG: Automatic RAG pipeline configuration framework
- LlamaIndex Dataset Generator: Open-source QA data generator
- Document chunking: 1024 token chunks with 200 token overlap
- Embedding model fine-tuning: Learning rate 1e-5, 3 epochs, temperature parameter τ=0.02
- LLM fine-tuning: LoRA method, learning rate 1e-5, 5 epochs
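Recall@K and MRR@10, the retrieval metrics listed above, follow standard definitions and can be computed as in this sketch. The function names and the toy rankings are illustrative; the paper's own evaluation code is not shown in this summary.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant item within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


# Three toy queries: (ranked chunk ids, set of relevant chunk ids).
queries = [
    (["d3", "d1", "d7"], {"d1"}),  # relevant chunk at rank 2
    (["d2", "d5"], {"d2"}),        # relevant chunk at rank 1
    (["d9", "d8"], {"d4"}),        # relevant chunk never retrieved
]
recall_1 = mean(recall_at_k(r, rel, 1) for r, rel in queries)
recall_5 = mean(recall_at_k(r, rel, 5) for r, rel in queries)
mrr_10 = mean(mrr_at_k(r, rel, 10) for r, rel in queries)
```

On these toy queries Recall@1 is 1/3, Recall@5 is 2/3, and MRR@10 is 0.5, which shows how MRR rewards placing the relevant chunk higher even when Recall@K is unchanged.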
Embedding models fine-tuned on RAGen data achieve the best retrieval performance across all embedding models in all three domains:
BGE-large Model Performance on PPFS Domain:
- Recall@1: RAGen(0.3095) > LlamaIndex(0.2024) > AutoRAG(0.1877)
- MRR@10: RAGen(0.4626) > LlamaIndex(0.3548) > AutoRAG(0.3342)
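The implementation details give a temperature τ=0.02 for embedding fine-tuning but do not name the loss. Assuming an InfoNCE-style contrastive objective, which is a common choice for retriever fine-tuning and is not confirmed by this summary, the sketch below shows how a small τ sharpens the contrast between the positive chunk and the negatives. The similarity values are invented for illustration.

```python
import math


def info_nce(pos_sim, neg_sims, tau=0.02):
    """InfoNCE loss for one query: -log softmax of the positive over all candidates."""
    logits = [pos_sim / tau] + [s / tau for s in neg_sims]
    m = max(logits)  # subtract the max before exponentiating, for numerical stability
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


# Positive chunk is only slightly more similar than the two negatives.
sharp = info_nce(0.8, [0.7, 0.6], tau=0.02)
soft = info_nce(0.8, [0.7, 0.6], tau=1.0)
print(sharp < soft)  # True
```

With τ=0.02 the same small similarity gap produces a loss close to zero, whereas with τ=1.0 the loss stays near log(3), the value for three indistinguishable candidates. A low temperature therefore lets the model count small margins as success, while mis-ranked hard negatives are penalized far more steeply.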
RAGen consistently outperforms baselines across all domains and model scales:
Qwen2.5-3B on PPFS Domain:
- ROUGE-L: RAGen(0.3815) > AutoRAG(0.3436) > LlamaIndex(0.3253)
- BERT-F1: RAGen(0.9079) > AutoRAG(0.8979) > LlamaIndex(0.8952)
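ROUGE-L, reported above, scores a generated answer by its longest common subsequence (LCS) with the reference answer. This sketch computes the F1 form over whitespace tokens; the tokenization and the choice of F1 over a recall-weighted F-measure are assumptions, since the summary does not state the exact configuration used in the paper.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 over lowercased whitespace tokens."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("agents reduce drafting costs", "drafting agents reduce costs")
print(score)  # 0.75
```

The LCS here is "agents reduce costs" (3 of 4 tokens on each side), so precision and recall are both 0.75. Because ROUGE-L rewards in-order overlap only, BERT-F1 is reported alongside it to capture semantic similarity.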
Evaluation in realistic RAG inference settings (k=3):
- Without distractor training: ROUGE-L(0.3143), BERT-F1(0.8957)
- With distractor training: ROUGE-L(0.4074), BERT-F1(0.9121)
Distractor-aware training raises ROUGE-L by 0.0931 (about 30% relative) and BERT-F1 by 0.0164, which supports its effectiveness in this setting.
Question: "How does the integration of document drafting agents affect the incremental profit and loss of life sciences companies?"
- Concepts: Document drafting agents & Profit and loss
- Evidence Sources: Evidence from 3 non-adjacent chunks
- Reasoning Depth: Requires comprehensive analysis across multiple evidence sources
- Cognitive Level Distribution: RAGen generates more higher-order cognitive questions (analyzing, evaluating, creating), significantly reducing lower-level questions
- Cross-concept Capability: Multi-stem combinations achieve global reasoning impossible with traditional single-chunk methods
- Robustness Enhancement: Distractor context training significantly improves model performance in noisy retrieval environments
- CliniQG4QA: Controlled QA pair generation in clinical domain, but relies on template-driven methods
- E2EQR: Multi-hop QA generation, but lacks semantic evidence selection mechanisms
- RAGEval: QA dataset evaluation in RAG contexts, but depends on scenario-specific patterns
- DPR: Improves retrieval through dense representation learning
- GraphRAG: Graph-based retrieval and decoding, but depends on predefined graph patterns
- RAFT: Introduces distractor-aware supervision to improve LLM robustness
- Self-RAG/OpenRAG: Inference-time retrieval control methods
- RAGen framework successfully generates high-quality domain-specific QAC datasets
- Multi-component RAG adaptation strategies significantly outperform single-component optimization methods
- Question generation based on Bloom's taxonomy ensures systematic distribution of cognitive complexity
- Cross-chunk cross-concept reasoning capability enables more comprehensive domain understanding
- Document Format Constraints: Currently supports only text format documents, not PDF or multimodal inputs
- Seed Document Quality Dependency: Generated data quality is significantly affected by source document quality
- Manual Hyperparameter Setting: Document-level concept count K requires manual specification
- Computational Cost: Reliance on ChatGPT-4o may incur substantial computational costs
- Extend to multimodal document processing capabilities
- Automate hyperparameter selection mechanisms
- Reduce dependency on commercial APIs
- Support larger-scale enterprise applications
- Methodological Innovation: First to propose a unified data generation framework supporting multi-component RAG adaptation
- Solid Theoretical Foundation: Question generation is grounded in Bloom's taxonomy, an established pedagogical framework
- Comprehensive Experiments: Validates method effectiveness across three different domains with well-designed comparative experiments
- High Practical Value: Addresses practical needs of enterprise-level RAG system adaptation
- Evaluation Limitations: Validation covers only three domains, so generalization requires broader verification
- Missing Computational Cost Analysis: Lacks detailed analysis of framework computational overhead and time complexity
- Absence of Human Evaluation: Primarily relies on automatic evaluation metrics, lacks human quality assessment
- Unknown Long-term Effects: Does not evaluate long-term adaptation capability in dynamically evolving domains
- Academic Contribution: Provides new research paradigm for domain adaptation of RAG systems
- Practical Value: Offers practical solutions for enterprise knowledge bases and research domains
- Reproducibility: Detailed method description and clear experimental setup provide good reproducibility
- Enterprise Knowledge Bases: Suitable for enterprise internal knowledge management systems requiring frequent updates
- Research Literature: Appropriate for handling rapidly evolving research domain literature
- Professional Consulting: Applicable to intelligent question-answering systems in legal, medical and other professional domains
- Educational Training: Bloom's taxonomy characteristics make it suitable for educational scenario applications
The paper cites several important related works, including the seminal RAG work by Lewis et al. (2020), the RAFT method by Zhang et al. (2024c), and inference-time retrieval control methods such as Self-RAG by Asai et al. (2023), which indicates broad coverage of the related research areas.