MoM: Mixtures of Scenario-Aware Document Memories for Retrieval-Augmented Generation Systems
Zhao, Ji, Niu et al.
The traditional RAG paradigm, which typically engages in the comprehension of relevant text chunks in response to received queries, inherently restricts both the depth of knowledge internalization and reasoning capabilities. To address this limitation, our research transforms the text processing in RAG from passive chunking to proactive understanding, defining this process as document memory extraction with the objective of simulating human cognitive processes during reading. Building upon this, we propose the Mixtures of scenario-aware document Memories (MoM) framework, engineered to efficiently handle documents from multiple domains and train small language models (SLMs) to acquire the ability to proactively explore and construct document memories. The MoM initially instructs large language models (LLMs) to simulate domain experts in generating document logical outlines, thereby directing structured chunking and core content extraction. It employs a multi-path sampling and multi-perspective evaluation mechanism, specifically designing comprehensive metrics that represent chunk clarity and extraction completeness to select the optimal document memories. Additionally, to infuse deeper human-like reading abilities during the training of SLMs, we incorporate a reverse reasoning strategy, which deduces refined expert thinking paths from high-quality outcomes. Finally, leveraging diverse forms of content generated by MoM, we develop a three-layer document memory retrieval mechanism, which is grounded in our theoretical proof from the perspective of probabilistic modeling. Extensive experimental results across three distinct domains demonstrate that the MoM framework not only resolves text chunking challenges in existing RAG systems, providing LLMs with semantically complete document memories, but also paves the way for SLMs to achieve human-centric intelligent text processing.
Traditional retrieval-augmented generation (RAG) pipelines typically answer a query by having the model read the retrieved text chunks, an approach that inherently limits the depth of knowledge internalization and reasoning. To address this limitation, the work recasts text processing in RAG from passive chunking to proactive understanding, which it defines as document memory extraction and which aims to simulate the cognitive process of human reading. On this basis, the authors propose the Mixtures of scenario-aware document Memories (MoM) framework, designed to handle documents from multiple domains efficiently and to train small language models (SLMs) to proactively explore and construct document memories.
Traditional RAG systems suffer from a fundamental cognitive gap: they reduce document processing to a mechanical preprocessing step and take a passive "chunk first, understand later" approach that runs counter to how human experts read. This leads to three problems:
Loss of Semantic Integrity: Traditional chunking methods (fixed-length, recursive chunking, etc.) ignore the deep semantic coherence and logical structure of documents
Knowledge Fragmentation: Existing methods follow bottom-up construction logic, lacking macroscopic understanding of document architecture
Limited Reasoning Capacity: Passive chunking constrains the model's knowledge internalization depth and reasoning capabilities
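To make the first problem concrete, here is a minimal sketch of fixed-length chunking. The function name `fixed_length_chunks`, the 20-character window, and the sample text are illustrative assumptions, not taken from the paper; the point is only that a size-based split pays no attention to sentence boundaries.

```python
def fixed_length_chunks(text: str, size: int) -> list[str]:
    """Split text into consecutive windows of `size` characters.

    Sentence and paragraph boundaries are ignored on purpose: this is the
    behavior the summary describes as losing semantic integrity.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


# Hypothetical sample text, chosen so that a window boundary falls mid-word.
doc = "MoM builds an outline first. Chunks then follow the outline."
chunks = fixed_length_chunks(doc, 20)

for chunk in chunks:
    print(repr(chunk))
```

The first chunk ends in the middle of the word "outline", so a retriever that returns it alone hands the generator an incomplete statement.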
The core idea is to simulate how human experts read complex documents: first grasp the macroscopic logical structure, then identify the key arguments, and finally form structured, hierarchical memories.
Active Memory Extraction Paradigm: Proposes replacing passive text chunking with active memory extraction, constructing structured document memories through global understanding
Three-Layer Document Memory Retrieval Mechanism: Develops a retrieval algorithm supported by a theoretical analysis based on probabilistic modeling, which reduces information loss more effectively than traditional fusion strategies
Reverse Reasoning Strategy: Designs a method for constructing Chain of Memory (CoM) data by deducing expert reasoning paths from high-quality outcomes, enabling SLMs to carry out complex memory extraction tasks autonomously
Multi-Domain Validation: Validates the effectiveness of the MoM framework on datasets from three different domains, constructing 40K training samples and training multiple MemReader models
Expert Simulation: Uses a large language model M_G as a guidance model to simulate domain-specific experts, generating a logical outline O for each document through scenario-aware prompting.
Multi-Path Sampling: Varies M_G's decoding parameters to generate N candidate document memories for the same document D.
Multi-Dimensional Evaluation: Designs two quantitative evaluation metrics, chunk clarity and extraction completeness, which are used to select the optimal document memory among the candidates.
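The sampling and evaluation steps can be sketched as a selection over candidates. Everything named here is an assumption made for illustration: the `Candidate` container, the weighted-sum combination rule, and the two toy scorers are hypothetical stand-ins, since this summary does not give the paper's metric definitions or how the two scores are combined.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Candidate:
    """One sampled document memory (hypothetical container, not the paper's type)."""
    outline: list[str]
    core_content: str
    atomic_chunks: list[str]


def select_best(
    candidates: Sequence[Candidate],
    clarity: Callable[[Candidate], float],
    completeness: Callable[[Candidate], float],
    weight: float = 0.5,
) -> Candidate:
    """Return the candidate with the highest weighted sum of the two scores.

    The weighted sum is an assumed combination rule, used only for this sketch.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(
        candidates,
        key=lambda c: weight * clarity(c) + (1 - weight) * completeness(c),
    )


def toy_clarity(c: Candidate) -> float:
    """Toy proxy: fraction of chunks that end on sentence punctuation."""
    if not c.atomic_chunks:
        return 0.0
    ended = sum(ch.rstrip().endswith((".", "!", "?")) for ch in c.atomic_chunks)
    return ended / len(c.atomic_chunks)


def make_toy_completeness(document: str) -> Callable[[Candidate], float]:
    """Toy proxy: share of the document's words that appear in the core content."""
    doc_words = set(document.lower().split())

    def score(c: Candidate) -> float:
        if not doc_words:
            return 0.0
        return len(doc_words & set(c.core_content.lower().split())) / len(doc_words)

    return score


document = "RAG retrieves chunks. MoM extracts memories."
a = Candidate(["intro"], "RAG retrieves chunks.", ["RAG retrieves", "chunks. MoM"])
b = Candidate(
    ["intro", "method"],
    "RAG retrieves chunks. MoM extracts memories.",
    ["RAG retrieves chunks.", "MoM extracts memories."],
)
best = select_best([a, b], toy_clarity, make_toy_completeness(document))
```

In a real pipeline the two callables would be replaced by the paper's model-based metrics, and the candidates would come from repeated decoding with different sampling parameters.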
Reverse Reasoning: Uses the guidance model M_G, given the original document D and the optimal document memory M_doc, to generate reasoning paths P, which constitute high-quality CoM data.
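Three-Layer Retrieval: Builds a three-layer retrieval mechanism over the logical outline O, core content C, and atomic chunks A, retrieving from each layer independently and then fusing the results; the paper's probabilistic analysis argues that this avoids information loss more effectively than traditional fusion strategies.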
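A minimal sketch of "retrieve per layer, then fuse the results" follows. The Jaccard token overlap, the layer names, the merge-by-score rule, and the sample entries are all assumptions chosen to keep the example self-contained; the paper's actual retriever and fusion rule are not described in this summary.

```python
def tokens(text: str) -> set[str]:
    """Lowercased whitespace tokens (toy tokenizer)."""
    return set(text.lower().split())


def jaccard(a: str, b: str) -> float:
    """Token-set overlap, used here as a stand-in for embedding similarity."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve(query: str, entries: list[tuple[str, str]], k: int):
    """Return the top-k (score, doc_id, text) triples for one layer."""
    scored = [(jaccard(query, text), doc_id, text) for doc_id, text in entries]
    return sorted(scored, key=lambda s: s[0], reverse=True)[:k]


def three_layer_retrieve(query: str, layers: dict[str, list[tuple[str, str]]], k: int = 2):
    """Query every layer independently, then merge the hits into one ranked list."""
    merged = []
    for name, entries in layers.items():
        for score, doc_id, text in retrieve(query, entries, k):
            merged.append({"layer": name, "doc_id": doc_id, "text": text, "score": score})
    merged.sort(key=lambda hit: hit["score"], reverse=True)
    return merged


# Hypothetical index: one entry per document in each of the three layers.
layers = {
    "outline": [("d1", "chunking methods overview"), ("d2", "memory retrieval mechanism")],
    "core": [("d1", "fixed length chunking ignores structure"),
             ("d2", "three layer memory retrieval fuses results")],
    "atomic": [("d1", "recursive chunking splits on separators"),
               ("d2", "each layer is retrieved independently")],
}
hits = three_layer_retrieve("memory retrieval", layers, k=1)
```

Each layer contributes its own hits before anything is merged, so a query that matches only the outline wording is not diluted by the other two layers.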
Atomic chunk clarity shows correlation coefficients with ROUGE-L of 0.7044, 0.7585, and 0.7248 across three evaluation models, demonstrating strong positive correlation.
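For readers who want to run this kind of check on their own data, the sketch below computes a correlation coefficient between per-sample clarity scores and ROUGE-L scores. It assumes the Pearson coefficient, which this summary does not confirm, and the numbers are made-up placeholders rather than values from the paper.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Placeholder scores for four hypothetical chunking runs (not from the paper).
clarity_scores = [0.2, 0.4, 0.6, 0.8]
rouge_l_scores = [0.30, 0.35, 0.50, 0.55]
r = pearson(clarity_scores, rouge_l_scores)
print(f"r = {r:.4f}")
```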
The paper cites 32 related references covering RAG foundational theory, text chunking methods, memory system design and other key areas, providing solid theoretical foundation for the research.
Overall Assessment: This is an important paper with significant innovation in the RAG field. By introducing a cognitive science perspective to redefine document processing paradigms, it achieves breakthroughs both theoretically and practically. Despite some limitations, its pioneering approach and rigorous experimental validation make it an important contribution to the field.