Multimodal RAG for Unstructured Data: Leveraging Modality-Aware Knowledge Graphs with Hybrid Retrieval
R, Upadhya
Current Retrieval-Augmented Generation (RAG) systems primarily operate on unimodal textual data, limiting their effectiveness on unstructured multimodal documents. Such documents often combine text, images, tables, equations, and graphs, each contributing unique information. In this work, we present a Modality-Aware Hybrid retrieval Architecture (MAHA), designed specifically for multimodal question answering with reasoning through a modality-aware knowledge graph. MAHA integrates dense vector retrieval with structured graph traversal, where the knowledge graph encodes cross-modal semantics and relationships. This design enables both semantically rich and context-aware retrieval across diverse modalities. Evaluations on multiple benchmark datasets demonstrate that MAHA substantially outperforms baseline methods, achieving a ROUGE-L score of 0.486 while providing complete modality coverage. These results highlight MAHA's ability to combine embeddings with explicit document structure, enabling effective multimodal retrieval. Our work establishes a scalable and interpretable retrieval framework that advances RAG systems by enabling modality-aware reasoning over unstructured multimodal data.
Current Retrieval-Augmented Generation (RAG) systems operate primarily on unimodal text, which limits their effectiveness on unstructured multimodal documents that mix text, images, tables, equations, and diagrams. The paper proposes the Modality-Aware Hybrid Architecture (MAHA), designed for multimodal question answering with reasoning over a modality-aware knowledge graph. MAHA combines dense vector retrieval with structured graph traversal, where the knowledge graph encodes cross-modal semantics and relationships. This design enables semantically rich, context-aware retrieval across modalities. Evaluation on multiple benchmark datasets shows that MAHA substantially outperforms baseline methods, achieving a ROUGE-L score of 0.486 with complete modality coverage.
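For readers unfamiliar with the metric, ROUGE-L scores a generated answer against a reference by the length of their longest common subsequence (LCS), combined into an F-measure of LCS-based precision and recall. The following is a minimal sketch, not the paper's evaluation code: lowercased whitespace tokenization, beta = 1, and the function names lcs_length and rouge_l are all assumptions made here for illustration, and the paper's exact ROUGE-L configuration is not stated in this summary.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """Sentence-level ROUGE-L F-measure (assumed setup: whitespace tokens, beta=1)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens on the LCS
    recall = lcs / len(ref)      # share of reference tokens on the LCS
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


# Illustrative sentences, not data from the paper.
score = rouge_l("the cat sat on the mat", "the cat is on the mat")
```

In this toy pair the LCS is "the cat on the mat" (5 of 6 tokens on each side), so precision and recall are both 5/6. Published scores usually come from an established ROUGE implementation with its own tokenization and stemming, so values from this sketch are not directly comparable to the 0.486 reported above.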
Existing RAG systems face the following core challenges:
Unimodal Limitations: Traditional RAG systems primarily handle text data and cannot effectively process complex documents containing multimodal content such as images, tables, and equations
Missing Cross-Modal Relationships: Lack of capability to understand and leverage complex relationships between different modalities, such as correspondences between textual descriptions and tabular data
Insufficient Structured Reasoning: Existing methods struggle to model complex interdependencies among multimodal components
Vast amounts of information now exist in unstructured multimodal formats, including PDF documents, scanned files, and technical documents with complex tables and diagrams. Effective retrieval and synthesis of this information are crucial for decision-making across many domains.
Existing approaches also have the following limitations:
Insufficient Cross-Modal Alignment: No mechanism to semantically link content across different modalities
Static Retrieval Process: Unable to adapt to dynamic or evolving information spaces
Shallow Knowledge Graph Integration: Knowledge graphs in existing hybrid RAG frameworks are primarily text-centric, lacking explicit support for multimodal inputs
Absence of Customized Strategies: No unified strategies specifically designed to handle text, images, tables, diagrams, and equations jointly
Given a collection of unstructured documents D containing multiple modalities (text, images, tables, equations, diagrams) and a user query q, the system must:
Retrieve relevant multimodal evidence fragments
Synthesize cross-modal information to generate accurate and complete answers
Maintain interpretability and contextual consistency
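The retrieval requirement above, together with the hybrid design described earlier (dense vector retrieval followed by graph traversal), can be sketched as follows. This is an illustrative toy under stated assumptions, not MAHA's implementation: the Node class, the hybrid_retrieve and modality_coverage functions, the hand-written 3-dimensional embeddings, the hop-based expansion rule, and the coverage definition (fraction of expected modalities present in the retrieved set) are all hypothetical choices made here.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """One evidence fragment in a hypothetical modality-aware graph."""
    node_id: str
    modality: str    # e.g. "text", "table", "image"
    content: str
    embedding: list  # toy dense vector; a real system would use an encoder
    neighbors: list = field(default_factory=list)  # ids of linked fragments


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hybrid_retrieve(query_vec, graph, top_k=2, hops=1):
    """Pick dense top-k seed nodes, then expand along graph edges for `hops` steps."""
    ranked = sorted(graph.values(),
                    key=lambda n: cosine(query_vec, n.embedding),
                    reverse=True)
    selected = [n.node_id for n in ranked[:top_k]]
    frontier = list(selected)
    for _ in range(hops):
        next_frontier = []
        for node_id in frontier:
            for nb in graph[node_id].neighbors:
                if nb not in selected:
                    selected.append(nb)
                    next_frontier.append(nb)
        frontier = next_frontier
    return [graph[i] for i in selected]


def modality_coverage(nodes, expected):
    """Fraction of expected modalities found among the retrieved nodes (assumed definition)."""
    if not expected:
        return 0.0
    found = {n.modality for n in nodes}
    return len(found & set(expected)) / len(expected)


# Toy graph: a text passage linked to a table, which is linked to a chart.
graph = {
    "t1": Node("t1", "text", "Revenue grew in 2023.", [1.0, 0.0, 0.0], ["tab1"]),
    "tab1": Node("tab1", "table", "Revenue by quarter", [0.0, 1.0, 0.0], ["t1", "img1"]),
    "img1": Node("img1", "image", "Revenue bar chart", [0.0, 0.0, 1.0], ["tab1"]),
    "t2": Node("t2", "text", "Office locations.", [-1.0, 0.0, 0.0], []),
}
one_hop = hybrid_retrieve([0.9, 0.1, 0.0], graph, top_k=1, hops=1)
two_hop = hybrid_retrieve([0.9, 0.1, 0.0], graph, top_k=1, hops=2)
```

With a single dense seed, one hop returns the text passage and its linked table; a second hop also reaches the chart, so all three modalities are covered even though the chart's embedding is orthogonal to the query. This shows how explicit cross-modal edges can surface evidence that embedding similarity alone would miss.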
The paper cites 32 references, which fall mainly into the following groups:
RAG Fundamentals: Retrieval methods and tools such as BM25, FAISS, and SBERT
Multimodal Models: CLIP, Kosmos-1, MM-ReAct, etc.
Knowledge Graph Methods: Various KG-enhanced RAG frameworks
Evaluation Benchmarks: UDA, MRAMG-Bench, REAL-MM-RAG-Bench, etc.
Overall Assessment: This is a high-quality research paper that proposes an innovative solution to the important and challenging problem of multimodal RAG. By pairing a modality-aware knowledge graph with a hybrid retrieval strategy, the MAHA architecture makes a significant technical advance, and its experimental results are convincing. Although there is room for improvement in system complexity and generalization, the work lays an important foundation for multimodal information retrieval research and has high academic value and practical potential.