Multimodal RAG for Unstructured Data: Leveraging Modality-Aware Knowledge Graphs with Hybrid Retrieval

Rashmi R, Vidyadhar Upadhya

Current Retrieval-Augmented Generation (RAG) systems primarily operate on unimodal textual data, limiting their effectiveness on unstructured multimodal documents. Such documents often combine text, images, tables, equations, and graphs, each contributing unique information. In this work, we present a Modality-Aware Hybrid retrieval Architecture (MAHA), designed specifically for multimodal question answering with reasoning through a modality-aware knowledge graph. MAHA integrates dense vector retrieval with structured graph traversal, where the knowledge graph encodes cross-modal semantics and relationships. This design enables both semantically rich and context-aware retrieval across diverse modalities. Evaluations on multiple benchmark datasets demonstrate that MAHA substantially outperforms baseline methods, achieving a ROUGE-L score of 0.486, providing complete modality coverage. These results highlight MAHA's ability to combine embeddings with explicit document structure, enabling effective multimodal retrieval. Our work establishes a scalable and interpretable retrieval framework that advances RAG systems by enabling modality-aware reasoning over unstructured multimodal data.

Basic Information

Paper ID: 2510.14592
Title: Multimodal RAG for Unstructured Data: Leveraging Modality-Aware Knowledge Graphs with Hybrid Retrieval
Authors: Rashmi R (National Institute of Technology Karnataka), Vidyadhar Upadhya (National Institute of Technology Karnataka)
Classification: cs.LG (Machine Learning), cs.IR (Information Retrieval)
Publication Date: October 16, 2025
Paper Link: https://arxiv.org/abs/2510.14592v1

Abstract

Current Retrieval-Augmented Generation (RAG) systems primarily operate on unimodal text data, which limits their effectiveness on unstructured multimodal documents that combine text, images, tables, equations, and diagrams. This paper proposes the Modality-Aware Hybrid retrieval Architecture (MAHA), designed for multimodal question answering that reasons over a modality-aware knowledge graph. MAHA combines dense vector retrieval with structured graph traversal, where the knowledge graph encodes cross-modal semantics and relationships. This design enables semantically rich and context-aware retrieval across different modalities. Evaluation on multiple benchmark datasets shows that MAHA outperforms baseline methods, achieving a ROUGE-L score of 0.486 with complete modality coverage.

Research Background and Motivation

Problem Definition

Existing RAG systems face the following core challenges:

Unimodal Limitations: Traditional RAG systems primarily handle text data and cannot effectively process complex documents containing multimodal content such as images, tables, and equations
Missing Cross-Modal Relationships: Lack of capability to understand and leverage complex relationships between different modalities, such as correspondences between textual descriptions and tabular data
Insufficient Structured Reasoning: Existing methods struggle to model complex interdependencies among multimodal components

Research Significance

In the data-rich era, vast amounts of information exist in unstructured multimodal formats, including PDF documents, scanned files, and technical documents with complex tables and diagrams. Effective retrieval and synthesis of this information is crucial for decision-making across various domains.

Limitations of Existing Approaches

Insufficient Cross-Modal Alignment: Lack of mechanisms to semantically link different modality content
Static Retrieval Process: Unable to adapt to dynamic or evolving information spaces
Shallow Knowledge Graph Integration: Knowledge graphs in existing hybrid RAG frameworks are primarily text-centric, lacking explicit support for multimodal inputs
Absence of Customized Strategies: No unified strategies specifically designed to handle text, images, tables, diagrams, and equations jointly

Core Contributions

Proposes MAHA Architecture: A modality-aware hybrid retrieval architecture designed for unstructured multimodal data, presented as the first of its kind
Modality-Aware Knowledge Graph: Extends existing text-centric KG schemas by introducing cross-modal semantic relationships
Hybrid Retrieval Strategy: Fuses dense vector retrieval with structured graph traversal
Comprehensive Experimental Validation: Achieves significant performance improvements on multiple benchmark datasets with complete modality coverage
Novel Evaluation Metrics: Proposes modality coverage metrics to quantify the system's cross-modal retrieval capabilities

Methodology Details

Task Definition

Given a collection of unstructured documents D containing multiple modalities (text, images, tables, equations, diagrams) and a user query q, the system must:

Retrieve relevant multimodal evidence fragments
Synthesize cross-modal information to generate accurate and complete answers
Maintain interpretability and contextual consistency

Model Architecture

1. Document Ingestion and Embedding Module

Multimodal Parsing: Segments documents into semantically meaningful chunks, including text, tables, diagrams, images, and equations
Heterogeneous Encoding:
- Text: Transformed into embeddings using OpenAI text-embedding-3-small
- Tables: Converted to HTML format
- Equations: Encoded as structured equations (LaTeX)
- Visual Elements: Encoded using a CLIP model and stored in base64 format
Summary Generation: Generates textual summaries for non-textual data and embeds them
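
The per-modality encodings listed above can be illustrated with a small record type. This is a minimal sketch, not the paper's implementation: the `Chunk` class and the helpers `table_to_html`, `image_to_base64`, and `text_to_embed` are hypothetical names introduced here, and the calls to an embedding model or to CLIP are deliberately left out.

```python
import base64
from dataclasses import dataclass
from html import escape


@dataclass
class Chunk:
    # Hypothetical record for one parsed chunk (not from the paper).
    chunk_id: str
    modality: str      # "text", "table", "equation", or "image"
    payload: str       # plain text, HTML, LaTeX, or base64 depending on modality
    summary: str = ""  # textual summary used for embedding non-text chunks


def table_to_html(rows):
    """Serialize rows (first row is the header) as a minimal HTML table."""
    header, *body = rows
    parts = ["<tr>" + "".join(f"<th>{escape(str(c))}</th>" for c in header) + "</tr>"]
    for row in body:
        parts.append("<tr>" + "".join(f"<td>{escape(str(c))}</td>" for c in row) + "</tr>")
    return "<table>" + "".join(parts) + "</table>"


def image_to_base64(raw_bytes):
    """Encode raw image bytes as an ASCII base64 string."""
    return base64.b64encode(raw_bytes).decode("ascii")


def text_to_embed(chunk):
    """Text chunks are embedded directly; other modalities via their summary."""
    return chunk.payload if chunk.modality == "text" else chunk.summary


table = Chunk("tab1", "table",
              table_to_html([["Year", "Revenue"], [2023, "4.1M"]]),
              summary="Revenue by year")
equation = Chunk("eq1", "equation", r"E = mc^2", summary="Mass-energy equivalence")
```

Keeping the original payload next to a text summary means the chunk can be found through text embeddings while the generator still receives the table, equation, or image itself.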

2. Vector Store Indexing and Knowledge Graph Construction

Vector Store: Indexes multimodal representations supporting fast similarity-based retrieval
Modality-Aware KG:
- Nodes: Represent entities of different modalities (text, equations, images, tables)
- Edges: Capture semantic relationships such as "NEXT-TEXT", "NEXT-TABLE", "HAS-IMAGE", "HAS-FORMULA", etc.
- Construction Process: Schema-driven, including named entity linking, coreference resolution, and relationship reasoning
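
A modality-aware graph of this kind can be sketched with plain dictionaries. The relation labels below are the ones named above; the `ModalityKG` class, its method names, the node ids, and the edge directions are assumptions made for illustration, and the entity linking and coreference steps are not shown.

```python
from collections import defaultdict


class ModalityKG:
    """Hypothetical in-memory graph: typed nodes plus labeled, directed edges."""

    def __init__(self):
        self.nodes = {}                  # node_id -> modality
        self.edges = defaultdict(list)   # node_id -> [(relation, target_id)]

    def add_node(self, node_id, modality):
        self.nodes[node_id] = modality

    def add_edge(self, source, relation, target):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added before linking them")
        self.edges[source].append((relation, target))

    def neighbors(self, node_id, relations=None):
        """Targets reachable in one hop, optionally filtered by relation label."""
        return [t for r, t in self.edges.get(node_id, [])
                if relations is None or r in relations]


kg = ModalityKG()
kg.add_node("p1", "text")
kg.add_node("p2", "text")
kg.add_node("tab1", "table")
kg.add_node("eq1", "equation")
kg.add_edge("p1", "NEXT-TEXT", "p2")
kg.add_edge("p1", "NEXT-TABLE", "tab1")
kg.add_edge("p2", "HAS-FORMULA", "eq1")

linked_tables = kg.neighbors("p1", {"NEXT-TABLE"})
```

Filtering neighbors by relation label is what lets a retriever ask for, say, only the tables attached to a paragraph.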

3. Hybrid Retrieval Mechanism

Vector Retrieval: Encodes queries into embeddings and matches semantically similar content chunks
Graph Traversal: Retrieves supporting information based on entity relationships and graph traversal
Fusion Strategy: Balances semantic similarity and structural traversal to ensure relevance and coverage
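
One simple way to combine the two retrieval paths is to take dense-retrieval hits as seeds and then expand them one hop through the graph. The sketch below assumes that scheme. This summary does not specify the paper's fusion rule, so `hybrid_retrieve`, the `hop_decay` weight, and the toy vectors are all illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_retrieve(query_vec, vectors, graph, k=2, hop_decay=0.5):
    """vectors: chunk_id -> embedding; graph: chunk_id -> linked chunk_ids."""
    # Step 1: dense retrieval picks the k most similar chunks as seeds.
    sims = {cid: cosine(query_vec, vec) for cid, vec in vectors.items()}
    seeds = sorted(sims, key=sims.get, reverse=True)[:k]
    scores = {cid: sims[cid] for cid in seeds}
    # Step 2: one-hop traversal adds chunks linked from each seed,
    # scored as a decayed share of the seed's similarity.
    for seed in seeds:
        for neighbor in graph.get(seed, []):
            inherited = hop_decay * sims[seed]
            scores[neighbor] = max(scores.get(neighbor, 0.0), inherited)
    # Step 3: rank the union of both evidence sources.
    return sorted(scores, key=scores.get, reverse=True)


vectors = {"p1": [1.0, 0.0], "p2": [0.0, 1.0], "tab1": [0.1, 0.9]}
graph = {"p1": ["tab1"]}
result = hybrid_retrieve([1.0, 0.0], vectors, graph, k=1)
```

In this toy case the table's summary embedding is far from the query, so vector search alone would miss it; the graph link from `p1` brings it into the result.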

4. Context-Aware Generation

Utilizes large language models to synthesize retrieved multimodal information and generate coherent, accurate, and interpretable answers.
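
As a rough illustration of this step, retrieved chunks can be serialized into a single prompt with their modality and id attached, so the answer can point back to its evidence. The `build_prompt` function and the prompt wording are hypothetical, and the call to a language model is left out.

```python
def build_prompt(question, chunks):
    """chunks: list of (chunk_id, modality, content) tuples in ranked order."""
    lines = [
        "Answer the question using only the evidence below.",
        "Cite evidence by its bracketed id.",
        "",
    ]
    for chunk_id, modality, content in chunks:
        # Tag each piece of evidence with its id and modality.
        lines.append(f"[{chunk_id}] ({modality}) {content}")
    lines += ["", f"Question: {question}", "Answer:"]
    return "\n".join(lines)


prompt = build_prompt(
    "What was revenue in 2023?",
    [("p1", "text", "Revenue is reported in Table 1."),
     ("tab1", "table", "<table><tr><td>2023</td><td>4.1M</td></tr></table>")],
)
```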

Technical Innovations

Cross-Modal Relationship Modeling: Introduces explicit cross-modal semantic relationships into a RAG system, presented as the first work to do so
Hybrid Retrieval Fusion: Combines the strengths of vector similarity search and graph structure traversal
Modality-Aware Indexing: Achieves seamless integration of semantic and structured retrieval through unified indexing
Enhanced Interpretability: Graph metadata provides interpretability for retrieval decisions

Experimental Setup

Datasets

UDA Benchmark Suite:
- Finance Domain: Contains financial reports with complex layouts, testing numerical reasoning capabilities
- Academic Domain: From academic papers, testing complex technical content reasoning
- World Knowledge: Wikipedia pages, evaluating performance across diverse topics
MRAMG-Bench: From web, academic, and lifestyle domains, specifically designed to test multimodal reasoning capabilities
REAL-MM-RAG-Bench: High-quality finance domain benchmark containing text, tables, and images

Evaluation Metrics

Retrieval Metrics

Recall@K: Proportion of queries for which a correct document chunk appears in the top K results
MRR (Mean Reciprocal Rank): Mean, over all queries, of the reciprocal rank of the first correct result
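
Both retrieval metrics follow directly from these definitions. In the sketch below, a query with no correct chunk in its ranking contributes 0 to MRR, which is a common convention; the function names and toy data are illustrative.

```python
def recall_at_k(ranked_lists, gold, k):
    """Fraction of queries whose top-k results contain at least one gold chunk."""
    hits = sum(1 for ranked, g in zip(ranked_lists, gold) if set(ranked[:k]) & g)
    return hits / len(ranked_lists)


def mean_reciprocal_rank(ranked_lists, gold):
    """Mean of 1/rank of the first gold chunk; 0 for queries with no hit."""
    total = 0.0
    for ranked, g in zip(ranked_lists, gold):
        for rank, chunk_id in enumerate(ranked, start=1):
            if chunk_id in g:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)


ranked = [["a", "b", "c"], ["x", "y", "z"], ["m", "n", "o"]]
gold = [{"a"}, {"z"}, {"q"}]

r_at_2 = recall_at_k(ranked, gold, 2)      # only the first query hits: 1/3
r_at_3 = recall_at_k(ranked, gold, 3)      # first and second queries hit: 2/3
mrr = mean_reciprocal_rank(ranked, gold)   # (1 + 1/3 + 0) / 3 = 4/9
```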

Generation Metrics

ROUGE-L: Overlap between generated and reference answers, measured by their longest common subsequence (LCS)
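
A minimal ROUGE-L sketch based on the longest common subsequence is shown below. It assumes lowercased whitespace tokenization and reports the F1 form; this summary does not state which tokenizer or ROUGE-L variant the paper uses, so treat the exact choices as assumptions.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 from LCS-based precision and recall."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("revenue grew 12 percent in 2023",
                   "revenue grew by 12 percent in 2023")
# LCS has 6 tokens: precision 6/6, recall 6/7, so F1 = 12/13
```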

Multimodal Metrics

Modality Coverage: Newly proposed metric, calculated as:

$$\mathrm{Coverage}(q) = \frac{|M_{\mathrm{gt}}(q) \cap M_{\mathrm{ret}}(q)|}{|M_{\mathrm{gt}}(q)|}$$

where $M_{\mathrm{gt}}(q)$ is the set of modalities required in the reference answer and $M_{\mathrm{ret}}(q)$ is the set of modalities retrieved by the system.
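
The coverage formula reduces to a set intersection. The sketch below computes it for one query; the function name and the choice to reject an empty required set (which would divide by zero) are assumptions made here.

```python
def modality_coverage(required, retrieved):
    """Share of the required modalities that appear among the retrieved ones."""
    required, retrieved = set(required), set(retrieved)
    if not required:
        raise ValueError("the reference answer must require at least one modality")
    return len(required & retrieved) / len(required)


partial = modality_coverage({"text", "table", "image"}, {"text", "table"})
full = modality_coverage({"text", "table"}, {"text", "table", "equation"})
```

Here `partial` is 2/3 because the required image was never retrieved, while `full` is 1.0: retrieving an extra modality does not reduce the score.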

Comparison Methods

BM25: Frequency-based sparse retriever
FAISS + SBERT: Dense vector retriever
CLIP: Image-only retriever
Hybrid (BM25 + FAISS): Traditional hybrid method
Graph Traversal (KG Retriever): Pure graph traversal method
Existing Multimodal RAG Frameworks: HybridRAG, HybGRAG, KG-Guided RAG, etc.

Experimental Results

Main Results

Comparison with Baseline Methods

MAHA outperforms the baseline methods on every reported metric:

ROUGE-L: 0.486 (72% relative improvement over vector-only retrieval)
Recall@3: 0.79-0.81
MRR: 0.74 (19-21% relative improvement over baselines)
Modality Coverage: 1.00 (complete coverage)

Comparison with Existing Multimodal RAG Frameworks

MAHA is the only method achieving complete modality coverage (1.00)
Other methods achieve modality coverage between 0.00 and 0.39
MAHA also achieves the highest scores on all reported performance metrics

Ablation Study

The ablation compares three configurations to isolate the contribution of each component:

Vector-Only: ROUGE-L 0.282, Recall@3 0.70, MRR 0.61
Graph-Only: ROUGE-L 0.337, Recall@3 0.68, MRR 0.62
MAHA: ROUGE-L 0.486, Recall@3 0.79, MRR 0.74

Results demonstrate:

Vector retrieval captures local semantics but lacks structural cues
Graph traversal provides structural relationships but struggles to independently discover rich evidence
Hybrid approach achieves the best performance of the three configurations, indicating that the two methods are complementary
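
The relative gains quoted for MAHA can be checked against the three ablation rows above. The snippet uses only the reported numbers; the variable and function names are illustrative.

```python
ablation = {
    "Vector-Only": {"rouge_l": 0.282, "recall_at_3": 0.70, "mrr": 0.61},
    "Graph-Only": {"rouge_l": 0.337, "recall_at_3": 0.68, "mrr": 0.62},
    "MAHA": {"rouge_l": 0.486, "recall_at_3": 0.79, "mrr": 0.74},
}


def relative_gain(new, old):
    """Relative improvement of new over old, in percent."""
    return 100.0 * (new - old) / old


maha = ablation["MAHA"]
gains = {
    name: {metric: round(relative_gain(maha[metric], value), 1)
           for metric, value in row.items()}
    for name, row in ablation.items() if name != "MAHA"
}

for name, row in gains.items():
    print(name, row)
```

The ROUGE-L gain over the vector-only configuration comes out at about 72.3%, and the MRR gains are about 21.3% (vector-only) and 19.4% (graph-only), consistent with the 72% and 19-21% figures reported.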

Experimental Findings

Synergistic Effects: Combining structural reasoning with semantic similarity yields results that neither achieves alone (ROUGE-L 0.486 versus 0.282 for vector-only and 0.337 for graph-only)
Importance of Cross-Modal Links: Explicit modality-aware links enable the system to retrieve multimodal evidence that would otherwise be missed
Value of Complete Coverage: Achieving complete modality coverage is crucial for generating high-quality answers

Related Work

Main Research Directions

Traditional RAG Systems: Primarily text-based, relying on a single retrieval method such as BM25 or FAISS
Hybrid RAG Frameworks: Combine knowledge graphs with vector retrieval, but their KGs are primarily text-centric
Multimodal RAG: Systems such as Kosmos-1 and MM-ReAct handle multiple modalities but mostly operate in closed settings
Knowledge Graph-Enhanced RAG: Improves retrieval diversity through KGs but lacks visual encoding modules

Advantages of This Work

Compared to existing work, MAHA offers the following advantages:

A KG architecture designed specifically for modality awareness, presented as the first of its kind
Explicitly models cross-modal semantic relationships
Provides fine-grained modality-aware retrieval control
Achieves complete modality coverage and interpretability

Conclusions and Discussion

Main Conclusions

Technical Breakthrough: MAHA addresses key limitations of traditional RAG systems in multimodal data processing
Performance Improvement: Improves results on multiple benchmark datasets, most notably a 72% relative gain in ROUGE-L over vector-only retrieval
Complete Coverage: The only compared method to reach complete modality coverage (1.00), supporting the effectiveness of cross-modal reasoning
Scalability: Provides a scalable and interpretable retrieval framework

Limitations

KG Construction Complexity: Construction of modality-aware knowledge graphs requires specialized parsing and alignment strategies
Computational Overhead: Hybrid retrieval mechanism may increase computational complexity
Domain Adaptability: Adaptation capabilities in specific domains require further verification
Dynamic Updates: Static KGs face challenges in handling dynamic information updates

Future Directions

Automated KG Construction: Develop more advanced automated methods for handling highly unstructured data
Dynamic Query Routing: Implement intelligent routers that adapt to query complexity in real-time
Larger-Scale Evaluation: Validate the method on larger-scale and more diverse datasets
Real-Time Optimization: Optimize system response time to improve practical applicability

In-Depth Evaluation

Strengths

Strong Innovation: Proposes a modality-aware knowledge graph for RAG, presented as the first of its kind, which fills a gap in multimodal RAG
Complete Methodology: End-to-end solution from data ingestion to final generation
Comprehensive Experiments: Thorough evaluation on multiple datasets, including ablation studies
Metric Innovation: Proposes modality coverage as an important evaluation metric
Significant Results: Achieves substantial improvements across all key metrics

Weaknesses

High Complexity: Relatively complex system architecture that may face deployment challenges
Dataset Scale: Evaluation datasets may have limited scale and diversity
Insufficient Error Analysis: Lacks in-depth analysis of failure cases
Computational Cost: Paper does not thoroughly discuss computational resource requirements and efficiency
Generalization Ability: Generalization capability on unseen domains and data types requires further verification

Impact

Academic Value: Provides new research directions and benchmarks for multimodal information retrieval
Practical Value: Has broad application prospects in document analysis, technical support, education, and other fields
Reproducibility: The paper describes its implementation in detail, which facilitates follow-up research
Inspirational Value: Modality-aware KG concept may inspire research in other multimodal tasks

Applicable Scenarios

Enterprise Document Analysis: Processing financial reports and technical documents containing diagrams and tables
Academic Research Support: Assisting researchers in extracting information from multimodal academic papers
Educational Assistance: Providing cross-modal knowledge question-answering services to students
Medical Document Processing: Analyzing medical reports containing images and tables
Legal Document Review: Processing complex legal documents and evidence materials

References

The paper cites 32 related references, primarily including:

RAG Fundamentals: Classical retrieval methods such as BM25, FAISS, SBERT
Multimodal Models: CLIP, Kosmos-1, MM-ReAct, etc.
Knowledge Graph Methods: Various KG-enhanced RAG frameworks
Evaluation Benchmarks: UDA, MRAMG-Bench, REAL-MM-RAG-Bench, etc.

Overall Assessment: This is a high-quality research paper that proposes an innovative solution to the important and challenging problem of multimodal RAG. The MAHA architecture achieves significant technical breakthroughs through modality-aware knowledge graphs and hybrid retrieval strategies, with convincing experimental results. While there remains room for improvement in complexity and generalization capability, this work establishes an important foundation for multimodal information retrieval research and possesses high academic value and practical potential.