Multimodal RAG for Unstructured Data: Leveraging Modality-Aware Knowledge Graphs with Hybrid Retrieval

Rashmi R, Vidyadhar Upadhya
Current Retrieval-Augmented Generation (RAG) systems primarily operate on unimodal textual data, limiting their effectiveness on unstructured multimodal documents. Such documents often combine text, images, tables, equations, and graphs, each contributing unique information. In this work, we present a Modality-Aware Hybrid retrieval Architecture (MAHA), designed specifically for multimodal question answering with reasoning through a modality-aware knowledge graph. MAHA integrates dense vector retrieval with structured graph traversal, where the knowledge graph encodes cross-modal semantics and relationships. This design enables both semantically rich and context-aware retrieval across diverse modalities. Evaluations on multiple benchmark datasets demonstrate that MAHA substantially outperforms baseline methods, achieving a ROUGE-L score of 0.486, providing complete modality coverage. These results highlight MAHA's ability to combine embeddings with explicit document structure, enabling effective multimodal retrieval. Our work establishes a scalable and interpretable retrieval framework that advances RAG systems by enabling modality-aware reasoning over unstructured multimodal data.

Basic Information

  • Paper ID: 2510.14592
  • Title: Multimodal RAG for Unstructured Data: Leveraging Modality-Aware Knowledge Graphs with Hybrid Retrieval
  • Authors: Rashmi R (National Institute of Technology Karnataka), Vidyadhar Upadhya (National Institute of Technology Karnataka)
  • Classification: cs.LG (Machine Learning), cs.IR (Information Retrieval)
  • Publication Date: October 16, 2025
  • Paper Link: https://arxiv.org/abs/2510.14592v1

Abstract

Current Retrieval-Augmented Generation (RAG) systems primarily operate on unimodal text data, which limits their effectiveness on unstructured multimodal documents that mix text, images, tables, equations, and diagrams. This paper proposes the Modality-Aware Hybrid retrieval Architecture (MAHA), designed for multimodal question answering with reasoning over a modality-aware knowledge graph. MAHA combines dense vector retrieval with structured graph traversal, where the knowledge graph encodes cross-modal semantics and relationships. This design enables semantically rich and context-aware retrieval across different modalities. Evaluation on multiple benchmark datasets shows that MAHA substantially outperforms baseline methods, achieving a ROUGE-L score of 0.486 with complete modality coverage.

Research Background and Motivation

Problem Definition

Existing RAG systems face the following core challenges:

  1. Unimodal Limitations: Traditional RAG systems primarily handle text data and cannot effectively process complex documents containing multimodal content such as images, tables, and equations
  2. Missing Cross-Modal Relationships: Lack of capability to understand and leverage complex relationships between different modalities, such as correspondences between textual descriptions and tabular data
  3. Insufficient Structured Reasoning: Existing methods struggle to model complex interdependencies among multimodal components

Research Significance

In the data-rich era, vast amounts of information exist in unstructured multimodal formats, including PDF documents, scanned files, and technical documents with complex tables and diagrams. Effective retrieval and synthesis of this information is crucial for decision-making across various domains.

Limitations of Existing Approaches

  1. Insufficient Cross-Modal Alignment: Lack of mechanisms to semantically link different modality content
  2. Static Retrieval Process: Unable to adapt to dynamic or evolving information spaces
  3. Shallow Knowledge Graph Integration: Knowledge graphs in existing hybrid RAG frameworks are primarily text-centric, lacking explicit support for multimodal inputs
  4. Absence of Customized Strategies: No unified strategies specifically designed to handle text, images, tables, diagrams, and equations jointly

Core Contributions

  1. Proposes MAHA Architecture: The first modality-aware hybrid retrieval architecture specifically designed for unstructured multimodal data
  2. Modality-Aware Knowledge Graph: Extends existing text-centric KG schemas by introducing cross-modal semantic relationships
  3. Hybrid Retrieval Strategy: Innovatively fuses dense vector retrieval and structured graph traversal
  4. Comprehensive Experimental Validation: Achieves significant performance improvements on multiple benchmark datasets with complete modality coverage
  5. Novel Evaluation Metrics: Proposes modality coverage metrics to quantify the system's cross-modal retrieval capabilities

Methodology Details

Task Definition

Given a collection of unstructured documents D containing multiple modalities (text, images, tables, equations, diagrams) and a user query q, the system must:

  1. Retrieve relevant multimodal evidence fragments
  2. Synthesize cross-modal information to generate accurate and complete answers
  3. Maintain interpretability and contextual consistency

Model Architecture

1. Document Ingestion and Embedding Module

  • Multimodal Parsing: Segments documents into semantically meaningful chunks, including text, tables, diagrams, images, and equations
  • Heterogeneous Encoding:
    • Text: Transformed into embeddings using OpenAI text-embedding-3-small
    • Tables: Converted to HTML format
    • Equations: Encoded as structured equations (LaTeX)
    • Visual Elements: Encoded using CLIP model and converted to base64 format
  • Summary Generation: Generates textual summaries for non-textual data and embeds them
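
The ingestion step above can be pictured as one record per chunk. The following is a minimal sketch only: the `Chunk` class, its field names, and the `text_to_embed` helper are hypothetical illustrations, not the authors' code. It assumes text chunks are embedded directly while non-text chunks are embedded through their generated summaries, as the list describes.

```python
from dataclasses import dataclass
from typing import Optional

# Modalities named in the summary; the set itself is an illustrative assumption.
MODALITIES = {"text", "table", "equation", "image", "diagram"}


@dataclass
class Chunk:
    """One parsed document chunk (hypothetical structure)."""
    chunk_id: str
    modality: str
    raw: str                       # plain text, HTML table, LaTeX source, or base64 image data
    summary: Optional[str] = None  # textual summary generated for non-text chunks

    def __post_init__(self) -> None:
        if self.modality not in MODALITIES:
            raise ValueError(f"unknown modality: {self.modality}")

    def text_to_embed(self) -> str:
        """Return the string that would be sent to the text embedding model."""
        if self.modality == "text":
            return self.raw
        if self.summary is None:
            raise ValueError(f"chunk {self.chunk_id} needs a summary before embedding")
        return self.summary


paragraph = Chunk("c1", "text", "Revenue grew 5% year over year.")
table = Chunk("c2", "table", "<table><tr><td>Q1</td><td>105</td></tr></table>",
              summary="Quarterly revenue table")
print(paragraph.text_to_embed())
print(table.text_to_embed())
```

Keeping the raw payload next to the summary means the generator can still be given the original table or equation after retrieval has matched on the summary.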

2. Vector Store Indexing and Knowledge Graph Construction

  • Vector Store: Indexes multimodal representations supporting fast similarity-based retrieval
  • Modality-Aware KG:
    • Nodes: Represent entities of different modalities (text, equations, images, tables)
    • Edges: Capture semantic relationships such as "NEXT-TEXT", "NEXT-TABLE", "HAS-IMAGE", "HAS-FORMULA", etc.
    • Construction Process: Schema-driven, including named entity linking, coreference resolution, and relationship reasoning
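
To make the node and edge description concrete, here is a small in-memory sketch of a modality-aware graph. The `ModalityKG` class and its methods are hypothetical; only the edge labels ("NEXT-TEXT", "NEXT-TABLE", "HAS-IMAGE", "HAS-FORMULA") come from the description above. It does not attempt entity linking or coreference resolution.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


class ModalityKG:
    """Toy directed graph whose nodes carry a modality label (illustrative only)."""

    def __init__(self) -> None:
        self.nodes: Dict[str, str] = {}  # node_id -> modality
        self.edges: Dict[str, List[Tuple[str, str]]] = defaultdict(list)  # src -> [(relation, dst)]

    def add_node(self, node_id: str, modality: str) -> None:
        self.nodes[node_id] = modality

    def add_edge(self, src: str, relation: str, dst: str) -> None:
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges[src].append((relation, dst))

    def neighbors(self, node_id: str, relations: Optional[Iterable[str]] = None) -> List[str]:
        """Return targets of outgoing edges, optionally filtered by relation label."""
        allowed = None if relations is None else set(relations)
        return [dst for rel, dst in self.edges.get(node_id, [])
                if allowed is None or rel in allowed]


kg = ModalityKG()
kg.add_node("para1", "text")
kg.add_node("para2", "text")
kg.add_node("tab1", "table")
kg.add_node("eq1", "equation")
kg.add_edge("para1", "NEXT-TEXT", "para2")
kg.add_edge("para1", "NEXT-TABLE", "tab1")
kg.add_edge("para2", "HAS-FORMULA", "eq1")

print(kg.neighbors("para1"))
print(kg.neighbors("para1", relations=["NEXT-TABLE"]))
```

Filtering by relation label is what would let a retriever ask specifically for the table or formula attached to a passage.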

3. Hybrid Retrieval Mechanism

  • Vector Retrieval: Encodes queries into embeddings and matches semantically similar content chunks
  • Graph Traversal: Retrieves supporting information based on entity relationships and graph traversal
  • Fusion Strategy: Balances semantic similarity and structural traversal to ensure relevance and coverage
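
The summary does not state the exact fusion rule, so the sketch below assumes one simple possibility: take the top-k chunks by cosine similarity as seeds, then add their graph neighbors with a discounted score. The function names, the `hop_decay` parameter, and the toy vectors are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_retrieve(query_vec: Sequence[float],
                    chunk_vecs: Dict[str, Sequence[float]],
                    graph: Dict[str, List[str]],
                    k: int = 2,
                    hop_decay: float = 0.5) -> List[Tuple[str, float]]:
    """Seed with vector search, then expand one hop along graph edges."""
    sims = {cid: cosine(query_vec, vec) for cid, vec in chunk_vecs.items()}
    seeds = sorted(sims, key=sims.get, reverse=True)[:k]
    scores = {cid: sims[cid] for cid in seeds}
    for seed in seeds:
        for neighbor in graph.get(seed, []):
            # A neighbor inherits a discounted share of its seed's similarity.
            inherited = hop_decay * sims[seed]
            scores[neighbor] = max(scores.get(neighbor, 0.0), inherited)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


chunk_vecs = {
    "text1": [1.0, 0.0],   # matches the query
    "text2": [0.0, 1.0],   # unrelated passage
    "table1": [0.0, 1.0],  # table whose summary embedding does not match the query
}
graph = {"text1": ["table1"]}  # text1 is linked to table1 in the graph
results = hybrid_retrieve([1.0, 0.0], chunk_vecs, graph, k=1)
print(results)
```

In this toy case the table is invisible to vector search alone but is pulled in through its link to the matching passage, which is the kind of complementarity the hybrid design is meant to provide.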

4. Context-Aware Generation

Utilizes large language models to synthesize retrieved multimodal information and generate coherent, accurate, and interpretable answers.
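
As a rough illustration of the generation step, retrieved evidence can be serialized into a single prompt with each item tagged by modality. The `build_prompt` function and the prompt wording are hypothetical; the summary does not describe the prompt actually used, and no model call is made here.

```python
from typing import List, Tuple


def build_prompt(question: str, evidence: List[Tuple[str, str]]) -> str:
    """Format (modality, content) evidence pairs and the question into one prompt string."""
    lines = ["Answer the question using only the evidence below.", ""]
    for index, (modality, content) in enumerate(evidence, start=1):
        lines.append(f"[{index}] ({modality}) {content}")
    lines += ["", f"Question: {question}", "Answer:"]
    return "\n".join(lines)


prompt = build_prompt(
    "What was Q1 revenue?",
    [("text", "Revenue is reported in millions of USD."),
     ("table", "<table><tr><td>Q1</td><td>105</td></tr></table>")],
)
print(prompt)
```

Numbering the evidence items gives the model something to cite, which is one simple way to keep answers traceable to retrieved chunks.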

Technical Innovations

  1. Cross-Modal Relationship Modeling: First to introduce explicit cross-modal semantic relationships in RAG systems
  2. Hybrid Retrieval Fusion: Innovatively combines advantages of vector similarity and graph structure traversal
  3. Modality-Aware Indexing: Achieves seamless integration of semantic and structured retrieval through unified indexing
  4. Enhanced Interpretability: Graph metadata provides interpretability for retrieval decisions

Experimental Setup

Datasets

  1. UDA Benchmark Suite:
    • Finance Domain: Contains financial reports with complex layouts, testing numerical reasoning capabilities
    • Academic Domain: From academic papers, testing complex technical content reasoning
    • World Knowledge: Wikipedia pages, evaluating performance across diverse topics
  2. MRAMG-Bench: From web, academic, and lifestyle domains, specifically designed to test multimodal reasoning capabilities
  3. REAL-MM-RAG-Bench: High-quality finance domain benchmark containing text, tables, and images

Evaluation Metrics

Retrieval Metrics

  • Recall@K: Proportion of queries where correct document chunks appear in top K results
  • MRR (Mean Reciprocal Rank): Mean of reciprocal ranks of the first correct answer
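
Both retrieval metrics are straightforward to compute from ranked result lists. The sketch below follows the definitions given above; the function names and the toy data are illustrative, and the paper's own evaluation code may differ in details such as tie handling.

```python
from typing import List, Sequence, Set


def recall_at_k(ranked_lists: Sequence[List[str]],
                relevant_sets: Sequence[Set[str]],
                k: int) -> float:
    """Fraction of queries with at least one relevant chunk in the top k results."""
    hits = sum(
        1 for ranked, relevant in zip(ranked_lists, relevant_sets)
        if any(chunk in relevant for chunk in ranked[:k])
    )
    return hits / len(ranked_lists)


def mean_reciprocal_rank(ranked_lists: Sequence[List[str]],
                         relevant_sets: Sequence[Set[str]]) -> float:
    """Mean of 1/rank of the first relevant chunk; a query with no hit contributes 0."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, chunk in enumerate(ranked, start=1):
            if chunk in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)


ranked = [["a", "b", "c"], ["x", "y", "z"]]
relevant = [{"b"}, {"q"}]
print(recall_at_k(ranked, relevant, k=3))      # 0.5: only the first query has a hit
print(mean_reciprocal_rank(ranked, relevant))  # 0.25: (1/2 + 0) / 2
```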

Generation Metrics

  • ROUGE-L: Overlap of longest common subsequence between generated and reference answers
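
ROUGE-L can be sketched from the longest common subsequence (LCS) of the two token sequences. This version assumes lowercased whitespace tokenization and reports the F1 form of the score; the summary does not say which tokenization or which ROUGE-L variant the paper uses, so treat both as assumptions.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for token_a in a:
        curr = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 between a generated answer and a reference answer."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(cand_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("the cat sat on the mat", "the cat is on the mat")
print(round(score, 3))  # LCS is "the cat on the mat" (5 of 6 tokens on each side)
```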

Multimodal Metrics

  • Modality Coverage: Newly proposed metric, calculated as:
$$\mathrm{Coverage}(q) = \frac{|M_{\mathrm{gt}}(q) \cap M_{\mathrm{ret}}(q)|}{|M_{\mathrm{gt}}(q)|}$$

where $M_{\mathrm{gt}}(q)$ is the set of modalities required in the reference answer and $M_{\mathrm{ret}}(q)$ is the set of modalities retrieved by the system.
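
The coverage formula maps directly onto set operations. The function below is a minimal sketch of that definition; the name `modality_coverage` and the choice to raise an error on an empty required set are assumptions, since the summary does not say how that case is handled.

```python
from typing import Iterable


def modality_coverage(required: Iterable[str], retrieved: Iterable[str]) -> float:
    """Share of the required modalities that appear among the retrieved ones."""
    required_set = set(required)
    retrieved_set = set(retrieved)
    if not required_set:
        raise ValueError("the reference answer must require at least one modality")
    return len(required_set & retrieved_set) / len(required_set)


print(modality_coverage({"text", "table"}, {"text", "image"}))           # 0.5
print(modality_coverage({"text", "table"}, {"text", "table", "image"}))  # 1.0
```

Extra retrieved modalities do not lower the score, so the metric measures whether required evidence types were found, not retrieval precision.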

Comparison Methods

  1. BM25: Frequency-based sparse retriever
  2. FAISS + SBERT: Dense vector retriever
  3. CLIP: Image-only retriever
  4. Hybrid (BM25 + FAISS): Traditional hybrid method
  5. Graph Traversal (KG Retriever): Pure graph traversal method
  6. Existing Multimodal RAG Frameworks: HybridRAG, HybGRAG, KG-Guided RAG, etc.

Experimental Results

Main Results

Comparison with Baseline Methods

MAHA significantly outperforms baseline methods across all metrics:

  • ROUGE-L: 0.486 (72% improvement over vector retrieval)
  • Recall@3: 0.79-0.81
  • MRR: 0.74 (19-21% improvement over baselines)
  • Modality Coverage: 1.00 (complete coverage)

Comparison with Existing Multimodal RAG Frameworks

  • MAHA is the only method achieving complete modality coverage (1.00)
  • Other methods achieve modality coverage of only 0.00-0.39
  • Achieves highest scores across all performance metrics

Ablation Study

Validates component contributions through comparison of three configurations:

  1. Vector-Only: ROUGE-L 0.282, Recall@3 0.70, MRR 0.61
  2. Graph-Only: ROUGE-L 0.337, Recall@3 0.68, MRR 0.62
  3. MAHA: ROUGE-L 0.486, Recall@3 0.79, MRR 0.74

Results demonstrate:

  • Vector retrieval captures local semantics but lacks structural cues
  • Graph traversal provides structural relationships but struggles to independently discover rich evidence
  • The hybrid approach scores highest of the three configurations on every reported metric, indicating that the two retrieval methods are complementary
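
The relative gains quoted in this summary can be checked against the ablation figures. The short script below recomputes them; the `relative_improvement` helper is illustrative, and reading the "19-21%" MRR figure as a comparison against the Vector-Only and Graph-Only configurations is an assumption that happens to match the numbers.

```python
def relative_improvement(new: float, base: float) -> float:
    """Percentage gain of `new` over `base`."""
    return (new - base) / base * 100.0


# Ablation figures as reported in the summary.
ablation = {
    "Vector-Only": {"rouge_l": 0.282, "recall@3": 0.70, "mrr": 0.61},
    "Graph-Only": {"rouge_l": 0.337, "recall@3": 0.68, "mrr": 0.62},
    "MAHA": {"rouge_l": 0.486, "recall@3": 0.79, "mrr": 0.74},
}

maha = ablation["MAHA"]
for name in ("Vector-Only", "Graph-Only"):
    base = ablation[name]
    gains = {metric: round(relative_improvement(maha[metric], base[metric]), 1)
             for metric in maha}
    print(name, gains)
```

Under this reading, the ROUGE-L gain over Vector-Only comes to about 72%, and the MRR gains come to about 21% and 19% over Vector-Only and Graph-Only respectively.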

Experimental Findings

  1. Synergistic Effects: Combination of structural reasoning and semantic similarity produces significant synergistic effects
  2. Importance of Cross-Modal Links: Explicit modality-aware links enable the system to retrieve multimodal evidence that would otherwise be missed
  3. Value of Complete Coverage: Achieving complete modality coverage is crucial for generating high-quality answers

Related Work

Main Research Directions

  1. Traditional RAG Systems: Primarily text-based, using single retrieval methods such as BM25 and FAISS
  2. Hybrid RAG Frameworks: Combine knowledge graphs with vector retrieval, but KGs are primarily text-centric
  3. Multimodal RAG: Such as Kosmos-1 and MM-ReAct, but mostly operate in closed settings
  4. Knowledge Graph-Enhanced RAG: Improve retrieval diversity through KGs, but lack visual encoding modules

Advantages of This Work

Compared to existing work, MAHA offers the following advantages:

  1. First specifically designed modality-aware KG architecture
  2. Explicitly models cross-modal semantic relationships
  3. Provides fine-grained modality-aware retrieval control
  4. Achieves complete modality coverage and interpretability

Conclusions and Discussion

Main Conclusions

  1. Technical Breakthrough: MAHA successfully addresses limitations of traditional RAG systems in multimodal data processing
  2. Performance Improvement: Achieves significant performance improvements on multiple benchmark datasets, notably a 72% ROUGE-L improvement over vector-only retrieval
  3. Complete Coverage: First to achieve complete modality coverage, demonstrating effectiveness of cross-modal reasoning
  4. Scalability: Provides a scalable and interpretable retrieval framework

Limitations

  1. KG Construction Complexity: Construction of modality-aware knowledge graphs requires specialized parsing and alignment strategies
  2. Computational Overhead: Hybrid retrieval mechanism may increase computational complexity
  3. Domain Adaptability: Adaptation capabilities in specific domains require further verification
  4. Dynamic Updates: Static KGs face challenges in handling dynamic information updates

Future Directions

  1. Automated KG Construction: Develop more advanced automated methods for handling highly unstructured data
  2. Dynamic Query Routing: Implement intelligent routers that adapt to query complexity in real-time
  3. Larger-Scale Evaluation: Validate the method on larger-scale and more diverse datasets
  4. Real-Time Optimization: Optimize system response time to improve practical applicability

In-Depth Evaluation

Strengths

  1. Strong Innovation: First to propose the concept of modality-aware knowledge graphs, filling an important gap in multimodal RAG
  2. Complete Methodology: End-to-end solution from data ingestion to final generation
  3. Comprehensive Experiments: Thorough evaluation on multiple datasets, including ablation studies
  4. Metric Innovation: Proposes modality coverage as an important evaluation metric
  5. Significant Results: Achieves substantial improvements across all key metrics

Weaknesses

  1. High Complexity: Relatively complex system architecture that may face deployment challenges
  2. Dataset Scale: Evaluation datasets may have limited scale and diversity
  3. Insufficient Error Analysis: Lacks in-depth analysis of failure cases
  4. Computational Cost: Paper does not thoroughly discuss computational resource requirements and efficiency
  5. Generalization Ability: Generalization capability on unseen domains and data types requires further verification

Impact

  1. Academic Value: Provides new research directions and benchmarks for multimodal information retrieval
  2. Practical Value: Has broad application prospects in document analysis, technical support, education, and other fields
  3. Reproducibility: The paper describes its implementation in detail, which facilitates follow-up research
  4. Inspirational Value: Modality-aware KG concept may inspire research in other multimodal tasks

Applicable Scenarios

  1. Enterprise Document Analysis: Processing financial reports and technical documents containing diagrams and tables
  2. Academic Research Support: Assisting researchers in extracting information from multimodal academic papers
  3. Educational Assistance: Providing cross-modal knowledge question-answering services to students
  4. Medical Document Processing: Analyzing medical reports containing images and tables
  5. Legal Document Review: Processing complex legal documents and evidence materials

References

The paper cites 32 related references, primarily including:

  • RAG Fundamentals: Classical retrieval methods such as BM25, FAISS, SBERT
  • Multimodal Models: CLIP, Kosmos-1, MM-ReAct, etc.
  • Knowledge Graph Methods: Various KG-enhanced RAG frameworks
  • Evaluation Benchmarks: UDA, MRAMG-Bench, REAL-MM-RAG-Bench, etc.

Overall Assessment: This is a high-quality research paper that proposes an innovative solution to the important and challenging problem of multimodal RAG. The MAHA architecture achieves significant technical breakthroughs through modality-aware knowledge graphs and hybrid retrieval strategies, with convincing experimental results. While there remains room for improvement in complexity and generalization capability, this work establishes an important foundation for multimodal information retrieval research and possesses high academic value and practical potential.