DSM: Constructing a Diverse Semantic Map for 3D Visual Grounding

Xie, Liang, Li et al.
Effective scene representation is critical for the visual grounding ability of representations, yet existing methods for 3D Visual Grounding are often constrained. They either only focus on geometric and visual cues, or, like traditional 3D scene graphs, lack the multi-dimensional attributes needed for complex reasoning. To bridge this gap, we introduce the Diverse Semantic Map (DSM) framework, a novel scene representation framework that enriches robust geometric models with a spectrum of VLM-derived semantics, including appearance, physical properties, and affordances. The DSM is first constructed online by fusing multi-view observations within a temporal sliding window, creating a persistent and comprehensive world model. Building on this foundation, we propose DSM-Grounding, a new paradigm that shifts grounding from free-form VLM queries to a structured reasoning process over the semantic-rich map, markedly improving accuracy and interpretability. Extensive evaluations validate our approach's superiority. On the ScanRefer benchmark, DSM-Grounding achieves a state-of-the-art 59.06% overall accuracy of IoU@0.5, surpassing others by 10%. In semantic segmentation, our DSM attains a 67.93% F-mIoU, outperforming all baselines, including privileged ones. Furthermore, successful deployment on physical robots for complex navigation and grasping tasks confirms the framework's practical utility in real-world scenarios.

Basic Information

  • Paper ID: 2504.08307
  • Title: DSM: Constructing a Diverse Semantic Map for 3D Visual Grounding
  • Authors: Qinghongbing Xie, Zijian Liang, Fuhao Li, Long Zeng (Tsinghua University Shenzhen International Graduate School)
  • Classification: cs.CV cs.RO
  • Publication Date/Venue: arXiv 2025 (Under Submission)
  • Paper Link: https://arxiv.org/abs/2504.08307
  • Project Homepage: https://binicey.github.io/DSM/

Abstract

Effective scene representation is crucial for visual grounding capabilities; however, existing 3D visual grounding methods have significant limitations. They either focus solely on geometric and visual cues or, like traditional 3D scene graphs, lack the multidimensional attributes necessary for complex reasoning. To address this gap, this paper introduces the Diverse Semantic Map (DSM) framework, a novel scene representation approach that enriches robust geometric models with VLM-derived semantics encompassing appearance, physical properties, and functionality. DSM is constructed online by fusing multi-view observations within temporal sliding windows, creating persistent and comprehensive world models. Building upon this foundation, we propose DSM-Grounding, a novel paradigm that transforms grounding from free-form VLM queries into structured reasoning processes on semantically-rich maps, significantly improving accuracy and interpretability.

Research Background and Motivation

Problems to Address

Existing 3D visual grounding methods face two primary limitations:

  1. Insufficient Semantic Representation: Most methods focus only on geometric and visual cues, neglecting intrinsic object attributes and contextual interdependencies
  2. Limited Reasoning Capabilities: Traditional 3D scene graphs can only capture simple semantics and struggle to support complex reasoning by large models in intricate environments

Problem Significance

For applications such as service robots, merely identifying objects is insufficient; understanding multidimensional object attributes (e.g., color, freshness, weight, location) and their complex relationships is critical for executing sophisticated tasks.

Limitations of Existing Methods

  1. Geometry-Oriented Approaches: Methods such as view selection optimization focus primarily on geometric and visual features and lack semantic understanding
  2. Traditional 3D Scene Graphs: Concentrate on simple semantics and spatial relationships, lacking fine-grained multidimensional attributes
  3. Direct VLM Queries: Perform poorly in complex spatial and relational reasoning, constrained by input format limitations

Research Motivation

The goal is to construct a scene representation that is both expressive (encoding rich information) and compact (ensuring cross-platform adaptability), so that it can support complex multidimensional reasoning.

Core Contributions

  1. Proposes DSM Framework: A novel framework capable of supporting complex multidimensional scene representation, integrating semantic understanding with precise localization
  2. Develops Temporal Window Mapping Method: An online construction method integrating geometric and semantic awareness for building DSM components with rich semantics
  3. Proposes DSM-Grounding: A novel 3D grounding method leveraging DSM for deeper scene reasoning

Methodology Details

Task Definition

  • Input: Continuous stream of RGB-D observations and natural language queries
  • Output: 3D position and bounding box of target objects
  • Constraints: Zero-shot setting without pre-trained category-specific labels

DSM Definition

DSM is defined as a 3D scene graph G=(O,R), where:

  • O: Set of object nodes
  • R: Set of edges representing relationships

Each object node O_i ∈ O contains:

Geometric Representation (O_g^i):

  • 3D point cloud P_i
  • Oriented bounding box B_i

Semantic Representation (O_s^i):

  • Identity N_i: Category labels or names
  • Attributes A_i: Structured VLM-derived descriptions
    • Appearance attributes (a_a): Color, pattern, texture
    • Physical attributes (a_p): Weight, material, surface properties
    • Functional attributes (a_o): Purpose, operation methods
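
The node and graph definitions above can be sketched as plain data classes. This is a minimal illustration, not the authors' implementation: the class and field names (`ObjectNode`, `box_yaw`, `relations_of`, and so on) are hypothetical, attributes are assumed to be free-form key/value text, and the oriented box is simplified to a center, extents, and a single yaw angle.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]


@dataclass
class Attributes:
    """Structured VLM-derived descriptions A_i, stored here as key/value text."""
    appearance: Dict[str, str] = field(default_factory=dict)  # a_a: color, pattern, texture
    physical: Dict[str, str] = field(default_factory=dict)    # a_p: weight, material, surface
    functional: Dict[str, str] = field(default_factory=dict)  # a_o: purpose, operation


@dataclass
class ObjectNode:
    identity: str                 # N_i: category label or name
    points: List[Point]           # P_i: 3D point cloud
    box_center: Point             # B_i: oriented bounding box center
    box_size: Point               # B_i: box extents along its own axes
    box_yaw: float = 0.0          # assumption: orientation reduced to one yaw angle
    attributes: Attributes = field(default_factory=Attributes)


@dataclass
class Relation:
    subject: int                  # index of the source node in DSM.objects
    predicate: str                # e.g. "on", "next to"
    target: int                   # index of the target node in DSM.objects


@dataclass
class DSM:
    objects: List[ObjectNode] = field(default_factory=list)   # O
    relations: List[Relation] = field(default_factory=list)   # R

    def add_object(self, node: ObjectNode) -> int:
        """Append a node and return its index."""
        self.objects.append(node)
        return len(self.objects) - 1

    def relations_of(self, index: int) -> List[Relation]:
        """Return all edges whose subject is the given node."""
        return [r for r in self.relations if r.subject == index]


# Usage: a two-node map with one "on" relation.
dsm = DSM()
shelf = dsm.add_object(ObjectNode("shelf", [], (0.0, 0.0, 1.0), (1.0, 0.3, 0.05)))
apple = dsm.add_object(ObjectNode(
    "apple", [(0.1, 0.0, 1.05)], (0.1, 0.0, 1.05), (0.08, 0.08, 0.08),
    attributes=Attributes(appearance={"color": "red"}, physical={"weight": "light"}),
))
dsm.relations.append(Relation(apple, "on", shelf))
```

Keeping the three attribute groups as separate fields mirrors the a_a / a_p / a_o split, so a later reasoning step can query one group without parsing the others.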

DSM Construction Pipeline

1. Single-View Parsing

For each RGB-D frame:

  • Object Detection and Segmentation: Open-vocabulary detection using YOLO-World, segmentation using SAM2
  • Point Cloud Generation: Back-projection of 2D masks through depth and camera pose information
  • Semantic Extraction: Generation of structured semantic descriptions using VLM and chain-of-thought reasoning
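
The point cloud generation step can be illustrated with a standard pinhole back-projection. The sketch below assumes a pinhole camera with intrinsics `fx, fy, cx, cy`, depth in metres, and a 4x4 row-major camera-to-world matrix; the function name and argument layout are hypothetical and the summary does not specify the authors' exact procedure.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def backproject_mask(
    depth: Sequence[Sequence[float]],
    mask: Sequence[Sequence[bool]],
    fx: float, fy: float, cx: float, cy: float,
    cam_to_world: Sequence[Sequence[float]],
) -> List[Point]:
    """Lift every masked pixel with valid depth into world coordinates."""
    points: List[Point] = []
    for v, (depth_row, mask_row) in enumerate(zip(depth, mask)):
        for u, (d, inside) in enumerate(zip(depth_row, mask_row)):
            if not inside or d <= 0:
                continue  # skip pixels outside the mask or without depth
            # Pinhole model: pixel (u, v) at depth d -> camera-frame point.
            xc = (u - cx) * d / fx
            yc = (v - cy) * d / fy
            zc = d
            # Apply the rigid camera-to-world transform (rotation + translation).
            world = tuple(
                row[0] * xc + row[1] * yc + row[2] * zc + row[3]
                for row in cam_to_world[:3]
            )
            points.append(world)
    return points


# Usage: a 2x2 depth image with a single masked pixel, camera shifted by +1 m in x.
pose = [[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
cloud = backproject_mask(
    depth=[[2.0, 2.0], [2.0, 2.0]],
    mask=[[False, True], [False, False]],
    fx=1.0, fy=1.0, cx=0.0, cy=0.0,
    cam_to_world=pose,
)
```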

2. Multi-View Mapping

Multimodal Data Association: Computing weighted similarity scores

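\[
\begin{aligned}
S   &= s_v + s_g + s_c \\
s_v &= \operatorname{CosSim}\big(f_v^{\hat{p}},\, f_v^{\hat{q}}\big) && \text{(visual similarity)} \\
s_g &= \operatorname{IoU}\big(\mathrm{bbox}_p,\, \mathrm{bbox}_q\big) && \text{(geometric similarity)} \\
s_c &= \operatorname{CosSim}\big(f_s^{\hat{p}},\, f_s^{\hat{q}}\big) && \text{(semantic similarity)}
\end{aligned}
\]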
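
A small sketch of how such an association score might be computed is given below. It sums the three terms with unit weights, as in the formula as written here. Two simplifications are assumptions of this example rather than details from the paper: boxes are treated as axis-aligned (the DSM stores oriented boxes), and features are plain Python lists. All function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Box = Tuple[Sequence[float], Sequence[float]]  # (min_xyz, max_xyz)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # undefined for zero vectors; treat as no similarity
    return dot / (norm_a * norm_b)


def aabb_iou(box_p: Box, box_q: Box) -> float:
    """3D IoU of two axis-aligned boxes (simplification of oriented-box IoU)."""
    intersection = 1.0
    for lo_p, hi_p, lo_q, hi_q in zip(box_p[0], box_p[1], box_q[0], box_q[1]):
        overlap = min(hi_p, hi_q) - max(lo_p, lo_q)
        if overlap <= 0:
            return 0.0  # boxes are disjoint along this axis
        intersection *= overlap

    def volume(box: Box) -> float:
        vol = 1.0
        for lo, hi in zip(box[0], box[1]):
            vol *= hi - lo
        return vol

    union = volume(box_p) + volume(box_q) - intersection
    return intersection / union if union > 0 else 0.0


def association_score(vis_p, vis_q, box_p, box_q, sem_p, sem_q) -> float:
    s_v = cosine_similarity(vis_p, vis_q)   # visual similarity
    s_g = aabb_iou(box_p, box_q)            # geometric similarity
    s_c = cosine_similarity(sem_p, sem_q)   # semantic similarity
    return s_v + s_g + s_c


# Usage: two half-overlapping boxes with identical features.
box_a = ((0.0, 0.0, 0.0), (2.0, 2.0, 2.0))
box_b = ((1.0, 0.0, 0.0), (3.0, 2.0, 2.0))
score = association_score([1.0, 0.0], [1.0, 0.0], box_a, box_b, [0.0, 1.0], [0.0, 1.0])
```

In the example, the two cosine terms are each 1 and the boxes share one third of their union, so the score is roughly 2.33. A detection would then be merged into an existing node or spawned as a new one by comparing such scores against thresholds.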

Geometric Sliding Window Method:

  • Construct view frustum for each frame
  • Aggregate recent point cloud observations
  • Apply spatial voting scheme to filter noise and complete shapes
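
One way to picture the sliding-window voting step is a voxel vote over the most recent frames: a voxel is kept only if enough frames in the window observed it. The sketch below is an assumption-laden illustration, not the paper's algorithm. It omits view-frustum construction entirely, and `window_size`, `voxel_size`, and `min_votes` are hypothetical parameters.

```python
import math
from collections import Counter, deque
from typing import Iterable, Set, Tuple

Voxel = Tuple[int, int, int]


class SlidingWindowVoter:
    """Keeps the last `window_size` frames and votes on voxel occupancy."""

    def __init__(self, window_size: int = 5, voxel_size: float = 0.05, min_votes: int = 2):
        self.frames = deque(maxlen=window_size)  # oldest frame drops out automatically
        self.voxel_size = voxel_size
        self.min_votes = min_votes

    def _voxelize(self, points: Iterable[Tuple[float, float, float]]) -> Set[Voxel]:
        s = self.voxel_size
        return {(math.floor(x / s), math.floor(y / s), math.floor(z / s))
                for x, y, z in points}

    def add_frame(self, points: Iterable[Tuple[float, float, float]]) -> None:
        # Each frame contributes at most one vote per voxel.
        self.frames.append(self._voxelize(points))

    def stable_voxels(self) -> Set[Voxel]:
        votes = Counter()
        for voxels in self.frames:
            votes.update(voxels)
        return {voxel for voxel, count in votes.items() if count >= self.min_votes}


# Usage: a voxel seen in two of the last three frames survives; one-off noise does not.
voter = SlidingWindowVoter(window_size=3, voxel_size=1.0, min_votes=2)
voter.add_frame([(0.5, 0.5, 0.5)])
voter.add_frame([(0.4, 0.6, 0.2), (5.5, 5.5, 5.5)])
voter.add_frame([(9.1, 9.1, 9.1)])
first = voter.stable_voxels()
voter.add_frame([(5.2, 5.2, 5.2)])  # the oldest frame leaves the window
second = voter.stable_voxels()
```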

DSM-Grounding Method

1. Candidate Retrieval

Parse natural language queries using LLM to identify target entities, anchor entities, and their attributes; retrieve initial candidate sets from DSM through text matching.
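
The retrieval half of this step can be sketched as simple text matching over map nodes. The example assumes the LLM parsing has already produced a target name and attribute words, and that each node is a dict with `id`, `identity`, and `attributes` keys; this layout and the function name are hypothetical.

```python
from typing import Dict, Iterable, List


def retrieve_candidates(
    objects: Iterable[Dict],
    target_name: str,
    target_attributes: Iterable[str] = (),
) -> List[int]:
    """Return ids of nodes whose identity mentions the target, best attribute match first."""
    name = target_name.lower()
    wanted = [attribute.lower() for attribute in target_attributes]
    scored = []
    for obj in objects:
        if name not in obj["identity"].lower():
            continue  # identity does not mention the target entity
        text = " ".join(obj["attributes"]).lower()
        matched = sum(1 for attribute in wanted if attribute in text)
        scored.append((matched, obj["id"]))
    # More matched attributes first; ties broken by id for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [obj_id for _, obj_id in scored]


# Usage: the query "the red apple" parsed into name="apple", attributes=["red"].
nodes = [
    {"id": 0, "identity": "apple", "attributes": ["red", "fresh"]},
    {"id": 1, "identity": "green apple", "attributes": ["green"]},
    {"id": 2, "identity": "shelf", "attributes": ["white"]},
]
candidates = retrieve_candidates(nodes, "apple", ["red"])
```

Substring matching is deliberately permissive here, since later stages are expected to prune false positives.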

2. Latent Relationship Filtering (LRF)

Verify relationship constraints described in queries:

  • Query relationships R stored in DSM
  • Use LLM to score consistency between stored relationships and query relationships
  • Select Top-k candidates, producing refined set O_filtered
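
The filtering loop can be sketched as scoring each candidate's stored relations against the query relations and keeping the top k. In the paper the score comes from an LLM; here `overlap_score` is a hypothetical stand-in that just measures set overlap, and relations are assumed to be `(predicate, anchor)` tuples.

```python
from typing import Callable, Dict, List, Sequence, Tuple

RelationTuple = Tuple[str, str]  # (predicate, anchor name), an assumed encoding


def overlap_score(stored: Sequence[RelationTuple], queried: Sequence[RelationTuple]) -> float:
    """Stand-in for the LLM consistency score: fraction of query relations found in the map."""
    if not queried:
        return 0.0
    return len(set(stored) & set(queried)) / len(queried)


def latent_relationship_filter(
    candidates: Sequence[int],
    stored_relations: Dict[int, List[RelationTuple]],
    query_relations: Sequence[RelationTuple],
    score_fn: Callable[[Sequence[RelationTuple], Sequence[RelationTuple]], float] = overlap_score,
    k: int = 3,
) -> List[int]:
    scored = [
        (score_fn(stored_relations.get(candidate, []), query_relations), candidate)
        for candidate in candidates
    ]
    # Highest consistency first; the stable sort keeps retrieval order among ties.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [candidate for _, candidate in scored[:k]]


# Usage: "the apple on the shelf near the cabinet".
stored = {
    0: [("on", "shelf")],
    1: [("on", "table")],
    2: [("on", "shelf"), ("near", "cabinet")],
}
query = [("on", "shelf"), ("near", "cabinet")]
filtered = latent_relationship_filter([0, 1, 2], stored, query, k=2)
```

Passing the scorer as a callable keeps the selection logic testable without a model in the loop; an LLM-backed scorer could be swapped in with the same signature.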

3. Multi-Level Verification

Render images from three viewpoints for the final candidate set:

  • Object Level: Object fills the frame, providing detailed category and attribute information
  • Location Level: Broader view showing object relationships with adjacent regions
  • Scene Level: Global context information containing nearly the entire scene

Final decision:

pred = VLM(I, O_filtered, Q)
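
The three rendering levels can be thought of as three camera distances from the candidate. The sketch below is purely illustrative: it assumes a symmetric field of view, places the camera so that a given extent just fills the frame, and uses a hypothetical `location_margin` factor to widen the view for the location level. None of these parameters come from the paper.

```python
import math
from typing import Dict


def camera_distance(extent: float, fov_deg: float, margin: float = 1.0) -> float:
    """Distance at which a span of `extent * margin` metres fills the field of view."""
    half_fov = math.radians(fov_deg) / 2.0
    return (extent * margin / 2.0) / math.tan(half_fov)


def three_level_distances(
    object_extent: float,
    scene_extent: float,
    fov_deg: float = 60.0,
    location_margin: float = 3.0,
) -> Dict[str, float]:
    return {
        # Object level: the object itself fills the frame.
        "object": camera_distance(object_extent, fov_deg),
        # Location level: widen the view to include adjacent regions.
        "location": camera_distance(object_extent, fov_deg, margin=location_margin),
        # Scene level: nearly the whole scene is visible.
        "scene": camera_distance(scene_extent, fov_deg),
    }


# Usage: a 0.5 m object in a 6 m room.
levels = three_level_distances(object_extent=0.5, scene_extent=6.0)
```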

Experimental Setup

Datasets

  • ScanRefer: 8 scenes including living rooms, dining rooms, studies, bedrooms, etc.
  • Nr3D/Sr3D: Reporting Overall, Easy, Hard, View-dependent, View-independent metrics
  • AI2-THOR: High-fidelity simulator environment
  • Replica: Large-scale indoor environment dataset

Evaluation Metrics

  • 3D Visual Grounding: Acc@0.25, Acc@0.5 (IoU thresholds)
  • Semantic Segmentation: mAcc (class-mean accuracy), F-mIoU (frequency-weighted mean IoU)
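
As a concrete reading of the grounding metric, Acc@t is the fraction of queries whose predicted box overlaps the ground-truth box with IoU at or above the threshold t. The helper below is a minimal sketch; the function name is hypothetical, and whether the comparison is inclusive at exactly t is an assumption.

```python
from typing import Sequence


def accuracy_at_iou(ious: Sequence[float], threshold: float) -> float:
    """Fraction of predictions whose IoU with the ground truth reaches the threshold."""
    if not ious:
        return 0.0
    hits = sum(1 for iou in ious if iou >= threshold)  # assumption: inclusive comparison
    return hits / len(ious)


# Usage: four queries with per-query IoU between predicted and ground-truth boxes.
per_query_iou = [0.1, 0.3, 0.6, 0.8]
acc_025 = accuracy_at_iou(per_query_iou, 0.25)
acc_05 = accuracy_at_iou(per_query_iou, 0.5)
```

Because Acc@0.5 counts a subset of the hits counted by Acc@0.25, it can never exceed it on the same predictions.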

Implementation Details

  • Detection Model: YOLO-World
  • Segmentation Model: SAM2
  • Encoders: SigLIP (text), DINOv2 (vision)
  • VLM: GPT-4o-mini, Qwen2.5-VL-7B/72B
  • Threshold Settings: t_v=0.4, t_x=0.8, t_g=0.3, T=1.5

Experimental Results

Main Results

3D Semantic Segmentation (Replica Dataset)

Method                       mAcc    F-mIoU
LSeg (Privileged)            33.39   51.54
OpenSeg (Privileged)         41.19   53.74
ConceptFusion (Zero-shot)    31.53   38.70
ConceptGraphs (Zero-shot)    40.63   35.95
Ours                         38.76   67.93

3D Visual Grounding (ScanRefer Dataset)

Best results using Qwen2.5-VL-72B:

  • Overall Acc@0.5: 59.06% (SOTA, surpassing existing methods by ~10%)
  • Multiple Acc@0.5: 53.65% (Outstanding performance in multi-object scenes)

Ablation Study (AI2-THOR Dataset)

LRF    Appearance    Physical    Functional    Overall Acc@0.5
                                               60.00
                                               53.64 (-6.36)
                                               49.55
                                               49.09
                                               48.41

Key Findings:

  1. LRF module contributes most significantly (approximately 6-7 percentage point improvement)
  2. Appearance attributes provide the most important signal
  3. All three semantic attribute types contribute positively

Robotic Experiments

Simulated Environment: Significantly outperforms existing zero-shot methods in AI2-THOR

Real Environment: Successfully deployed on physical robots for:

  • Semantic navigation tasks: "Navigate to the central room next to the computer desk"
  • Semantic grasping tasks: "Grasp the apple on the white shelf on the white cabinet"

Related Work

3D Scene Representation

  • Early Methods: Kimera and others focused on metric-semantic mapping
  • Open-Vocabulary Mapping: ConceptFusion creates language-grounded 3D maps
  • 3D Scene Graphs: SceneGraphFusion, Hydra construct hierarchical representations
  • Our Advantage: DSM provides richer multidimensional attribute representation

3D Visual Grounding

  • Open-Vocabulary Methods: OpenScene, NuGrounding achieve grounding through feature alignment
  • VLM Methods: SeeGround, ScanReason employ rendering-prompting strategies
  • Our Innovation: Transition from direct VLM queries to structured reasoning processes

Conclusions and Discussion

Main Conclusions

  1. DSM framework successfully combines geometric precision with semantic richness
  2. Multidimensional semantic attributes (appearance, physical, functional) significantly enhance grounding performance
  3. Structured reasoning paradigm outperforms direct VLM query methods
  4. Method demonstrates strong performance in both simulated and real environments

Limitations

  1. Upstream Module Dependency: Performance affected by object detection and segmentation quality
  2. Computational Latency: Inference time of large-scale VLMs is considerable
  3. Environmental Adaptability: Primarily tested in indoor environments; outdoor applicability remains unknown

Future Directions

  1. Explore more efficient models to enhance real-time performance
  2. Investigate alternative 3D representation methods to improve robustness
  3. Extend to more complex outdoor environments

In-Depth Evaluation

Strengths

  1. Strong Methodological Innovation: First systematic integration of multidimensional semantic attributes into 3D scene representation
  2. Complete Technical Solution: End-to-end solution from scene construction to grounding reasoning
  3. Comprehensive Experiments: Covers multiple datasets, ablation studies, and real robot validation
  4. Significant Performance Gains: Achieves SOTA on multiple benchmarks, particularly notable F-mIoU improvement

Weaknesses

  1. Computational Complexity: Requires multiple VLM calls, potentially impacting real-time applications
  2. Evaluation Limitations: Primarily evaluated in indoor scenes, lacking large-scale outdoor validation
  3. Strong Dependencies: Highly dependent on VLM quality, potentially subject to model biases
  4. Memory Requirements: Storing rich semantic information may impose memory constraints

Impact

  1. Academic Contribution: Provides new research direction for 3D scene understanding
  2. Practical Value: Directly applicable to service robots and similar real-world applications
  3. Reproducibility: Provides detailed implementation details and project homepage

Applicable Scenarios

  1. Indoor Service Robots: Navigation and manipulation in home and office environments
  2. Augmented Reality Applications: AR systems requiring rich semantic understanding
  3. Intelligent Surveillance: Semantic scene understanding and anomaly detection
  4. Assistive Technology: Environmental description for visually impaired individuals

References

The paper cites 40 relevant references covering multiple domains including 3D scene representation, visual grounding, and robotics, providing readers with comprehensive background knowledge.


Overall Assessment: This is a high-quality research paper presenting an innovative solution for 3D visual grounding. The DSM framework combines geometric precision with semantic richness, giving robots a stronger basis for understanding and interacting with complex environments. Its main limitations are the inference latency of repeated VLM calls and an evaluation restricted to indoor scenes; within that scope, the benchmark results, ablations, and real-robot deployment support its claims.