2025-11-18T12:22:13.890784

DSM: Constructing a Diverse Semantic Map for 3D Visual Grounding

Xie, Liang, Li et al.

Effective scene representation is critical for visual grounding, yet existing methods for 3D Visual Grounding are often constrained. They either focus only on geometric and visual cues or, like traditional 3D scene graphs, lack the multi-dimensional attributes needed for complex reasoning. To bridge this gap, we introduce the Diverse Semantic Map (DSM), a novel scene representation framework that enriches robust geometric models with a spectrum of VLM-derived semantics, including appearance, physical properties, and affordances. The DSM is first constructed online by fusing multi-view observations within a temporal sliding window, creating a persistent and comprehensive world model. Building on this foundation, we propose DSM-Grounding, a new paradigm that shifts grounding from free-form VLM queries to a structured reasoning process over the semantic-rich map, markedly improving accuracy and interpretability. Extensive evaluations validate the approach. On the ScanRefer benchmark, DSM-Grounding achieves a state-of-the-art 59.06% overall accuracy at IoU@0.5, surpassing others by 10%. In semantic segmentation, our DSM attains a 67.93% F-mIoU, outperforming all baselines, including privileged ones. Furthermore, successful deployment on physical robots for complex navigation and grasping tasks confirms the framework's practical utility in real-world scenarios.

Basic Information

Paper ID: 2504.08307
Title: DSM: Constructing a Diverse Semantic Map for 3D Visual Grounding
Authors: Qinghongbing Xie, Zijian Liang, Fuhao Li, Long Zeng (Tsinghua University Shenzhen International Graduate School)
Classification: cs.CV cs.RO
Publication Date/Venue: arXiv 2025 (Under Submission)
Paper Link: https://arxiv.org/abs/2504.08307
Project Homepage: https://binicey.github.io/DSM/

Abstract

Effective scene representation is crucial for visual grounding capabilities; however, existing 3D visual grounding methods have significant limitations. They either focus solely on geometric and visual cues or, like traditional 3D scene graphs, lack the multidimensional attributes necessary for complex reasoning. To address this gap, this paper introduces the Diverse Semantic Map (DSM) framework, a novel scene representation approach that enriches robust geometric models with VLM-derived semantics encompassing appearance, physical properties, and functionality. DSM is constructed online by fusing multi-view observations within temporal sliding windows, creating persistent and comprehensive world models. Building upon this foundation, we propose DSM-Grounding, a novel paradigm that transforms grounding from free-form VLM queries into structured reasoning processes on semantically-rich maps, significantly improving accuracy and interpretability.

Research Background and Motivation

Problems to Address

Existing 3D visual grounding methods face two primary limitations:

Insufficient Semantic Representation: Most methods focus only on geometric and visual cues, neglecting intrinsic object attributes and contextual interdependencies
Limited Reasoning Capabilities: Traditional 3D scene graphs can only capture simple semantics and struggle to support complex reasoning by large models in intricate environments

Problem Significance

For applications such as service robots, merely identifying objects is insufficient; understanding multidimensional object attributes (e.g., color, freshness, weight, location) and their complex relationships is critical for executing sophisticated tasks.

Limitations of Existing Methods

Geometry-Oriented Approaches: Methods such as view selection optimization focus primarily on geometric and visual features and lack semantic understanding
Traditional 3D Scene Graphs: Concentrate on simple semantics and spatial relationships, lacking fine-grained multidimensional attributes
Direct VLM Queries: Perform poorly in complex spatial and relational reasoning, constrained by input format limitations

Research Motivation

Construct a scene representation that is both expressive (encoding rich information) and compact (ensuring cross-platform adaptability), supporting complex multidimensional reasoning.

Core Contributions

Proposes DSM Framework: A novel framework capable of supporting complex multidimensional scene representation, integrating semantic understanding with precise localization
Develops Temporal Window Mapping Method: An online construction method integrating geometric and semantic awareness for building DSM components with rich semantics
Proposes DSM-Grounding: A novel 3D grounding method leveraging DSM for deeper scene reasoning

Methodology Details

Task Definition

Input: Continuous stream of RGB-D observations, natural language queries
Output: 3D position and bounding box of target objects
Constraints: Zero-shot setting without pre-trained category-specific labels

DSM Definition

DSM is defined as a 3D scene graph G=(O,R), where:

O: Set of object nodes
R: Set of edges representing relationships

Each object node O_i ∈ O contains:

Geometric Representation (O_g^i):

3D point cloud P_i
Oriented bounding box B_i

Semantic Representation (O_s^i):

Identity N_i: Category labels or names
Attributes A_i: Structured VLM-derived descriptions
- Appearance attributes (a_a): Color, pattern, texture
- Physical attributes (a_p): Weight, material, surface properties
- Functional attributes (a_o): Purpose, operation methods
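The node and graph definitions above can be pictured as a small data structure. The following is a minimal sketch only: the class names, field names, the yaw-only box orientation, and the triple format for relations are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]


@dataclass
class Attributes:
    """VLM-derived attribute groups A_i (field names are illustrative)."""
    appearance: Dict[str, str] = field(default_factory=dict)  # a_a: color, pattern, texture
    physical: Dict[str, str] = field(default_factory=dict)    # a_p: weight, material, surface
    functional: Dict[str, str] = field(default_factory=dict)  # a_o: purpose, operation


@dataclass
class ObjectNode:
    """One node O_i: geometry (P_i, B_i) plus semantics (N_i, A_i)."""
    identity: str            # N_i
    points: List[Point]      # P_i
    bbox_center: Point       # B_i: center
    bbox_extent: Point       # B_i: side lengths
    bbox_yaw: float = 0.0    # B_i: heading only, a simplification of a full orientation
    attributes: Attributes = field(default_factory=Attributes)


@dataclass
class DSM:
    """Scene graph G = (O, R)."""
    objects: Dict[int, ObjectNode] = field(default_factory=dict)         # O
    relations: List[Tuple[int, str, int]] = field(default_factory=list)  # R: (subject, predicate, object)

    def add_object(self, node: ObjectNode) -> int:
        """Store a node and return its integer id."""
        node_id = len(self.objects)
        self.objects[node_id] = node
        return node_id


dsm = DSM()
apple = dsm.add_object(ObjectNode(
    "apple", [(0.1, 0.2, 0.9)], (0.1, 0.2, 0.9), (0.08, 0.08, 0.08),
    attributes=Attributes(appearance={"color": "red"}),
))
shelf = dsm.add_object(ObjectNode("shelf", [(0.1, 0.2, 0.8)], (0.1, 0.2, 0.8), (0.6, 0.3, 0.02)))
dsm.relations.append((apple, "on", shelf))
```

Keeping attributes as plain string dictionaries keeps the map compact and easy to serialize into an LLM prompt, which fits the stated goal of a representation that is both expressive and compact.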

DSM Construction Pipeline

1. Single-View Parsing

For each RGB-D frame:

Object Detection and Segmentation: Open-vocabulary detection using YoloWorld, segmentation using SAM2
Point Cloud Generation: Back-projection of 2D masks through depth and camera pose information
Semantic Extraction: Generation of structured semantic descriptions using VLM and chain-of-thought reasoning
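The point cloud generation step can be illustrated with a standard pinhole back-projection. This is a sketch under stated assumptions: the function name, the intrinsics parameters (fx, fy, cx, cy), and the 4x4 camera-to-world pose convention are not taken from the paper, and plain Python lists stand in for image arrays.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def backproject_mask(
    depth: Sequence[Sequence[float]],
    mask: Sequence[Sequence[bool]],
    fx: float, fy: float, cx: float, cy: float,
    cam_to_world: Sequence[Sequence[float]],
) -> List[Point]:
    """Lift masked pixels to world-frame 3D points with a pinhole camera model."""
    points: List[Point] = []
    for v, (depth_row, mask_row) in enumerate(zip(depth, mask)):
        for u, (z, inside) in enumerate(zip(depth_row, mask_row)):
            if not inside or z <= 0.0:  # skip background and invalid depth
                continue
            # Pixel (u, v) with depth z -> homogeneous camera-frame coordinates.
            cam = ((u - cx) * z / fx, (v - cy) * z / fy, z, 1.0)
            # Apply the 4x4 camera-to-world pose (top three rows only).
            world = tuple(
                sum(cam_to_world[r][c] * cam[c] for c in range(4)) for r in range(3)
            )
            points.append(world)
    return points


# Toy 2x2 frame: constant 2 m depth, two masked pixels, camera shifted +1 m along x.
depth = [[2.0, 2.0], [2.0, 2.0]]
mask = [[True, False], [False, True]]
pose = [[1, 0, 0, 1.0], [0, 1, 0, 0.0], [0, 0, 1, 0.0], [0, 0, 0, 1.0]]
object_points = backproject_mask(depth, mask, 1.0, 1.0, 0.0, 0.0, pose)
```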

2. Multi-View Mapping

Multimodal Data Association: Computing weighted similarity scores

S = s_v + s_g + s_c
s_v = CosSimilarity(f_vp̂, f_vq̂)  # Visual similarity
s_g = IoU(bbox_p, bbox_q)         # Geometric similarity  
s_c = CosSimilarity(f_sp̂, f_sq̂)  # Semantic similarity
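As a rough illustration of the association score, the sketch below sums the three terms exactly as written above. It makes two simplifying assumptions that should be read as such: boxes are axis-aligned (min corner, max corner) rather than oriented, and the feature vectors are short toy lists instead of real DINOv2 or SigLip embeddings. Function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Box = Tuple[Sequence[float], Sequence[float]]  # (min corner, max corner), axis-aligned


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def aabb_iou(box_p: Box, box_q: Box) -> float:
    """3D IoU of two axis-aligned boxes (a simplification of oriented-box IoU)."""
    intersection = 1.0
    for lo_p, hi_p, lo_q, hi_q in zip(box_p[0], box_p[1], box_q[0], box_q[1]):
        overlap = min(hi_p, hi_q) - max(lo_p, lo_q)
        if overlap <= 0.0:
            return 0.0
        intersection *= overlap
    vol_p = math.prod(hi - lo for lo, hi in zip(box_p[0], box_p[1]))
    vol_q = math.prod(hi - lo for lo, hi in zip(box_q[0], box_q[1]))
    return intersection / (vol_p + vol_q - intersection)


def association_score(vis_p, vis_q, box_p, box_q, sem_p, sem_q) -> float:
    """S = s_v + s_g + s_c for a new detection p and an existing map object q."""
    s_v = cosine_similarity(vis_p, vis_q)  # visual similarity
    s_g = aabb_iou(box_p, box_q)           # geometric similarity
    s_c = cosine_similarity(sem_p, sem_q)  # semantic similarity
    return s_v + s_g + s_c


score = association_score(
    [1.0, 0.0], [1.0, 0.0],
    ((0.0, 0.0, 0.0), (2.0, 2.0, 2.0)), ((1.0, 0.0, 0.0), (3.0, 2.0, 2.0)),
    [0.0, 1.0], [0.0, 1.0],
)
```

In this toy case the two feature pairs are identical and the boxes overlap by one third, so the score lands near 2.33; a detection would presumably be merged into the map object only when such a score clears a threshold.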

Geometric Sliding Window Method:

Construct view frustum for each frame
Aggregate recent point cloud observations
Apply spatial voting scheme to filter noise and complete shapes
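One plausible reading of the aggregation and voting steps is a voxel vote over the most recent frames. The sketch below is an assumption-laden illustration, not the paper's algorithm: it omits the view-frustum test entirely, and the class name, window length, voxel size, and vote threshold are all hypothetical.

```python
import math
from collections import Counter, deque
from typing import Iterable, Set, Tuple

Voxel = Tuple[int, int, int]


class SlidingWindowVoter:
    """Keep voxels that were observed in enough of the last `window` frames."""

    def __init__(self, window: int = 3, voxel: float = 0.05, min_votes: int = 2):
        self.frames = deque(maxlen=window)  # oldest frame is dropped automatically
        self.voxel = voxel
        self.min_votes = min_votes

    def _voxelize(self, points: Iterable[Tuple[float, float, float]]) -> Set[Voxel]:
        # Each frame votes at most once per voxel, however many points fall in it.
        return {tuple(math.floor(c / self.voxel) for c in p) for p in points}

    def add_frame(self, points: Iterable[Tuple[float, float, float]]) -> None:
        self.frames.append(self._voxelize(points))

    def stable_voxels(self) -> Set[Voxel]:
        votes = Counter(v for frame in self.frames for v in frame)
        return {v for v, n in votes.items() if n >= self.min_votes}


voter = SlidingWindowVoter(window=3, voxel=1.0, min_votes=2)
voter.add_frame([(0.2, 0.2, 0.2), (5.5, 0.0, 0.0)])  # second point is a one-off outlier
voter.add_frame([(0.4, 0.1, 0.3)])
voter.add_frame([(0.9, 0.9, 0.9), (1.5, 0.0, 0.0)])
first_result = voter.stable_voxels()
voter.add_frame([(1.2, 0.0, 0.0)])  # pushes the first frame out of the window
second_result = voter.stable_voxels()
```

Voting suppresses points seen only once (noise), while surfaces seen repeatedly from different views accumulate votes and fill in the shape.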

DSM-Grounding Method

1. Candidate Retrieval

Parse natural language queries using LLM to identify target entities, anchor entities, and their attributes; retrieve initial candidate sets from DSM through text matching.
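The retrieval step can be sketched as simple substring matching, assuming the LLM has already parsed the query into a target name and a list of attribute words. The dictionary layout and function name below are illustrative assumptions; the paper's actual matching rule is not specified in this summary.

```python
from typing import Dict, List


def retrieve_candidates(
    objects: Dict[int, dict],
    target_name: str,
    target_attrs: List[str],
) -> List[int]:
    """Return ids of map objects whose identity and attributes match the parsed query."""
    hits = []
    for obj_id, obj in objects.items():
        if target_name.lower() not in obj["identity"].lower():
            continue
        described = " ".join(obj["attributes"].values()).lower()
        if all(attr.lower() in described for attr in target_attrs):
            hits.append(obj_id)
    return hits


# Toy map contents; in practice these would come from the DSM object nodes.
objects = {
    0: {"identity": "apple", "attributes": {"color": "red", "material": "organic"}},
    1: {"identity": "apple", "attributes": {"color": "green"}},
    2: {"identity": "shelf", "attributes": {"color": "white", "material": "wood"}},
}
# Assumed parser output for a query such as "the red apple on the white shelf".
red_apples = retrieve_candidates(objects, "apple", ["red"])
all_apples = retrieve_candidates(objects, "apple", [])
```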

2. Latent Relationship Filtering (LRF)

Verify relationship constraints described in queries:

Query relationships R stored in DSM
Use LLM to score consistency between stored relationships and query relationships
Select Top-k candidates, producing refined set O_filtered
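The filtering steps above can be sketched as score-then-rank. In this illustration the LLM consistency scorer is replaced by a trivial field-overlap function so the example runs offline; that stand-in, the relation triple format, and all names are assumptions rather than the paper's design.

```python
from typing import Callable, List, Sequence, Tuple

Relation = Tuple[int, str, str]  # (subject id, predicate, anchor name)
QueryRelation = Tuple[str, str]  # (predicate, anchor name)


def overlap_score(stored: Relation, query: QueryRelation) -> float:
    """Stand-in for the LLM consistency score: fraction of matching fields."""
    _, predicate, anchor = stored
    return ((predicate == query[0]) + (anchor == query[1])) / 2.0


def latent_relationship_filter(
    candidates: Sequence[int],
    relations: Sequence[Relation],
    query: QueryRelation,
    score_fn: Callable[[Relation, QueryRelation], float] = overlap_score,
    k: int = 3,
) -> List[int]:
    """Rank candidates by their best-matching stored relation and keep the top k."""
    scored = []
    for cand in candidates:
        stored = [r for r in relations if r[0] == cand]
        best = max((score_fn(r, query) for r in stored), default=0.0)
        scored.append((best, cand))
    scored.sort(key=lambda item: item[0], reverse=True)  # stable: ties keep input order
    return [cand for _, cand in scored[:k]]


relations = [(0, "next to", "desk"), (1, "on", "shelf"), (2, "on", "table")]
o_filtered = latent_relationship_filter([0, 1, 2], relations, ("on", "shelf"), k=2)
```

Passing the scorer as a callable makes it straightforward to swap the stand-in for a real LLM call without touching the ranking logic.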

3. Multi-Level Verification

Render images from three viewpoints for the final candidate set:

Object Level: Object fills the frame, providing detailed category and attribute information
Location Level: Broader view showing object relationships with adjacent regions
Scene Level: Global context information containing nearly the entire scene

Final decision:

pred = VLM(I, O_filtered, Q)

Experimental Setup

Datasets

ScanRefer: 8 scenes including living rooms, dining rooms, studies, bedrooms, etc.
Nr3D/Sr3D: Reporting Overall, Easy, Hard, View-dependent, View-independent metrics
AI2-THOR: High-fidelity simulator environment
Replica: Large-scale indoor environment dataset

Evaluation Metrics

3D Visual Grounding: Acc@0.25, Acc@0.5 (IoU thresholds)
Semantic Segmentation: mAcc (mean class accuracy), F-mIoU (frequency-weighted mean IoU, as reported by the ConceptFusion and ConceptGraphs baselines)
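For the grounding metric, Acc@0.25 and Acc@0.5 are the share of queries whose predicted box overlaps the ground-truth box by at least the given IoU. A minimal sketch, assuming per-query IoU values are already computed and that the comparison is inclusive (>=), which is an assumption about the benchmark convention:

```python
from typing import Sequence


def accuracy_at_iou(ious: Sequence[float], threshold: float) -> float:
    """Percentage of predictions whose 3D IoU with ground truth meets the threshold."""
    if not ious:
        return 0.0
    hits = sum(1 for iou in ious if iou >= threshold)
    return 100.0 * hits / len(ious)


# Toy per-query IoUs, not results from the paper.
ious = [0.10, 0.30, 0.50, 0.80]
acc_025 = accuracy_at_iou(ious, 0.25)
acc_050 = accuracy_at_iou(ious, 0.50)
```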

Implementation Details

Detection Model: YoloWorld
Segmentation Model: SAM2
Encoders: SigLip (text), DINOv2 (vision)
VLM: GPT-4o-mini, Qwen2.5-VL-7B/72B
Threshold Settings: t_v=0.4, t_x=0.8, t_g=0.3, T=1.5

Experimental Results

Main Results

3D Semantic Segmentation (Replica Dataset)

Method	mAcc	F-mIoU
LSeg (Privileged)	33.39	51.54
OpenSeg (Privileged)	41.19	53.74
ConceptFusion (Zero-shot)	31.53	38.70
ConceptGraphs (Zero-shot)	40.63	35.95
Ours	38.76	67.93

3D Visual Grounding (ScanRefer Dataset)

Best results using Qwen2.5-VL-72B:

Overall Acc@0.5: 59.06% (SOTA, surpassing existing methods by ~10%)
Multiple Acc@0.5: 53.65% (Outstanding performance in multi-object scenes)

Ablation Study (AI2-THOR Dataset)

LRF	Appearance	Physical	Functional	Overall Acc@0.5
✓	✓	✓	✓	60.00
✗	✓	✓	✓	53.64 (-6.36)
✗	✓	✗	✗	49.55
✗	✗	✓	✗	49.09
✗	✗	✗	✓	48.41

Key Findings:

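LRF module contributes a 6.36 percentage point improvement (from 53.64 to 60.00) when all three attribute types are present
Among single-attribute settings, appearance scores highest (49.55), ahead of physical (49.09) and functional (48.41)
All three semantic attribute types contribute positively: combining them (53.64) outperforms any single type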
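The gaps behind these findings follow directly from the ablation table. The short sketch below recomputes each configuration's drop relative to the full model; the accuracy values are copied from the table, while the configuration labels are shorthand introduced here.

```python
# Overall Acc@0.5 per configuration, copied from the ablation table.
ablation = {
    "full (LRF + all attributes)": 60.00,
    "no LRF, all attributes": 53.64,
    "no LRF, appearance only": 49.55,
    "no LRF, physical only": 49.09,
    "no LRF, functional only": 48.41,
}

full = ablation["full (LRF + all attributes)"]
# Drop in percentage points relative to the full model, rounded to two decimals.
drops = {name: round(full - acc, 2) for name, acc in ablation.items()}

for name, drop in drops.items():
    print(f"{name}: -{drop:.2f}")
```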

Robotic Experiments

Simulated Environment: Significantly outperforms existing zero-shot methods in AI2-THOR
Real Environment: Successfully deployed on physical robots for:

Semantic navigation tasks: "Navigate to the central room next to the computer desk"
Semantic grasping tasks: "Grasp the apple on the white shelf on the white cabinet"

Related Work

3D Scene Representation

Early Methods: Kimera and others focused on metric-semantic mapping
Open-Vocabulary Mapping: ConceptFusion creates language-grounded 3D maps
3D Scene Graphs: SceneGraphFusion, Hydra construct hierarchical representations
Our Advantage: DSM provides richer multidimensional attribute representation

3D Visual Grounding

Open-Vocabulary Methods: OpenScene, NuGrounding achieve grounding through feature alignment
VLM Methods: SeeGround, ScanReason employ rendering-prompting strategies
Our Innovation: Transition from direct VLM queries to structured reasoning processes

Conclusions and Discussion

Main Conclusions

DSM framework successfully combines geometric precision with semantic richness
Multidimensional semantic attributes (appearance, physical, functional) significantly enhance grounding performance
Structured reasoning paradigm outperforms direct VLM query methods
Method demonstrates strong performance in both simulated and real environments

Limitations

Upstream Module Dependency: Performance affected by object detection and segmentation quality
Computational Latency: Inference time of large-scale VLMs is considerable
Environmental Adaptability: Primarily tested in indoor environments; outdoor applicability remains unknown

Future Directions

Explore more efficient models to enhance real-time performance
Investigate alternative 3D representation methods to improve robustness
Extend to more complex outdoor environments

In-Depth Evaluation

Strengths

Strong Methodological Innovation: First systematic integration of multidimensional semantic attributes into 3D scene representation
Complete Technical Solution: End-to-end solution from scene construction to grounding reasoning
Comprehensive Experiments: Covers multiple datasets, ablation studies, and real robot validation
Significant Performance Gains: Achieves SOTA on multiple benchmarks, particularly notable F-mIoU improvement

Weaknesses

Computational Complexity: Requires multiple VLM calls, potentially impacting real-time applications
Evaluation Limitations: Primarily evaluated in indoor scenes, lacking large-scale outdoor validation
Strong Dependencies: Highly dependent on VLM quality, potentially subject to model biases
Memory Requirements: Storing rich semantic information may impose memory constraints

Impact

Academic Contribution: Provides new research direction for 3D scene understanding
Practical Value: Directly applicable to service robots and similar real-world applications
Reproducibility: Provides detailed implementation details and project homepage

Applicable Scenarios

Indoor Service Robots: Navigation and manipulation in home and office environments
Augmented Reality Applications: AR systems requiring rich semantic understanding
Intelligent Surveillance: Semantic scene understanding and anomaly detection
Assistive Technology: Environmental description for visually impaired individuals

References

The paper cites 40 relevant references covering multiple domains including 3D scene representation, visual grounding, and robotics, providing readers with comprehensive background knowledge.

Overall Assessment: This is a high-quality research paper presenting an innovative solution in 3D visual grounding. The DSM framework successfully combines geometric precision with semantic richness, providing strong technical support for robot understanding and interaction in complex environments. Despite certain computational and applicability limitations, its technical innovation and experimental validation demonstrate excellence, making significant contributions to the field's advancement.