DSM: Constructing a Diverse Semantic Map for 3D Visual Grounding
Xie, Liang, Li et al.
Effective scene representation is critical for visual grounding, yet existing methods for 3D visual grounding are often constrained. They either focus only on geometric and visual cues, or, like traditional 3D scene graphs, lack the multi-dimensional attributes needed for complex reasoning. To bridge this gap, we introduce the Diverse Semantic Map (DSM), a novel scene representation framework that enriches robust geometric models with a spectrum of VLM-derived semantics, including appearance, physical properties, and affordances. The DSM is first constructed online by fusing multi-view observations within a temporal sliding window, creating a persistent and comprehensive world model. Building on this foundation, we propose DSM-Grounding, a new paradigm that shifts grounding from free-form VLM queries to a structured reasoning process over the semantic-rich map, markedly improving accuracy and interpretability. Extensive evaluations validate our approach's superiority. On the ScanRefer benchmark, DSM-Grounding achieves a state-of-the-art 59.06% overall accuracy at IoU@0.5, surpassing others by 10%. In semantic segmentation, our DSM attains a 67.93% F-mIoU, outperforming all baselines, including privileged ones. Furthermore, successful deployment on physical robots for complex navigation and grasping tasks confirms the framework's practical utility in real-world scenarios.
Effective scene representation is crucial for visual grounding, but existing 3D visual grounding methods have significant limitations. They either focus solely on geometric and visual cues or, like traditional 3D scene graphs, lack the multidimensional attributes necessary for complex reasoning. To address this gap, the paper introduces the Diverse Semantic Map (DSM), a novel scene representation that enriches robust geometric models with VLM-derived semantics covering appearance, physical properties, and affordances. The DSM is constructed online by fusing multi-view observations within a temporal sliding window, creating a persistent and comprehensive world model. Building on this foundation, the paper proposes DSM-Grounding, a paradigm that replaces free-form VLM queries with a structured reasoning process over the semantically rich map, improving both accuracy and interpretability.
Existing 3D visual grounding methods face two primary limitations:
Insufficient Semantic Representation: Most methods focus only on geometric and visual cues, neglecting intrinsic object attributes and contextual interdependencies
Limited Reasoning Capabilities: Traditional 3D scene graphs can only capture simple semantics and struggle to support complex reasoning by large models in intricate environments
For applications such as service robots, merely identifying objects is insufficient; understanding multidimensional object attributes (e.g., color, freshness, weight, location) and their complex relationships is critical for executing sophisticated tasks.
Existing approaches fall into three groups, each with its own shortcoming:
Geometry-Oriented Approaches: Methods such as view selection optimization focus primarily on geometric and visual features and lack semantic understanding
Traditional 3D Scene Graphs: These concentrate on simple semantics and spatial relationships and lack fine-grained multidimensional attributes
Direct VLM Queries: These perform poorly in complex spatial and relational reasoning and are constrained by input format limitations
The goal is therefore to construct a scene representation that is both expressive (encoding rich information) and compact (ensuring cross-platform adaptability), so that it can support complex multidimensional reasoning.
The paper makes three contributions:
DSM Framework: A novel framework that supports complex multidimensional scene representation, integrating semantic understanding with precise localization
Temporal Window Mapping Method: An online construction method that combines geometric and semantic awareness to build semantically rich DSM components
DSM-Grounding: A novel 3D grounding method that uses the DSM for deeper scene reasoning
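The summary does not say how the temporal window is maintained or how observations inside it are merged. Purely as an illustration of the sliding-window idea, the sketch below keeps the most recent frames in a fixed-size buffer and fuses per-object labels by majority vote. The class name `SlidingWindowFuser`, the integer object IDs, and the majority-vote rule are all assumptions made for this example, not the paper's method.

```python
from collections import Counter, deque


class SlidingWindowFuser:
    """Hypothetical helper: keeps the last `window_size` frames and
    fuses the label of each object by majority vote over those frames."""

    def __init__(self, window_size=5):
        # A deque with maxlen silently drops the oldest frame when full.
        self.frames = deque(maxlen=window_size)

    def add_frame(self, detections):
        """`detections` maps an (assumed) object ID to a label string."""
        self.frames.append(dict(detections))

    def fused_labels(self):
        """Return the most frequent label per object ID within the window."""
        votes = {}
        for frame in self.frames:
            for obj_id, label in frame.items():
                votes.setdefault(obj_id, Counter())[label] += 1
        return {obj_id: c.most_common(1)[0][0] for obj_id, c in votes.items()}
```

A real system would also have to associate detections across views geometrically before voting; here the object IDs are assumed to be given.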
Input: Continuous stream of RGB-D observations, natural language queries
Output: 3D position and bounding box of target objects
Constraints: Zero-shot setting without pre-trained category-specific labels
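Because the output is a 3D bounding box and the reported metric is accuracy at IoU@0.5, a small worked example of that metric may help. The sketch below assumes axis-aligned boxes given as `(min_corner, max_corner)` tuples and counts a prediction as correct when its IoU with the ground truth is at least the threshold. The function names, the box format, and the non-strict comparison are assumptions made for illustration.

```python
def box_volume(box):
    """Volume of an axis-aligned box ((x0, y0, z0), (x1, y1, z1));
    negative extents are clamped to zero so empty boxes have volume 0."""
    (x0, y0, z0), (x1, y1, z1) = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0) * max(0.0, z1 - z0)


def iou_3d(a, b):
    """Intersection over union of two axis-aligned 3D boxes."""
    lo = tuple(max(p, q) for p, q in zip(a[0], b[0]))
    hi = tuple(min(p, q) for p, q in zip(a[1], b[1]))
    inter = box_volume((lo, hi))
    union = box_volume(a) + box_volume(b) - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(preds, gts, threshold=0.5):
    """Fraction of predictions whose IoU with the paired ground truth
    reaches `threshold` (assumed non-strict comparison)."""
    hits = sum(iou_3d(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts) if gts else 0.0
```

For instance, two 2×2×2 boxes offset by 1 along x overlap in a volume of 4 and have a union of 12, so their IoU is 1/3 and the prediction would not count at the 0.5 threshold.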
The natural language query is parsed with an LLM to identify the target entities, the anchor entities, and their attributes; an initial candidate set is then retrieved from the DSM through text matching.
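The retrieval half of this step can be sketched as follows, assuming the LLM has already produced a structured parse of the query. The `MapObject` and `ParsedQuery` types, their fields, and the case-insensitive substring rule are hypothetical stand-ins; the summary does not specify the DSM's schema or the exact matching rule.

```python
from dataclasses import dataclass, field


@dataclass
class MapObject:
    """Hypothetical DSM entry: an ID, a text label, and free-form attributes."""
    obj_id: int
    label: str
    attributes: dict = field(default_factory=dict)


@dataclass
class ParsedQuery:
    """Hypothetical structured output of the LLM parsing step."""
    target: str
    anchors: list = field(default_factory=list)
    attributes: dict = field(default_factory=dict)


def match_label(name, obj):
    """Case-insensitive substring match between a query name and a label."""
    return name.lower() in obj.label.lower()


def retrieve_candidates(query, dsm):
    """Return (target candidates, {anchor name: anchor candidates})."""
    targets = [o for o in dsm if match_label(query.target, o)]
    anchors = {a: [o for o in dsm if match_label(a, o)] for a in query.anchors}
    return targets, anchors
```

With a map containing "office chair", "wooden table", and "Chair", a query whose target is "chair" and whose anchor is "table" yields both chairs as target candidates and the table as the anchor candidate. Later reasoning would then have to choose between the candidates.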
The paper cites 40 relevant references covering multiple domains including 3D scene representation, visual grounding, and robotics, providing readers with comprehensive background knowledge.
Overall Assessment: This is a high-quality research paper presenting an innovative solution in 3D visual grounding. The DSM framework successfully combines geometric precision with semantic richness, providing strong technical support for robot understanding and interaction in complex environments. Despite certain computational and applicability limitations, its technical innovation and experimental validation demonstrate excellence, making significant contributions to the field's advancement.