Unified Open-World Segmentation with Multi-Modal Prompts
Liu, Yin, Jing et al.
In this work, we present COSINE, a unified open-world segmentation model that consolidates open-vocabulary segmentation and in-context segmentation with multi-modal prompts (e.g., text and image). COSINE exploits foundation models to extract representations for an input image and corresponding multi-modal prompts, and a SegDecoder to align these representations, model their interaction, and obtain masks specified by input prompts across different granularities. In this way, COSINE overcomes architectural discrepancies, divergent learning objectives, and distinct representation learning strategies of previous pipelines for open-vocabulary segmentation and in-context segmentation. Comprehensive experiments demonstrate that COSINE achieves significant performance improvements in both open-vocabulary and in-context segmentation tasks. Our exploratory analyses highlight that the synergistic collaboration between visual and textual prompts leads to significantly improved generalization over single-modality approaches.
This study proposes COSINE, a unified open-world segmentation model that integrates open-vocabulary segmentation and in-context segmentation and supports multi-modal prompts (such as text and images). COSINE uses foundation models to extract representations of the input image and the corresponding multi-modal prompts, and employs a SegDecoder to align these representations, model their interactions, and produce the masks specified by the input prompts at different granularities. In this way, COSINE overcomes the architectural differences, divergent learning objectives, and distinct representation learning strategies that separated previous open-vocabulary and in-context segmentation pipelines. Comprehensive experiments demonstrate significant performance improvements on both open-vocabulary and in-context segmentation tasks. Exploratory analysis shows that the synergy between visual and textual prompts significantly improves generalization compared to single-modality approaches.
Traditional closed-world segmentation models are limited to recognizing a fixed set of categories encountered during training, whereas open-world segmentation models must locate arbitrary relevant objects in the wild based on user-provided prompts. Current open-world segmentation research primarily revolves around two distinct paradigms:
Open-Vocabulary Segmentation: Replaces learnable classifiers with text embeddings derived from category descriptors, extending traditional closed-set segmentation frameworks to recognize novel categories through natural language alignment
In-Context Segmentation: Leverages contextual cues from exemplar images to achieve adaptive object segmentation in query images
Representation Learning Strategy Variation: Open-vocabulary segmentation relies on multi-modal models for category matching, while in-context segmentation primarily uses visual foundation models for object localization
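The open-vocabulary idea described above, replacing a learnable classifier with text embeddings of category descriptors, can be illustrated with a minimal sketch. This is not COSINE's implementation: the function names, the toy 3-dimensional embeddings, and the temperature value are all assumptions made for illustration. In practice the embeddings would come from a multi-modal foundation model.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_logits(mask_embedding, text_embeddings, temperature=0.07):
    """Score one mask embedding against every category's text embedding.

    The temperature is a hypothetical value chosen for this sketch.
    """
    query = l2_normalize(mask_embedding)
    logits = {}
    for name, embedding in text_embeddings.items():
        key = l2_normalize(embedding)
        logits[name] = sum(a * b for a, b in zip(query, key)) / temperature
    return logits


def classify(mask_embedding, text_embeddings):
    """Return the category whose text embedding is most similar to the mask."""
    logits = cosine_logits(mask_embedding, text_embeddings)
    return max(logits, key=logits.get)


# Toy embeddings standing in for the outputs of a text encoder (assumed values).
text_embeddings = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]}
label = classify([0.9, 0.1, 0.0], text_embeddings)

# A novel category is added by supplying one more text embedding;
# no classifier weights are retrained.
text_embeddings["zebra"] = [0.0, 0.0, 1.0]
novel_label = classify([0.1, 0.0, 0.8], text_embeddings)
```

Because the category set is only a dictionary of embeddings, extending the vocabulary at inference time amounts to adding an entry, which is what distinguishes this setup from a fixed closed-set classifier head.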
Unifying these two paradigms is of considerable importance: relying solely on text may result in insufficient fine-grained semantic abstraction, while image-based examples often lack explicit category boundaries and semantic alignment. Integrating both approaches can fully leverage the complementary advantages of textual and visual modalities.
First Unified Framework: To the authors' knowledge, this is the first method to unify in-context segmentation and open-vocabulary segmentation, proposing the simple yet effective COSINE framework
Significant Performance Improvements: Achieves substantial performance gains on both open-vocabulary and in-context segmentation tasks
Multi-Modal Synergy Insights: Discovers that synergistic collaboration between different modality branches enhances generalization capability in open-world segmentation, providing valuable insights to the research community
Lightweight Design: Effectively unleashes the potential of foundation models in open-world perception by freezing foundation models and training only lightweight decoders
Dual-Path Design: The SegDecoder adopts a dual-path design that promotes interaction among object queries, prompts from different modalities, and image features through self-attention and cross-attention
Unified Representation Space: Converts inputs from different modalities into standardized token sequences, achieving structural unification
Collaborative Training Strategy: Maintains a 1:1 sample ratio between image and text prompts during training
Multi-Modal Collaborative Inference: Supports inference with either single-modality or multi-modality prompts, integrating information from different modalities through a simple averaging fusion mechanism
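The averaging fusion mentioned above can be sketched as follows. The summary does not state which quantity COSINE averages, so this example assumes per-pixel foreground scores from a text branch and an image branch; the function names, the score values, and the 0.5 threshold are hypothetical.

```python
def fuse_prompt_scores(scores_by_modality):
    """Average scores across whichever prompt modalities are present.

    A modality that was not provided is passed as None and is skipped,
    so the same function covers single- and multi-modality inference.
    """
    present = [s for s in scores_by_modality.values() if s is not None]
    if not present:
        raise ValueError("at least one prompt modality is required")
    length = len(present[0])
    if any(len(s) != length for s in present):
        raise ValueError("all modalities must score the same number of pixels")
    return [sum(values) / len(present) for values in zip(*present)]


def to_mask(scores, threshold=0.5):
    """Binarize scores into a mask (the threshold is an assumed value)."""
    return [1 if s > threshold else 0 for s in scores]


# Toy per-pixel scores for a 4-pixel image (assumed values).
text_scores = [0.9, 0.4, 0.2, 0.7]
image_scores = [0.7, 0.8, 0.1, 0.2]

fused = fuse_prompt_scores({"text": text_scores, "image": image_scores})
fused_mask = to_mask(fused)

# With only a text prompt, the fused scores equal the text scores.
text_only_mask = to_mask(fuse_prompt_scores({"text": text_scores, "image": None}))
```

In this toy case the two branches disagree on the second and fourth pixels, and averaging lets the more confident branch decide each one, which is one plausible reading of why combining modalities could help over either prompt alone.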
COSINE is the first method to unify open-vocabulary and in-context segmentation, integrating both paradigms effectively by freezing foundation models and training lightweight decoders.
Effectiveness of Unified Framework: COSINE successfully unifies open-vocabulary and in-context segmentation, achieving state-of-the-art performance on multiple tasks
Importance of Multi-Modal Synergy: Collaboration between visual and textual prompts significantly enhances model generalization capability
Advantages of Lightweight Design: By freezing foundation models, COSINE maintains strong performance while significantly reducing training costs
Closed-Set Performance Trade-off: The gain in open-world generalization comes at some cost in closed-set scenarios (e.g., COCO PQ of 50.6 for COSINE vs. 59.5 for OpenSeeD, a gap of 8.9 points)
Model Pool Constraints: Only explores limited combinations of foundation models, lacking in-depth investigation of more advanced multi-modal large language models (MLLMs) and diffusion models
Computational Cost: Using multiple foundation models inevitably increases computational overhead
The paper cites 73 relevant references covering important works in segmentation, foundation models, multi-modal learning, and other domains, providing a solid theoretical foundation for the research.
Overall Evaluation: This is a high-quality computer vision paper proposing an innovative unified framework for the important problem of open-world segmentation. Despite certain limitations, its technical contributions are clear, experimental results are convincing, and it has significant implications for advancing the field.