Unified Open-World Segmentation with Multi-Modal Prompts

Liu, Yin, Jing et al.
In this work, we present COSINE, a unified open-world segmentation model that consolidates open-vocabulary segmentation and in-context segmentation with multi-modal prompts (e.g., text and image). COSINE exploits foundation models to extract representations for an input image and corresponding multi-modal prompts, and a SegDecoder to align these representations, model their interaction, and obtain masks specified by input prompts across different granularities. In this way, COSINE overcomes architectural discrepancies, divergent learning objectives, and distinct representation learning strategies of previous pipelines for open-vocabulary segmentation and in-context segmentation. Comprehensive experiments demonstrate that COSINE has significant performance improvements in both open-vocabulary and in-context segmentation tasks. Our exploratory analyses highlight that the synergistic collaboration between using visual and textual prompts leads to significantly improved generalization over single-modality approaches.

Unified Open-World Segmentation with Multi-Modal Prompts

Basic Information

  • Paper ID: 2510.10524
  • Title: Unified Open-World Segmentation with Multi-Modal Prompts
  • Authors: Yang Liu, Yufei Yin, Chenchen Jing, Muzhi Zhu, Hao Chen, Yuling Xi, Bo Feng, Hao Wang, Shiyu Li, Chunhua Shen
  • Category: cs.CV
  • Publication Date: October 12, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.10524

Abstract

This study proposes COSINE, a unified open-world segmentation model that integrates open-vocabulary segmentation and in-context segmentation and supports multi-modal prompts (such as text and images). COSINE uses foundation models to extract representations of the input image and the corresponding multi-modal prompts, and employs a SegDecoder to align these representations, model their interactions, and produce the masks specified by the input prompts at different granularities. In this way, COSINE overcomes the architectural discrepancies, divergent learning objectives, and distinct representation learning strategies of previous open-vocabulary and in-context segmentation pipelines. Comprehensive experiments show significant performance improvements on both open-vocabulary and in-context segmentation tasks. Exploratory analysis indicates that combining visual and textual prompts generalizes significantly better than single-modality approaches.

Research Background and Motivation

Problem Definition

Traditional closed-world segmentation models are limited to recognizing a fixed set of categories encountered during training, whereas open-world segmentation models must locate arbitrary relevant objects in the wild based on user-provided prompts. Current open-world segmentation research primarily revolves around two distinct paradigms:

  1. Open-Vocabulary Segmentation: Replaces learnable classifiers with text embeddings derived from category descriptors, extending traditional closed-set segmentation frameworks to recognize novel categories through natural language alignment
  2. In-Context Segmentation (also called contextual segmentation): Leverages contextual cues from exemplar images to achieve adaptive object segmentation in query images
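
The first paradigm can be illustrated with a minimal sketch: instead of a fixed learnable classifier, a region embedding is compared against one text embedding per category name, so a new category only requires a new text embedding. The function names and the toy 3-dimensional vectors below are hypothetical and chosen for illustration; real embeddings would come from a text encoder such as CLIP's, and this is not the paper's implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def classify_region(region_embedding, text_embeddings):
    """Return the category whose text embedding is most similar to the region."""
    scores = {name: cosine(region_embedding, emb)
              for name, emb in text_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Toy embeddings, made up for illustration only.
text_embeddings = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "zebra": [0.0, 0.0, 1.0],  # a "novel" category added without retraining
}
label, scores = classify_region([0.1, 0.2, 0.9], text_embeddings)
print(label)
```

Extending the vocabulary here means adding one more entry to `text_embeddings`; no classifier weights change.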

Research Motivation

Existing methods primarily suffer from three core issues:

  1. Architectural Divergence: Different methods adopt fundamentally different architectural designs (e.g., SegGPT uses a ViT encoder architecture, while ODISE employs a Mask2Former encoder-decoder structure)
  2. Learning Objective Divergence: Open-vocabulary segmentation focuses on image-text semantic alignment, whereas contextual segmentation emphasizes reference-query relationship modeling
  3. Representation Learning Strategy Variation: Open-vocabulary segmentation relies on multi-modal models for category matching, while contextual segmentation primarily uses visual foundation models for object localization

Significance

Unifying these two paradigms is of considerable importance: relying solely on text may result in insufficient fine-grained semantic abstraction, while image-based examples often lack explicit category boundaries and semantic alignment. Integrating both approaches can fully leverage the complementary advantages of textual and visual modalities.

Core Contributions

  1. First Unified Framework: To the authors' knowledge, this is the first method to unify contextual segmentation and open-vocabulary segmentation, proposing the simple yet effective COSINE framework
  2. Significant Performance Improvements: Achieves substantial performance gains on both open-vocabulary and contextual segmentation tasks
  3. Multi-Modal Synergy Insights: Discovers that synergistic collaboration between different modality branches enhances generalization capability in open-world segmentation, providing valuable insights to the research community
  4. Lightweight Design: Effectively unleashes the potential of foundation models in open-world perception by freezing foundation models and training only lightweight decoders

Method Details

Task Definition

COSINE addresses the unified open-world segmentation task, defined as follows:

  • Input: Target image
  • Input: Multi-modal prompts (textual descriptions or exemplar images)
  • Output: Segmentation masks at different granularities (semantic, instance, panoptic segmentation, etc.)

Model Architecture

Overall Design

COSINE adopts a simple design philosophy comprising two main components:

  1. Model Pool: Extracts features from target images and different modality prompts
  2. SegDecoder: Decoder-only segmentation model that processes image and prompt features

Model Pool

  • Visual Models: DINOv2 and CLIP vision encoders
  • Language Model: CLIP text encoder
  • Input Processing:
    • Target image: Encoded by all visual models into image features $F = \{F_i\}_{i}^{P}$
    • Visual prompts: Encoded by DINOv2 and pooled with contextual masks into prompt tokens $V = \{v_i\}_{i}^{M}$
    • Text prompts: Extracted by language model into text features $T = \{t_i\}_{i}^{N}$
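
The visual-prompt step (pooling encoder features with a mask into a prompt token) can be sketched as follows. The summary does not state which pooling operator is used, so masked average pooling is an assumption here, and `masked_average_pool` is a hypothetical helper name; features are plain nested lists rather than real DINOv2 outputs.

```python
def masked_average_pool(features, mask):
    """Average the feature vectors at positions where mask == 1.

    features: H x W grid of C-dimensional vectors (nested lists).
    mask:     H x W grid of 0/1 values marking the exemplar object.
    Returns one C-dimensional prompt token.
    Assumption: average pooling; the actual pooling operator may differ.
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, m in zip(feat_row, mask_row):
            if m:
                count += 1
                for c in range(channels):
                    total[c] += vec[c]
    if count == 0:
        raise ValueError("mask selects no positions")
    return [t / count for t in total]

# Toy 2 x 2 feature map with 2 channels.
features = [
    [[1.0, 0.0], [3.0, 2.0]],
    [[5.0, 4.0], [7.0, 6.0]],
]
mask = [
    [0, 1],
    [1, 0],
]
token = masked_average_pool(features, mask)  # averages [3, 2] and [5, 4]
print(token)
```

Each exemplar mask yields one token, so M masks would give the M prompt tokens described above.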

SegDecoder Architecture

Contains four core modules:

  1. Adapter Suite:
    • Feature Blender: Fuses different image features
    • V-Adapter and T-Adapter: Align feature dimensions of images and various modality prompts
  2. Image-Prompt Aligner:
    ⟨F', V', T'⟩ = Alignment(F, V, T; θ)

    Aligns images and different modality prompts through self-attention, cross-attention, and feed-forward networks
  3. Pixel Decoder:
    • Single-scale: Two transposed convolution layers achieving 4× upsampling
    • Multi-scale: Deformable attention Transformer
  4. Multi-Modality Decoder:
    ⟨Q_r, V_r, T_r⟩ = Decoder(Q, V', T', F', F_mask; φ)

    Adopts a dual-path design that promotes interactions between object queries, different modality prompts, and image features through self-attention and cross-attention
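
Both the Image-Prompt Aligner and the Multi-Modality Decoder are described as relying on cross-attention. The sketch below shows the basic operation only: prompt tokens acting as queries over image tokens with single-head scaled dot-product attention. It is a simplification under stated assumptions: learned projections, multiple heads, residual connections, and feed-forward layers are omitted, and all names and values are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    d = len(keys[0])
    out_dim = len(values[0])
    out = []
    for q in queries:
        logits = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(logits)
        out.append([sum(w * v[c] for w, v in zip(weights, values))
                    for c in range(out_dim)])
    return out

# Two prompt tokens attend over three image tokens (toy values).
prompt_tokens = [[1.0, 0.0], [0.0, 1.0]]
image_tokens = [[10.0, 0.0], [0.0, 10.0], [0.0, 0.0]]
updated = cross_attention(prompt_tokens, image_tokens, image_tokens)
print(updated)
```

Each updated prompt token is a weighted mix of image tokens, dominated by the image token it matches best.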

Technical Innovations

  1. Unified Representation Space: Converts inputs from different modalities into standardized token sequences, achieving structural unification
  2. Collaborative Training Strategy: Maintains a 1:1 sample ratio between image and text prompts during training
  3. Multi-Modal Collaborative Inference: Supports inference with single-modality or multi-modality prompts, integrating information from different modalities through a simple averaging fusion mechanism
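
The averaging fusion in item 3 can be sketched as below. The summary does not say exactly which quantities are averaged, so this example assumes per-category scores from a visual-prompt branch and a text-prompt branch; the function name, the category names, and the rule for categories present in only one branch are all hypothetical.

```python
def fuse_scores(visual_scores, text_scores):
    """Average per-category scores from the two prompt branches.

    Assumption: a category that appears in only one branch keeps
    that branch's score unchanged.
    """
    fused = {}
    for name in set(visual_scores) | set(text_scores):
        parts = [s[name] for s in (visual_scores, text_scores) if name in s]
        fused[name] = sum(parts) / len(parts)
    return fused

# Toy scores for one predicted mask (made up for illustration).
visual_scores = {"scratch": 0.75, "dent": 0.25}
text_scores = {"scratch": 0.25, "dent": 0.5, "stain": 0.125}

fused = fuse_scores(visual_scores, text_scores)
best = max(fused, key=fused.get)
print(best, fused[best])
```

With a single modality the function reduces to a pass-through, which matches the idea that the same model serves both single-modality and multi-modality inference.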

Experimental Setup

Datasets

  • COCO: 118K training images, 5K validation images, supporting multiple segmentation tasks
  • Objects365: 365 object categories, 638K images, using enhanced Objects365-SAM version
  • Referring Segmentation Datasets: refCLEF, refCOCO, refCOCO+, refCOCOg
  • Evaluation Datasets: LVIS, ADE20K, Cityscapes, DAVIS 2017, YouTube-VOS 2019, etc.

Evaluation Metrics

  • Few-shot Segmentation: mIoU (one-shot and few-shot learning)
  • Instance Segmentation: AP (all categories) and APr (rare categories)
  • Panoptic Segmentation: PQ (panoptic quality) and AP
  • Video Object Segmentation: J&F score
  • Referring Segmentation: cIoU

Implementation Details

  • Foundation Models: DINOv2 (ViT-L) and CLIP (ConvNeXt-Large)
  • Trainable Parameters: 25M for single-scale, 32M for multi-scale
  • Training Settings: 50K steps, batch size 64, Adam optimizer, learning rate 1e-4
  • Data Augmentation: Random horizontal flipping and large-scale jittering (LSJ)
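
For reference, the stated training settings can be collected in a small configuration object. This is only a convenience sketch: the class and field names are hypothetical, and the values are copied from the list above rather than from released code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    """Training settings as reported in this summary (names are hypothetical)."""
    steps: int = 50_000
    batch_size: int = 64
    learning_rate: float = 1e-4
    optimizer: str = "adam"
    trainable_params_millions: int = 25  # 32 for the multi-scale variant

    @property
    def samples_seen(self) -> int:
        """Number of training samples processed over the whole schedule."""
        return self.steps * self.batch_size

cfg = TrainConfig()
multi_scale = TrainConfig(trainable_params_millions=32)
print(cfg.samples_seen)
```

At 50K steps with batch size 64, the schedule processes 3.2 million samples in total.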

Experimental Results

Main Results

Few-shot Semantic Segmentation (LVIS-92i)

  • One-shot: 35.2 mIoU (vs. Matcher 33.0, SINE 31.2)
  • Few-shot: 40.7 mIoU (vs. Matcher 40.0, SINE 35.5)

Few-shot Instance Segmentation (LVIS)

  • AP: 20.3 (significantly outperforming DINOv's 15.4)
  • APr: 25.8 (excellent performance on rare categories)

Open-Vocabulary Panoptic Segmentation

  • ADE20K: PQ 31.0, AP 21.1 (outperforming ODISE's 23.4 PQ, 13.9 AP)
  • Cityscapes: PQ 35.7, AP 15.6 (comparable to SOTA methods)

Open-Vocabulary Semantic Segmentation

  • A-847: 15.6 mIoU
  • PC-459: 19.2 mIoU

Ablation Studies

Vision-Text Interaction Effects

Training Phase (10K steps training):

  • Vision branch only: LVIS-92i one-shot 24.5 mIoU
  • Text branch only: ADE20K PQ 13.2
  • Multi-modal joint: Significant performance improvements for both branches

Inference Phase:

  • Multi-modal collaboration improves LVIS-92i from 35.2 to 43.1 mIoU
  • Improves ADE20K from 31.0 to 31.4 PQ
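
To put the two inference-phase numbers on the same scale, the absolute and relative gains can be computed directly from the figures above. The helper name is hypothetical; only the four reported values are used.

```python
def gain(before, after):
    """Return (absolute gain, relative gain in percent), rounded to one decimal."""
    absolute = round(after - before, 1)
    relative = round(100 * (after - before) / before, 1)
    return absolute, relative

lvis_gain = gain(35.2, 43.1)  # LVIS-92i one-shot mIoU
ade_gain = gain(31.0, 31.4)   # ADE20K PQ
print(lvis_gain, ade_gain)
```

The in-context benchmark gains about 7.9 mIoU (roughly 22% relative), while the open-vocabulary benchmark gains about 0.4 PQ (roughly 1% relative), so the reported benefit of multi-modal inference is much larger on the visual-prompt side.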

Component Contribution Analysis

  • DINOv2 encoder only: Significant performance degradation on open-vocabulary tasks
  • CLIP encoder only: Performance degradation on contextual tasks
  • Removing Feature Blender: Noticeable performance decline
  • Removing Image-Prompt Aligner: Decline across all metrics

Case Analysis

The paper presents qualitative results across diverse scenarios:

  • Industrial Inspection: Visual and textual prompts collaboratively segment defects accurately
  • Medical Imaging: Multi-modal prompts application in complex medical images
  • General Scenes: Unified handling of segmentation tasks at different granularities

Related Work

Open-World Segmentation

  • Open-Vocabulary Segmentation: ODISE, FC-CLIP, OpenSeeD and other methods focus on text-image alignment
  • Contextual Segmentation: SegGPT, PerSAM, Matcher, DINOv and other methods leverage visual exemplars

Visual Foundation Models

  • Self-Supervised Learning: MAE, DINOv2 provide powerful visual features
  • Multi-Modal Learning: CLIP achieves image-text alignment through contrastive learning
  • Universal Segmentation: SAM achieves category-agnostic zero-shot segmentation

According to the authors, COSINE is the first method to unify open-vocabulary and in-context (contextual) segmentation. It integrates the two paradigms by freezing the foundation models and training only a lightweight decoder.

Conclusions and Discussion

Main Conclusions

  1. Effectiveness of Unified Framework: COSINE successfully unifies open-vocabulary and contextual segmentation, achieving SOTA performance on multiple tasks
  2. Importance of Multi-Modal Synergy: Collaboration between visual and textual prompts significantly enhances model generalization capability
  3. Advantages of Lightweight Design: By freezing foundation models, COSINE maintains strong performance while significantly reducing training costs

Limitations

  1. Closed-Set Performance Trade-off: To enhance open-world generalization, performance on closed-set scenarios is somewhat reduced (e.g., COCO PQ 50.6 vs. OpenSeeD 59.5)
  2. Model Pool Constraints: Only explores limited combinations of foundation models, lacking in-depth investigation of more advanced MLLMs and diffusion models
  3. Computational Cost: Using multiple foundation models inevitably increases computational overhead

Future Directions

  1. Knowledge Distillation: Distill knowledge from multiple models into a single model to reduce computational costs
  2. More Foundation Models: Explore more advanced foundation models such as MLLMs and diffusion models
  3. Architecture Optimization: Further optimize unified architecture design

In-Depth Evaluation

Strengths

  1. Strong Novelty: First to propose a framework unifying open-vocabulary and contextual segmentation, addressing an important technical problem
  2. Comprehensive Experiments: Thorough evaluation across multiple datasets and tasks, including detailed ablation studies
  3. Clear Technical Contributions: Provides practical solutions through frozen foundation models and lightweight decoder design
  4. In-Depth Analysis: Conducts thorough exploratory analysis of multi-modal synergy effects

Weaknesses

  1. Insufficient Theoretical Analysis: Lacks theoretical explanation for why multi-modal collaboration is effective
  2. Limited Foundation Model Selection: Insufficient exploration of other possible foundation model combinations
  3. Inadequate Computational Efficiency Analysis: Lacks a detailed analysis of the computational overhead introduced by running multiple foundation models

Impact

  1. Academic Value: Provides new unified perspective for open-world segmentation, potentially inspiring subsequent research
  2. Practical Value: Lightweight design makes the method practically applicable
  3. Reproducibility: Authors commit to open-sourcing code, facilitating adoption and improvement by research community

Applicable Scenarios

  • Autonomous Driving: Requires recognition and segmentation of various objects on roads
  • Interactive Robotics: Requires segmentation based on natural language instructions or visual exemplars
  • Medical Image Analysis: Combines textual descriptions and visual exemplars for lesion segmentation
  • Industrial Inspection: Defect detection based on multi-modal prompts

References

The paper cites 73 relevant references covering important works in segmentation, foundation models, multi-modal learning and other domains, providing solid theoretical foundation for the research.


Overall Evaluation: This is a high-quality computer vision paper proposing an innovative unified framework for the important problem of open-world segmentation. Despite certain limitations, its technical contributions are clear, experimental results are convincing, and it has significant implications for advancing the field.