Unified Open-World Segmentation with Multi-Modal Prompts

Liu, Yin, Jing et al.

In this work, we present COSINE, a unified open-world segmentation model that consolidates open-vocabulary segmentation and in-context segmentation with multi-modal prompts (e.g., text and image). COSINE exploits foundation models to extract representations for an input image and corresponding multi-modal prompts, and a SegDecoder to align these representations, model their interaction, and obtain masks specified by input prompts across different granularities. In this way, COSINE overcomes architectural discrepancies, divergent learning objectives, and distinct representation learning strategies of previous pipelines for open-vocabulary segmentation and in-context segmentation. Comprehensive experiments demonstrate that COSINE has significant performance improvements in both open-vocabulary and in-context segmentation tasks. Our exploratory analyses highlight that the synergistic collaboration between using visual and textual prompts leads to significantly improved generalization over single-modality approaches.

Basic Information

Paper ID: 2510.10524
Title: Unified Open-World Segmentation with Multi-Modal Prompts
Authors: Yang Liu, Yufei Yin, Chenchen Jing, Muzhi Zhu, Hao Chen, Yuling Xi, Bo Feng, Hao Wang, Shiyu Li, Chunhua Shen
Category: cs.CV
Publication Date: October 12, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.10524

Abstract

This study proposes COSINE, a unified open-world segmentation model that integrates open-vocabulary segmentation and in-context (contextual) segmentation and supports multi-modal prompts such as text and images. COSINE uses foundation models to extract representations of the input image and the corresponding multi-modal prompts, and employs a SegDecoder to align these representations, model their interactions, and produce the masks specified by the input prompts at different granularities. In this way, COSINE overcomes the architectural differences, divergent learning objectives, and distinct representation learning strategies that separate previous open-vocabulary and contextual segmentation pipelines. Comprehensive experiments show significant performance improvements on both open-vocabulary and contextual segmentation tasks. Exploratory analysis indicates that combining visual and textual prompts generalizes significantly better than single-modality approaches.

Research Background and Motivation

Problem Definition

Traditional closed-world segmentation models are limited to recognizing a fixed set of categories encountered during training, whereas open-world segmentation models must locate arbitrary relevant objects in the wild based on user-provided prompts. Current open-world segmentation research primarily revolves around two distinct paradigms:

Open-Vocabulary Segmentation: Replaces learnable classifiers with text embeddings derived from category descriptors, extending traditional closed-set segmentation frameworks to recognize novel categories through natural language alignment
Contextual (In-Context) Segmentation: Uses contextual cues from exemplar images to segment the matching objects in query images

Research Motivation

Existing methods primarily suffer from three core issues:

Architectural Divergence: Different methods adopt fundamentally different architectural designs (e.g., SegGPT uses a ViT encoder architecture, while ODISE employs a Mask2Former encoder-decoder structure)
Learning Objective Divergence: Open-vocabulary segmentation focuses on image-text semantic alignment, whereas contextual segmentation emphasizes reference-query relationship modeling
Representation Learning Strategy Variation: Open-vocabulary segmentation relies on multi-modal models for category matching, while contextual segmentation primarily uses visual foundation models for object localization

Significance

Unifying these two paradigms matters because each modality has a weakness the other can cover: text alone is often too abstract to convey fine-grained semantics, while image exemplars often lack explicit category boundaries and semantic alignment. Integrating both exploits the complementary strengths of the textual and visual modalities.

Core Contributions

First Unified Framework: To the authors' knowledge, this is the first method to unify contextual segmentation and open-vocabulary segmentation, proposing the simple yet effective COSINE framework
Significant Performance Improvements: Achieves substantial performance gains on both open-vocabulary and contextual segmentation tasks
Multi-Modal Synergy Insights: Discovers that synergistic collaboration between different modality branches enhances generalization capability in open-world segmentation, providing valuable insights to the research community
Lightweight Design: Effectively unleashes the potential of foundation models in open-world perception by freezing foundation models and training only lightweight decoders

Method Details

Task Definition

COSINE addresses the unified open-world segmentation task, defined as follows:

Inputs: A target image and multi-modal prompts (textual descriptions or exemplar images)
Output: Segmentation masks at different granularities (semantic, instance, panoptic segmentation, etc.)

Model Architecture

Overall Design

COSINE adopts a simple design philosophy comprising two main components:

Model Pool: Extracts features from target images and different modality prompts
SegDecoder: Decoder-only segmentation model that processes image and prompt features

Model Pool

Visual Models: DINOv2 and CLIP vision encoders
Language Model: CLIP text encoder
Input Processing:
- Target image: Encoded by all visual models into image features $F = \{F_i\}_{i=1}^{P}$
- Visual prompts: Encoded by DINOv2 and pooled with contextual masks into prompt tokens $V = \{v_i\}_{i=1}^{M}$
- Text prompts: Extracted by the language model into text features $T = \{t_i\}_{i=1}^{N}$
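
The pooling of visual prompts with contextual masks can be illustrated with masked average pooling. The following is a minimal sketch, not the authors' implementation: the function name `masked_average_pool`, the nested-list feature layout, and the choice of plain averaging are all assumptions made for illustration, and the toy grid stands in for real DINOv2 patch features.

```python
from typing import List

Vector = List[float]


def masked_average_pool(features: List[List[Vector]], mask: List[List[int]]) -> Vector:
    """Average the feature vectors at positions where mask == 1.

    features: H x W grid of C-dimensional patch features (toy stand-in for encoder output).
    mask:     H x W binary grid marking the exemplar object.
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, keep in zip(feat_row, mask_row):
            if keep:
                count += 1
                for c in range(channels):
                    total[c] += vec[c]
    if count == 0:
        raise ValueError("mask selects no positions")
    return [value / count for value in total]


# Toy 2 x 2 feature map with 2 channels; the mask selects the left column.
features = [
    [[1.0, 0.0], [5.0, 5.0]],
    [[3.0, 2.0], [7.0, 7.0]],
]
mask = [
    [1, 0],
    [1, 0],
]
prompt_token = masked_average_pool(features, mask)  # one token for this exemplar
```

Each exemplar mask yields one prompt token this way, so M masks would give the M tokens in $V$.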

SegDecoder Architecture

Contains four core modules:

Adapter Suite:
- Feature Blender: Fuses different image features
- V-Adapter and T-Adapter: Align feature dimensions of images and various modality prompts
Image-Prompt Aligner:
```
⟨F', V', T'⟩ = Alignment(F, V, T; θ)
```
Aligns images and different modality prompts through self-attention, cross-attention, and feed-forward networks
Pixel Decoder:
- Single-scale: Two transposed convolution layers achieving 4× upsampling
- Multi-scale: Deformable attention Transformer
Multi-Modality Decoder:
```
⟨Q_r, V_r, T_r⟩ = Decoder(Q, V', T', F', F_mask; φ)
```
Adopts a dual-path design that promotes interactions between object queries, different modality prompts, and image features through self-attention and cross-attention
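
To make the 4× upsampling of the single-scale Pixel Decoder concrete, the sketch below applies the standard transposed-convolution output-size formula twice. The kernel size of 2 and stride of 2 per layer are assumptions chosen so that each layer doubles the resolution; the summary only states that two transposed convolution layers achieve 4× upsampling, and the function names are hypothetical.

```python
def transposed_conv_output_size(size: int, kernel: int, stride: int,
                                padding: int = 0, output_padding: int = 0) -> int:
    """Spatial output size of a transposed convolution with dilation 1."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


def single_scale_decoder_size(size: int, num_layers: int = 2) -> int:
    """Apply `num_layers` stride-2, kernel-2 transposed convolutions in sequence."""
    for _ in range(num_layers):
        size = transposed_conv_output_size(size, kernel=2, stride=2)
    return size


feature_size = 32                                     # assumed feature-map side length
mask_size = single_scale_decoder_size(feature_size)   # 32 -> 64 -> 128
```

Under these assumptions a 32 × 32 feature map becomes a 128 × 128 mask feature map, i.e. 4× per side.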

Technical Innovations

Unified Representation Space: Converts inputs from different modalities into standardized token sequences, achieving structural unification
Collaborative Training Strategy: Maintains a 1:1 sample ratio between image and text prompts during training
Multi-Modal Collaborative Inference: Supports inference with either single-modality or multi-modality prompts, integrating information from different modalities through a simple averaging fusion mechanism
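
The averaging fusion used at inference can be sketched as follows. This is an illustration under stated assumptions rather than the paper's code: it assumes the quantities being averaged are per-category matching scores produced by the visual and text branches, and the function name `fuse_prompt_scores` and the category names are hypothetical.

```python
from typing import Dict, Optional

Scores = Dict[str, float]


def fuse_prompt_scores(visual: Optional[Scores], text: Optional[Scores]) -> Scores:
    """Average per-category scores from the two branches.

    If only one modality is provided, its scores are returned unchanged,
    which corresponds to single-modality inference.
    """
    if visual is None and text is None:
        raise ValueError("at least one modality is required")
    if visual is None:
        return dict(text)
    if text is None:
        return dict(visual)
    if visual.keys() != text.keys():
        raise ValueError("both branches must score the same categories")
    return {name: (visual[name] + text[name]) / 2 for name in visual}


visual_scores = {"scratch": 0.75, "dent": 0.25}   # from exemplar-image prompts
text_scores = {"scratch": 0.25, "dent": 0.5}      # from text prompts
fused = fuse_prompt_scores(visual_scores, text_scores)
```

Averaging keeps the two branches symmetric and adds no trainable parameters, which is consistent with the "simple" fusion described above.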

Experimental Setup

Datasets

COCO: 118K training images, 5K validation images, supporting multiple segmentation tasks
Objects365: 365 object categories, 638K images, using enhanced Objects365-SAM version
Referring Segmentation Datasets: refCLEF, refCOCO, refCOCO+, refCOCOg
Evaluation Datasets: LVIS, ADE20K, Cityscapes, DAVIS 2017, YouTube-VOS 2019, etc.

Evaluation Metrics

Few-shot Segmentation: mIoU (one-shot and few-shot learning)
Instance Segmentation: AP (all categories) and APr (rare categories)
Panoptic Segmentation: PQ (panoptic quality) and AP
Video Object Segmentation: J&F score
Referring Segmentation: cIoU
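
The two IoU-based metrics above differ in how they aggregate. The sketch below contrasts them on flattened binary masks. It is a simplified illustration: mIoU is averaged per sample here, whereas few-shot benchmarks typically average per class, and cIoU is taken as total intersection over total union, the usual convention in referring segmentation. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Mask = Sequence[int]  # flattened binary mask


def intersection_and_union(pred: Mask, target: Mask) -> Tuple[int, int]:
    """Count pixels in the intersection and in the union of two binary masks."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    return inter, union


def mean_iou(pairs: List[Tuple[Mask, Mask]]) -> float:
    """Average of per-sample IoU values (empty-vs-empty counts as 1.0)."""
    ious = []
    for pred, target in pairs:
        inter, union = intersection_and_union(pred, target)
        ious.append(inter / union if union else 1.0)
    return sum(ious) / len(ious)


def cumulative_iou(pairs: List[Tuple[Mask, Mask]]) -> float:
    """Total intersection divided by total union over all samples."""
    total_inter = 0
    total_union = 0
    for pred, target in pairs:
        inter, union = intersection_and_union(pred, target)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 1.0


pairs = [
    ([1, 1, 0, 0], [1, 0, 0, 0]),  # IoU = 1 / 2
    ([1, 1, 1, 1], [1, 1, 1, 1]),  # IoU = 4 / 4
]
miou = mean_iou(pairs)        # (0.5 + 1.0) / 2
ciou = cumulative_iou(pairs)  # (1 + 4) / (2 + 4)
```

Because cIoU pools pixels across samples, large objects weigh more heavily than in the per-sample mean.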

Implementation Details

Foundation Models: DINOv2 (ViT-L) and CLIP (ConvNeXt-Large)
Trainable Parameters: 25M for single-scale, 32M for multi-scale
Training Settings: 50K steps, batch size 64, Adam optimizer, learning rate 1e-4
Data Augmentation: Random horizontal flipping and large-scale jittering (LSJ)
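
The 1:1 ratio between image-prompt and text-prompt samples could be enforced per batch, as in the sketch below. Whether the authors balance within each batch or only across the whole schedule is not stated in this summary, so per-batch balancing is an assumption, as are the function name `build_mixed_batch` and the use of integer IDs as stand-ins for training samples. The batch size of 64 comes from the training settings above.

```python
import random
from typing import List, Sequence, Tuple


def build_mixed_batch(visual_samples: Sequence[int], text_samples: Sequence[int],
                      batch_size: int, rng: random.Random) -> List[Tuple[str, int]]:
    """Draw half the batch from each prompt modality, then shuffle."""
    if batch_size % 2 != 0:
        raise ValueError("batch_size must be even for an exact 1:1 split")
    half = batch_size // 2
    # Sampling with replacement keeps the sketch valid for small pools.
    batch = [("visual", s) for s in rng.choices(visual_samples, k=half)]
    batch += [("text", s) for s in rng.choices(text_samples, k=half)]
    rng.shuffle(batch)
    return batch


rng = random.Random(0)
batch = build_mixed_batch(list(range(100)), list(range(100)), batch_size=64, rng=rng)
num_visual = sum(1 for modality, _ in batch if modality == "visual")
num_text = len(batch) - num_visual
```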

Experimental Results

Main Results

Few-shot Semantic Segmentation (LVIS-92i)

One-shot: 35.2 mIoU (vs. Matcher 33.0, SINE 31.2)
Few-shot: 40.7 mIoU (vs. Matcher 40.0, SINE 35.5)

Few-shot Instance Segmentation (LVIS)

AP: 20.3 (significantly outperforming DINOv's 15.4)
APr: 25.8 (excellent performance on rare categories)

Open-Vocabulary Panoptic Segmentation

ADE20K: PQ 31.0, AP 21.1 (outperforming ODISE's 23.4 PQ, 13.9 AP)
Cityscapes: PQ 35.7, AP 15.6 (comparable to SOTA methods)

Open-Vocabulary Semantic Segmentation

A-847: 15.6 mIoU
PC-459: 19.2 mIoU

Ablation Studies

Vision-Text Interaction Effects

Training Phase (10K steps training):

Vision branch only: LVIS-92i one-shot 24.5 mIoU
Text branch only: ADE20K PQ 13.2
Multi-modal joint training: Significant improvements for both branches over the single-branch results above

Inference Phase:

LVIS-92i: Multi-modal collaboration improves one-shot mIoU from 35.2 to 43.1 (+7.9)
ADE20K: Multi-modal collaboration improves PQ from 31.0 to 31.4 (+0.4)

Component Contribution Analysis

DINOv2 encoder only: Significant performance degradation on open-vocabulary tasks
CLIP encoder only: Performance degradation on contextual tasks
Removing Feature Blender: Noticeable performance decline
Removing Image-Prompt Aligner: Decline across all metrics

Case Analysis

The paper presents qualitative results across diverse scenarios:

Industrial Inspection: Visual and textual prompts collaboratively segment defects accurately
Medical Imaging: Multi-modal prompts are applied to complex medical images
General Scenes: Unified handling of segmentation tasks at different granularities

Related Work

Open-World Segmentation

Open-Vocabulary Segmentation: ODISE, FC-CLIP, OpenSeeD and other methods focus on text-image alignment
Contextual Segmentation: SegGPT, PerSAM, Matcher, DINOv and other methods leverage visual exemplars

Visual Foundation Models

Self-Supervised Learning: MAE, DINOv2 provide powerful visual features
Multi-Modal Learning: CLIP achieves image-text alignment through contrastive learning
Universal Segmentation: SAM achieves category-agnostic zero-shot segmentation

According to the authors, COSINE is the first method to unify open-vocabulary and contextual segmentation. It integrates the two paradigms by freezing the foundation models and training only a lightweight decoder.

Conclusions and Discussion

Main Conclusions

Effectiveness of Unified Framework: COSINE successfully unifies open-vocabulary and contextual segmentation, achieving SOTA performance on multiple tasks
Importance of Multi-Modal Synergy: Collaboration between visual and textual prompts significantly enhances model generalization capability
Advantages of Lightweight Design: By freezing foundation models, COSINE maintains strong performance while significantly reducing training costs

Limitations

Closed-Set Performance Trade-off: To enhance open-world generalization, performance on closed-set scenarios is somewhat reduced (e.g., COCO PQ 50.6 vs. OpenSeeD 59.5)
Model Pool Constraints: Only a limited set of foundation model combinations is explored, without in-depth investigation of more advanced multi-modal large language models (MLLMs) or diffusion models
Computational Cost: Using multiple foundation models inevitably increases computational overhead

Future Directions

Knowledge Distillation: Distill knowledge from multiple models into a single model to reduce computational costs
More Foundation Models: Explore more advanced foundation models such as MLLMs and diffusion models
Architecture Optimization: Further optimize unified architecture design

In-Depth Evaluation

Strengths

Strong Novelty: First to propose a framework unifying open-vocabulary and contextual segmentation, addressing an important technical problem
Comprehensive Experiments: Thorough evaluation across multiple datasets and tasks, including detailed ablation studies
Clear Technical Contributions: Provides practical solutions through frozen foundation models and lightweight decoder design
In-Depth Analysis: Conducts thorough exploratory analysis of multi-modal synergy effects

Weaknesses

Insufficient Theoretical Analysis: Lacks theoretical explanation for why multi-modal collaboration is effective
Limited Foundation Model Selection: Insufficient exploration of other possible foundation model combinations
Inadequate Computational Efficiency Analysis: Lacks a detailed analysis of the computational overhead introduced by running multiple foundation models

Impact

Academic Value: Provides a new unified perspective on open-world segmentation, potentially inspiring subsequent research
Practical Value: Training only a lightweight decoder on frozen foundation models keeps the method practical to adopt
Reproducibility: The authors commit to open-sourcing the code, facilitating adoption and improvement by the research community

Applicable Scenarios

Autonomous Driving: Requires recognition and segmentation of various objects on roads
Interactive Robotics: Requires segmentation based on natural language instructions or visual exemplars
Medical Image Analysis: Combines textual descriptions and visual exemplars for lesion segmentation
Industrial Inspection: Defect detection based on multi-modal prompts

References

The paper cites 73 references covering segmentation, foundation models, multi-modal learning, and related areas.

Overall Evaluation: This is a high-quality computer vision paper proposing an innovative unified framework for the important problem of open-world segmentation. Despite certain limitations, its technical contributions are clear, experimental results are convincing, and it has significant implications for advancing the field.