
DeRIS: Decoupling Perception and Cognition for Enhanced Referring Image Segmentation through Loopback Synergy

Dai, Cheng, Liu et al.

Referring Image Segmentation (RIS) is a challenging task that aims to segment objects in an image based on natural language expressions. While prior studies have predominantly concentrated on improving vision-language interactions and achieving fine-grained localization, a systematic analysis of the fundamental bottlenecks in existing RIS frameworks is still lacking. To bridge this gap, we propose DeRIS, a novel framework that decomposes RIS into two key components: perception and cognition. This modular decomposition facilitates a systematic analysis of the primary bottlenecks impeding RIS performance. Our findings reveal that the predominant limitation lies not in perceptual deficiencies, but in the insufficient multi-modal cognitive capacity of current models. To mitigate this, we propose a Loopback Synergy mechanism, which strengthens the synergy between the perception and cognition modules, enabling precise segmentation while also making image-text comprehension more robust. Additionally, we analyze and introduce a simple non-referent sample conversion data augmentation to address the long-tail distribution issue related to target existence judgement in general scenarios. Notably, DeRIS adapts to both non-referent and multi-referent scenarios without requiring specialized architectural modifications, which broadens its applicability. The code and models are available at https://github.com/Dmmm1997/DeRIS.



Basic Information

Paper ID: 2507.01738
Title: DeRIS: Decoupling Perception and Cognition for Enhanced Referring Image Segmentation through Loopback Synergy
Authors: Ming Dai, Wenxuan Cheng, Jiang-jiang Liu, Sen Yang, Wenxiao Cai, Yanpeng Sun, Wankou Yang
Affiliations: Southeast University, Baidu VIS, Stanford University
Category: cs.CV
Published: October 13, 2025 (arXiv v2)
Paper link: https://arxiv.org/abs/2507.01738v2

Abstract

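Referring Image Segmentation (RIS) is a challenging task that aims to segment the target objects in an image based on a natural language expression. Prior work has focused mainly on improving vision-language interaction and achieving fine-grained localization, while a systematic analysis of the fundamental bottlenecks in existing RIS frameworks is still lacking. To fill this gap, the paper proposes DeRIS, a new framework that decomposes RIS into two key components: perception and cognition. This modular decomposition makes it possible to analyze systematically the main bottlenecks that hold back RIS performance. The study finds that the main limitation lies not in perceptual deficiencies but in the insufficient multi-modal cognitive capacity of current models. To mitigate this, the paper proposes a Loopback Synergy mechanism that strengthens the synergy between the perception and cognition modules, enabling precise segmentation while also making image-text comprehension more robust.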

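Perception-centric methods: rely on hierarchical backbones to preserve fine-grained spatial information, but because downstream datasets offer limited diversity, their multi-modal fusion modules have weak content-level cognition.
Cognition-centric methods: leverage large-scale vision-language pre-trained models to strengthen multi-modal understanding, but because of the quadratic computational complexity of the Transformer architecture, they lose fine-grained spatial information under high-resolution inputs.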
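
To make the quadratic-complexity point concrete, here is a rough back-of-the-envelope sketch, not taken from the paper. It assumes ViT-style non-overlapping 16x16 patches and full self-attention over all patch tokens; the helper names `num_tokens` and `attention_pairs` and the chosen resolutions are hypothetical and serve only as an illustration.

```python
# Illustrative only: patch size 16 and the resolutions below are assumptions,
# not values reported for DeRIS or any specific model.

def num_tokens(height: int, width: int, patch: int = 16) -> int:
    """Number of non-overlapping patch tokens for an image (ViT-style patching)."""
    return (height // patch) * (width // patch)


def attention_pairs(n_tokens: int) -> int:
    """Token-pair interactions in one full self-attention layer: N * N."""
    return n_tokens * n_tokens


if __name__ == "__main__":
    for side in (224, 448, 896):
        n = num_tokens(side, side)
        print(f"{side}x{side}: {n} tokens, {attention_pairs(n):,} attention pairs")
```

Under these assumptions, doubling the image side length quadruples the token count and multiplies the number of attention pairs by 16, which is one way to see why such models tend to be run at reduced resolution, at the expense of fine spatial detail.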