Referring understanding is a fundamental task that bridges natural language and visual content by localizing objects described in free-form expressions. However, existing works are constrained by limited language expressiveness, lacking the capacity to model object dynamics in spatial numbers and temporal states. To address these limitations, we introduce a new and general referring understanding task, termed referring multi-object tracking (RMOT). Its core idea is to employ a language expression as a semantic cue to guide the prediction of multi-object tracking, comprehensively accounting for variations in object quantity and temporal semantics. Along with RMOT, we introduce an RMOT benchmark named Refer-KITTI-V2, featuring scalable and diverse language expressions. To efficiently generate high-quality annotations covering object dynamics with minimal manual effort, we propose a semi-automatic labeling pipeline that formulates a total of 9,758 language prompts. In addition, we propose TempRMOT, an elegant end-to-end Transformer-based framework for RMOT. At its core is a query-driven Temporal Enhancement Module that represents each object as a Transformer query, enabling long-term spatial-temporal interactions with other objects and past frames to efficiently refine these queries. TempRMOT achieves state-of-the-art performance on both Refer-KITTI and Refer-KITTI-V2, demonstrating the effectiveness of our approach. The source code and dataset are available at https://github.com/zyn213/TempRMOT.
This paper proposes a novel video understanding task, Referring Multi-Object Tracking (RMOT), which uses a natural language expression as a semantic cue to guide multi-object tracking predictions while accounting for variations in the number of targets and for temporal semantics. The paper constructs the Refer-KITTI-V2 benchmark dataset containing 9,758 diverse language expressions and proposes the TempRMOT framework, which achieves long-term spatiotemporal interaction through a query-driven temporal enhancement module. TempRMOT achieves state-of-the-art performance on both Refer-KITTI and Refer-KITTI-V2.
Existing referring understanding tasks have two core limitations:
Single-target constraint: Existing datasets (e.g., RefCOCO series, Refer-DAVIS17) annotate only a single target per expression, whereas in real scenarios one expression may refer to multiple targets, a single target, or none.
Temporal consistency absence: Existing methods cannot model temporal consistency between a language expression and the evolving state of its targets. For example, the expression "a car turning" describes a transient state, yet existing annotations keep tracking the target even after the turning action is completed.
Language-guided video understanding is a key task connecting natural language with visual content
In practical applications such as autonomous driving, it is necessary to simultaneously track multiple dynamic targets through natural language instructions
Accurately modeling temporal dynamics is crucial for understanding motion-related semantics
Input: Video sequence (T frames) + natural language expression
Output: Bounding boxes and IDs for all targets matching the expression description in each frame
Constraints:
Variable number of targets (0 to multiple)
Annotation only during time periods when targets satisfy the expression description
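The task definition above can be illustrated with a minimal Python sketch. Everything in it is hypothetical and chosen for illustration only: the `Target` type, the `rmot_labels` function, the `(x, y, w, h)` box format, and the toy track data are assumptions, not the paper's data format. It only shows the two constraints at work: a frame may hold zero or several targets, and an object is kept only during the frames in which it matches the expression.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# (x, y, w, h); the box format is an assumption made for this sketch.
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class Target:
    track_id: int
    box: Box


def rmot_labels(
    tracks: Dict[int, Dict[int, Box]],
    active_spans: Dict[int, List[Tuple[int, int]]],
    num_frames: int,
) -> List[List[Target]]:
    """Build per-frame targets, keeping an object only while it matches the expression.

    tracks:       track_id -> {frame_index -> box}
    active_spans: track_id -> inclusive (start, end) frame ranges in which
                  the object satisfies the expression
    """
    labels: List[List[Target]] = []
    for t in range(num_frames):
        frame_targets = []
        for track_id, boxes in sorted(tracks.items()):
            spans = active_spans.get(track_id, [])
            in_span = any(start <= t <= end for start, end in spans)
            if in_span and t in boxes:
                frame_targets.append(Target(track_id, boxes[t]))
        labels.append(frame_targets)  # may be empty: zero targets is valid
    return labels


# Toy data for an expression such as "a car turning".
tracks = {
    1: {0: (10, 10, 5, 5), 1: (12, 10, 5, 5), 2: (14, 10, 5, 5), 3: (16, 10, 5, 5)},
    2: {1: (40, 20, 6, 6), 2: (41, 20, 6, 6)},
}
active_spans = {1: [(1, 2)], 2: [(2, 2)]}

labels = rmot_labels(tracks, active_spans, num_frames=4)
counts = [len(frame) for frame in labels]
print(counts)
```

In this toy case object 1 is visible in all four frames but is labeled only in frames 1 and 2, which is the behavior the "a car turning" example calls for.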
where $P^V$ and $P^L$ are the positional encodings for vision and language, respectively. After fusion, the deformable encoder layer processes the fused features:
$$E_t^l = \mathrm{DeformEnc}(\hat{I}_t^l)$$
Decoder (dual-query mechanism):
Tracking queries $Q_t^{tra}$: Transformed from the previous frame's decoder embeddings $D_{t-1}$, used for associating tracked instances
Detection queries $Q^{det}$: Randomly initialized, used for detecting newly appeared targets
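A minimal sketch of how the two query sets might be combined before decoding is given below. It is not the TempRMOT implementation: the function names, the plain-list vectors, and in particular the identity copy used to turn the previous frame's decoder embeddings into tracking queries are assumptions, since the text here does not specify that transformation.

```python
import random
from typing import List, Tuple

Vector = List[float]


def init_detection_queries(num_queries: int, dim: int, seed: int = 0) -> List[Vector]:
    """Randomly initialized detection queries (fixed seed for reproducibility)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_queries)]


def to_tracking_queries(prev_embeddings: List[Vector]) -> List[Vector]:
    """Placeholder for the transformation of previous-frame decoder embeddings.

    Assumption: an identity copy stands in for the real transformation.
    """
    return [list(embedding) for embedding in prev_embeddings]


def build_decoder_queries(
    prev_embeddings: List[Vector], detection_queries: List[Vector]
) -> Tuple[List[Vector], int]:
    """Concatenate tracking queries (existing instances) and detection queries (new ones).

    Returns the combined list and the number of leading tracking queries,
    so the decoder output can be split back into the two groups.
    """
    tracking = to_tracking_queries(prev_embeddings)
    return tracking + detection_queries, len(tracking)


det = init_detection_queries(num_queries=3, dim=4)

# First frame: nothing is tracked yet, so only detection queries are decoded.
q0, n0 = build_decoder_queries([], det)

# Later frame: two instances carried over from the previous frame.
q1, n1 = build_decoder_queries([[0.1] * 4, [0.2] * 4], det)
print(len(q0), n0, len(q1), n1)
```

Keeping the tracking queries first means an instance's identity follows from its query position, while any detection query that fires corresponds to a newly appeared target.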
Importance of temporal modeling: Significant AssA improvement (+5.53%) demonstrates that long-term temporal dependencies are crucial for tracking quality
End-to-end advantages: Single-stage methods generally outperform two-stage methods; joint optimization is more effective
Language complexity impact: Performance decrease on Refer-KITTI-V2 reflects challenges posed by richer semantics
Memory mechanism generalization: Inference can use longer historical windows than training
Query representation efficiency: Query representations are more compact than raw features while preserving key information
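The points about memory generalization and compact query representations can be sketched as a fixed-length buffer of per-object query vectors. This is an illustrative assumption, not the paper's module: the `QueryMemory` class, the window lengths 2 and 4, and the element-wise mean used as the refinement step are all hypothetical, and the mean is a deliberately simple stand-in for the attention-based interaction the framework describes.

```python
from collections import deque
from typing import Deque, Dict, List

Vector = List[float]


class QueryMemory:
    """Stores the last `window` frames of per-object queries (track_id -> vector)."""

    def __init__(self, window: int) -> None:
        self.window = window
        self.frames: Deque[Dict[int, Vector]] = deque(maxlen=window)

    def push(self, queries: Dict[int, Vector]) -> None:
        # Once the deque is full, the oldest frame is dropped automatically.
        self.frames.append(queries)

    def history(self, track_id: int) -> List[Vector]:
        return [frame[track_id] for frame in self.frames if track_id in frame]

    def refine(self, track_id: int, current: Vector) -> Vector:
        """Stand-in refinement: element-wise mean over stored queries and the current one."""
        vectors = self.history(track_id) + [current]
        return [sum(v[i] for v in vectors) / len(vectors) for i in range(len(current))]


# Hypothetical window lengths: a short one for training, a longer one for inference.
train_memory = QueryMemory(window=2)
infer_memory = QueryMemory(window=4)

for t in range(5):
    frame_queries = {7: [float(t), 1.0]}
    train_memory.push(frame_queries)
    infer_memory.push(frame_queries)

refined = infer_memory.refine(7, [5.0, 1.0])
print(len(train_memory.history(7)), len(infer_memory.history(7)), refined)
```

Because the buffer holds only one small vector per object per frame, lengthening the window at inference changes how much history is stored and nothing else in this sketch, which is one way to read the claim that inference can use a longer window than training.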
Referring understanding:
7. Yu et al. (2016) - RefCOCO series datasets
8. Kamath et al. (2021) - MDETR: Multimodal detection
Evaluation metrics:
9. Luiten et al. (2020) - HOTA: Higher order tracking accuracy
Overall Assessment: This is a high-quality computer vision paper with substantial innovations in task definition, dataset construction, and method design. The RMOT task has important theoretical significance and practical value; Refer-KITTI-V2 provides a valuable resource for the community; and the TempRMOT framework is well-designed and effective. The main limitations are the restricted range of scenes and the unreported real-time performance. Recommended future work includes extending to more domains and conducting deeper theoretical analysis. This paper is expected to become an important reference in language-guided video understanding.