2025-11-14T19:01:11.711286

Bootstrapping Referring Multi-Object Tracking

Zhang, Wu, Han et al.

Referring understanding is a fundamental task that bridges natural language and visual content by localizing objects described in free-form expressions. However, existing works are constrained by limited language expressiveness, lacking the capacity to model object dynamics in spatial numbers and temporal states. To address these limitations, we introduce a new and general referring understanding task, termed referring multi-object tracking (RMOT). Its core idea is to employ a language expression as a semantic cue to guide the prediction of multi-object tracking, comprehensively accounting for variations in object quantity and temporal semantics. Along with RMOT, we introduce an RMOT benchmark named Refer-KITTI-V2, featuring scalable and diverse language expressions. To efficiently generate high-quality annotations covering object dynamics with minimal manual effort, we propose a semi-automatic labeling pipeline that formulates a total of 9,758 language prompts. In addition, we propose TempRMOT, an elegant end-to-end Transformer-based framework for RMOT. At its core is a query-driven Temporal Enhancement Module that represents each object as a Transformer query, enabling long-term spatial-temporal interactions with other objects and past frames to efficiently refine these queries. TempRMOT achieves state-of-the-art performance on both Refer-KITTI and Refer-KITTI-V2, demonstrating the effectiveness of our approach. The source code and dataset are available at https://github.com/zyn213/TempRMOT.


Basic Information

Paper ID: 2406.05039
Title: Referring Multi-Object Tracking with Comprehensive Dynamic Expressions
Authors: Yani Zhang, Dongming Wu, Wencheng Han, Xingping Dong, Shengcai Liao, Bo Du
Classification: cs.CV cs.CL
Publication Date: October 27, 2025 (arXiv v2)
Paper Link: https://arxiv.org/abs/2406.05039
Code and Dataset: https://github.com/zyn213/TempRMOT

Abstract

This paper proposes a novel video understanding task, Referring Multi-Object Tracking (RMOT), which uses a natural language expression as a semantic cue to guide multi-object tracking predictions while accounting for variations in target quantity and temporal semantics. The paper constructs the Refer-KITTI-V2 benchmark dataset containing 9,758 diverse language expressions and proposes the TempRMOT framework, which achieves long-term spatiotemporal interaction through a query-driven temporal enhancement module. TempRMOT achieves state-of-the-art performance on both Refer-KITTI and Refer-KITTI-V2.

Research Background and Motivation

Problems to Address

Existing referring understanding tasks have two core limitations:

Single-target constraint: Existing datasets (e.g., RefCOCO series, Refer-DAVIS17) annotate only a single target per expression, while in real scenarios, one expression may refer to multiple, single, or zero targets
Temporal consistency absence: Existing methods cannot model temporal consistency between language expressions and target evolution states. For example, the expression "a car turning" describes an instantaneous state, but annotations would continue tracking the target even after the turning action is completed

Problem Significance

Language-guided video understanding is a key task connecting natural language with visual content
In practical applications such as autonomous driving, it is necessary to simultaneously track multiple dynamic targets through natural language instructions
Accurately modeling temporal dynamics is crucial for understanding motion-related semantics

Limitations of Existing Methods

Dataset level:
- Manual annotation combined with fixed templates, limited language diversity
- Severe semantic redundancy (e.g., Refer-Dance has only 48 unique expressions)
- Lack of implicit expressions and complex semantics (e.g., negation descriptions)
Method level:
- Two-stage methods have high complexity and computational overhead
- Single-stage methods primarily focus on adjacent frames, lacking long-term temporal modeling capability

Core Contributions

Proposes RMOT task: First systematically extends referring understanding to multi-target dynamic scenarios while considering temporal state changes
Constructs Refer-KITTI-V2 dataset:
- Contains 9,758 expressions, of which 7,193 are unique, over a vocabulary of 617 distinct words
- Designs a three-step semi-automatic annotation pipeline combining LLM-generated diverse expressions
- Includes implicit expressions (e.g., "the ego vehicle is positioned behind the black car")
Proposes TempRMOT framework:
- End-to-end Transformer architecture without post-processing
- Query-driven temporal enhancement module enabling long-term spatiotemporal interaction
- Decouples tracking queries and detection queries to handle variable numbers of targets
Achieves SOTA performance:
- Roughly 4 HOTA points over prior work on Refer-KITTI-V2 (35.04 vs. 31.00 for TransRMOT)
- Achieves 52.21% HOTA on Refer-KITTI
Designs efficient annotation pipeline: Three-step semi-automatic annotation method significantly reduces manual labor

Method Details

Task Definition

Input: Video sequence (T frames) + natural language expression
Output: Bounding boxes and IDs for all targets matching the expression description in each frame
Constraints:

Variable number of targets (0 to multiple)
Annotation only during time periods when targets satisfy the expression description
Maintain temporally consistent ID associations
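
To make the task definition concrete, the following is a minimal sketch of one possible output representation. The names `Referent`, `RMOTPrediction`, and `track_spans`, and the `(x1, y1, x2, y2)` box convention, are illustrative assumptions and are not taken from the paper or its code. The sketch shows the three constraints together: a frame may hold zero, one, or several referents, an object is listed only in frames where it satisfies the expression, and its ID stays the same across frames.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Assumed box convention: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class Referent:
    track_id: int  # identity that must stay constant across frames
    box: Box


# One list of referents per frame index. An empty list means that no
# object matches the expression in that frame.
RMOTPrediction = Dict[int, List[Referent]]


def track_spans(pred: RMOTPrediction) -> Dict[int, List[int]]:
    """Return, for every track ID, the sorted frames in which it is a referent."""
    spans: Dict[int, List[int]] = {}
    for frame in sorted(pred):
        for ref in pred[frame]:
            spans.setdefault(ref.track_id, []).append(frame)
    return spans


# Toy prediction for "a car turning": car 7 is a referent only while it
# turns (frames 1-2), and car 9 only in frame 2.
pred: RMOTPrediction = {
    0: [],
    1: [Referent(7, (10.0, 20.0, 50.0, 60.0))],
    2: [Referent(7, (12.0, 20.0, 52.0, 60.0)),
        Referent(9, (100.0, 30.0, 140.0, 70.0))],
    3: [],
}
spans = track_spans(pred)
```

Here `spans` maps track 7 to frames 1 and 2 and track 9 to frame 2, while frames 0 and 3 legitimately contain no referents.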

Model Architecture

TempRMOT consists of two core components:

1. Transformer-Based RMOT Module

Feature Extractors:

Visual encoding: CNN backbone extracts multi-scale features $I^l_t \in \mathbb{R}^{C_l \times H_l \times W_l}$
Language encoding: RoBERTa encodes text as word embeddings $S \in \mathbb{R}^{L \times D}$

Cross-modal Encoder (early fusion strategy):
$Q = W_q(I^l_t + P_V), \quad K = W_k(S + P_L), \quad V = W_v S$
$\hat{I}^l_t = \frac{QK^T}{\sqrt{d}}V + I^l_t$

where $P_V$ and $P_L$ are positional encodings for vision and language respectively. After fusion, the deformable encoder layer processes: $E^l_t = \text{DeformEnc}(\hat{I}^l_t)$
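
The early-fusion formula can be sketched in plain Python as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the visual feature map is assumed to be already flattened to `(HW, d)` tokens, the linear layers are written as right-multiplications by square matrices, and the helper names (`matmul`, `transpose`, `add`, `early_fusion`) are hypothetical. It follows the formula exactly as written in this summary, so the similarity matrix is scaled by $\sqrt{d}$ but not softmax-normalized.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x m) and b (m x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def add(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum of two matrices with the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def early_fusion(I: Matrix, S: Matrix, P_V: Matrix, P_L: Matrix,
                 W_q: Matrix, W_k: Matrix, W_v: Matrix) -> Matrix:
    """I: (HW, d) visual tokens, S: (L, d) word embeddings. Returns (HW, d)."""
    d = len(I[0])
    Q = matmul(add(I, P_V), W_q)   # queries from position-encoded vision
    K = matmul(add(S, P_L), W_k)   # keys from position-encoded language
    V = matmul(S, W_v)             # values from language
    # Scaled vision-to-word similarity, shape (HW, L); no softmax, as in the formula.
    sim = [[s / math.sqrt(d) for s in row] for row in matmul(Q, transpose(K))]
    # Language features weighted per visual token, plus the residual visual features.
    return add(matmul(sim, V), I)


# Toy check with d = 2, identity projections and zero positional encodings.
eye = [[1.0, 0.0], [0.0, 1.0]]
I_feat = [[1.0, 0.0], [0.0, 1.0]]   # two visual tokens
S_feat = [[2.0, 0.0]]               # one word
fused = early_fusion(I_feat, S_feat,
                     [[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0]],
                     eye, eye, eye)
```

In the toy check, only the first visual token is aligned with the word embedding, so only that token receives a language contribution; the second token passes through unchanged via the residual term.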

Decoder (dual-query mechanism):

Tracking queries $Q^{tra}_t$: Transformed from previous frame decoder embeddings $D_{t-1}$, used for associating tracked instances
Detection queries $Q^{det}$: Randomly initialized, used for detecting newly appeared targets

$Q_t = \text{Decoder}(E^l_t, \text{concat}(Q^{det}, Q^{tra}_t))$

Referring head: Contains three branches

Classification branch: Binary classification (real target/null object)
Bounding box branch: 3-layer FFN for coordinate regression
Referring branch: Outputs matching probability with expression

2. Temporal Enhancement Module

Query Memory Mechanism:

Maintains $N \times K$ memory queue (N frames, K objects per frame)
Updates following FIFO principle, maintaining constant memory consumption
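
A fixed-length FIFO memory of this kind can be sketched with `collections.deque`. The class name `QueryMemory` and its methods are hypothetical and only illustrate the update rule described above: once N frames are stored, pushing a new frame evicts the oldest one, so memory use stays constant.

```python
from collections import deque
from typing import Deque, List

FrameQueries = List[List[float]]  # K query vectors for one frame


class QueryMemory:
    """Keeps the object queries of the most recent `num_frames` frames."""

    def __init__(self, num_frames: int) -> None:
        # deque with maxlen drops the oldest entry automatically (FIFO).
        self.frames: Deque[FrameQueries] = deque(maxlen=num_frames)

    def push(self, frame_queries: FrameQueries) -> None:
        self.frames.append(frame_queries)

    def history(self) -> List[FrameQueries]:
        """Stored frames, oldest first."""
        return list(self.frames)


# Push six frames into a memory of length N = 4; each frame holds one
# toy query whose single value is the frame index.
memory = QueryMemory(num_frames=4)
for t in range(6):
    memory.push([[float(t)]])
stored = memory.history()
```

After the loop, `stored` holds frames 2 to 5 only, because frames 0 and 1 were evicted.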

Temporal Decoder (4 layers): Aggregates historical information through cross-frame attention: $Q_t = \text{CrossFrameAttn}(Q=Q_t, K=Q_{t-\tau_h:t}, V=Q_{t-\tau_h:t}, PE=\text{Pos}(t-\tau_h:t))$

where $\tau_h$ is the temporal window size and $\text{Pos}$ encodes temporal positions.
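
As a rough illustration of cross-frame attention, the sketch below lets the current query of one object attend over that object's stored queries. It is deliberately simplified and rests on several assumptions: a single attention head, no learned projections, no temporal positional encoding, and a residual connection that this summary does not state. The function names are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_frame_attention(query: Vector, history: List[Vector]) -> Tuple[Vector, List[float]]:
    """Attend from the current query to the same object's queries over time.

    `history` holds one vector per frame in the temporal window, including
    the current frame. Returns the enhanced query and the attention weights.
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in history]
    weights = softmax(scores)
    attended = [sum(w * v[i] for w, v in zip(weights, history)) for i in range(d)]
    # Residual connection (an assumption of this sketch).
    enhanced = [q + a for q, a in zip(query, attended)]
    return enhanced, weights


# Two identical history entries receive equal weight.
enhanced, weights = cross_frame_attention([1.0, 0.0], [[1.0, 0.0], [1.0, 0.0]])
```

Because attention runs over compact per-object queries rather than full feature maps, the cost of one update grows with the window length and the number of objects, not with image resolution.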

Object Decoder (4 layers): Models spatial interaction through cross-object attention, with the queries of the current frame serving as query, key, and value: $Q_t = \text{CrossObjectAttn}(Q=K=V=Q_t, PE=\text{Pos}(O_{1:N_t}))$

Trajectory Refinement: Uses MLP to predict residual adjustments: $B_t = B_t + \text{MLP}(Q^S_t)$

where $Q^S_t$ is the spatiotemporally enhanced query feature.

Technical Innovations

Early cross-modal fusion: Compared to MDETR's dense connections, employs efficient attention-weighted strategy, reducing computational complexity
Dual-query decoupling design:
- Tracking queries inherit historical information, ensuring ID consistency
- Detection queries handle new targets, improving flexibility
Query-driven temporal modeling:
- Uses compact query representations rather than raw features for temporal aggregation
- Separates temporal and spatial attention mechanisms
- Supports long-term dependencies (up to 8 frames of history)
End-to-end differentiability: No need for NMS post-processing, directly outputs final results

Experimental Setup

Datasets

Refer-KITTI:

18 videos, 895 expressions
Training set: 15 videos/660 expressions
Test set: 3 videos/158 expressions

Refer-KITTI-V2:

21 videos, 9,758 expressions
Training set: 17 videos/8,873 expressions
Test set: 4 videos/897 expressions
Features: 7,193 unique expressions, a vocabulary of 617 distinct words, includes implicit expressions

KITTI: Used for evaluating general MOT capability

Dataset Construction Pipeline

Step 1: Language Item Collection

Annotate basic attributes: category (car/people), color (black/red), position (left/right), action (moving/turning)
Automatically propagate annotations using KITTI instance IDs

Step 2: Expression Generation

Combine language items using predefined templates
Example: "{color}-{action}-cars" → "black turning cars"
Associate bounding boxes through AND operations

Step 3: Expression Expansion

Use GPT-3.5 to generate 4 semantically equivalent paraphrases for each expression
Two-stage verification: LLM verification + manual review
Expand from 2,719 to 9,758 expressions

Evaluation Metrics

HOTA (Higher Order Tracking Accuracy): $\text{HOTA} = \sqrt{\text{DetA} \cdot \text{AssA}}$

DetA (Detection Accuracy): Frame-level detection IoU score
AssA (Association Accuracy): Temporal association IoU score
Other metrics: DetRe, DetPr, AssRe, AssPr, LocA
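
The geometric-mean relationship can be checked numerically with a short sketch. The function name is hypothetical, and the inputs are the TempRMOT scores reported for Refer-KITTI (DetA 40.95, AssA 66.75). The result is only approximate: HOTA is computed per localization threshold and then averaged over thresholds, so the geometric mean of the already averaged DetA and AssA need not equal the reported HOTA exactly.

```python
import math


def hota_from_components(det_a: float, ass_a: float) -> float:
    """Geometric mean of detection and association accuracy."""
    return math.sqrt(det_a * ass_a)


# TempRMOT on Refer-KITTI: DetA = 40.95, AssA = 66.75, reported HOTA = 52.21.
approx_hota = hota_from_components(40.95, 66.75)
gap = abs(approx_hota - 52.21)
```

The approximation comes out at about 52.28, within 0.1 of the reported 52.21. Because the metric is a geometric mean, a method cannot reach a high HOTA through association alone: iKUN's AssA of 49.77 on Refer-KITTI-V2 still yields a HOTA of only 10.32 given its DetA of 2.17.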

Comparison Methods

Two-stage methods:

FairMOT, DeepSORT, ByteTrack, CStrack
TransTrack, TrackFormer
iKUN

Single-stage methods:

EchoTrack, DeepRMOT
TransRMOT (prior work)
MLS-Track

Implementation Details

Backbone: ResNet-50 (vision) + RoBERTa (text)
Optimizer: Adam, learning rate 1e-5 (backbone 1e-5)
Training: 60 epochs, batch size=1, 4×RTX 4090
Data augmentation: Random cropping, multi-scale (800-1536)
Memory length: Refer-KITTI N=4, Refer-KITTI-V2 N=5
Inference threshold: Classification 0.6, referring 0.4
Loss weights: $\lambda^D_{cls}=5, \lambda^D_{L1}=2, \lambda^D_{giou}=2, \lambda^D_{ref}=2$
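
The two inference thresholds could be applied as in the sketch below. How the authors combine them is not stated in this summary, so the conjunctive rule (a query is kept only if it passes both thresholds), the strict comparison, and the names `QueryOutput` and `select_referents` are all assumptions. The scores in the example are invented.

```python
from typing import List, NamedTuple


class QueryOutput(NamedTuple):
    track_id: int
    cls_score: float  # probability that the query is a real object
    ref_score: float  # probability that the object matches the expression


def select_referents(outputs: List[QueryOutput],
                     cls_thresh: float = 0.6,
                     ref_thresh: float = 0.4) -> List[int]:
    """Return the IDs of queries that are real objects and match the expression."""
    return [
        o.track_id
        for o in outputs
        if o.cls_score > cls_thresh and o.ref_score > ref_thresh
    ]


outputs = [
    QueryOutput(1, 0.95, 0.80),  # real object, matches the expression
    QueryOutput(2, 0.90, 0.10),  # real object, does not match
    QueryOutput(3, 0.30, 0.90),  # likely a null object
]
kept = select_referents(outputs)
```

Under this rule only track 1 is reported as a referent, and a frame in which no query passes both thresholds yields an empty result, which matches the zero-target case in the task definition.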

Experimental Results

Main Results

Refer-KITTI Performance:

Method	E2E	HOTA	DetA	AssA	DetRe	DetPr
iKUN	✗	48.84	35.74	66.80	51.97	52.25
TransRMOT	✓	46.56	37.97	57.33	49.69	60.10
MLS-Track	✓	49.05	40.03	60.25	59.07	54.18
TempRMOT	✓	52.21	40.95	66.75	55.65	59.25

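3.16-point HOTA improvement over MLS-Track
Highest HOTA, DetA, and AssA among end-to-end methods, although MLS-Track has higher DetRe (59.07 vs. 55.65) and TransRMOT higher DetPr (60.10 vs. 59.25)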

Refer-KITTI-V2 Performance:

Method	HOTA	DetA	AssA
iKUN	10.32	2.17	49.77
TransRMOT	31.00	19.40	49.68
TempRMOT	35.04	22.97	53.58

4.04-point HOTA improvement over TransRMOT
Validates effectiveness in more complex language scenarios

KITTI Performance:

Method	HOTA	AssA
TransRMOT	61.52	66.51
TempRMOT	63.47	72.04

5.53-point AssA improvement, demonstrating effectiveness of temporal modeling

Ablation Studies

Module Effectiveness (Refer-KITTI-V2):

Temp.	Refine	HOTA	DetA	AssA
✗	✗	31.00	19.40	49.68
✓	✗	34.46	22.73	52.37
✓	✓	35.04	22.97	53.58

Temporal enhancement module contributes most (+3.46 HOTA points)
Trajectory refinement provides further improvement (+0.58 HOTA points)

Training Memory Length:

$N_t$	HOTA	DetA	AssA
3	33.64	21.96	51.66
4	34.41	22.43	52.90
5	34.72	22.59	53.49

Longer historical context brings continuous improvement

Inference Memory Length:

$N_i$	HOTA	DetA	AssA
5	34.72	22.59	53.49
6	34.78	22.73	53.32
8	35.04	22.97	53.58

Using longer memory during inference provides further improvement
Demonstrates generalization capability of temporal module

Case Analysis

Motion Understanding Capability:

Instruction "left cars which are parking": TempRMOT correctly identifies stationary vehicles, TransRMOT mistakenly marks pedestrians as parking
Instruction "right persons who are walking": TempRMOT accurately understands motion state

Robust Tracking Capability:

Instruction "cars in front of ours": TransRMOT exhibits ID switches and tracking loss, TempRMOT maintains consistent ID association

Complex Semantic Understanding:

Handles implicit expressions "the ego car is positioned after the black cars"
Understands negation descriptions "pedestrians lacking hair"
Combines multiple attributes "the men are on the right side and they have t-shirts on"

Experimental Findings

Importance of temporal modeling: Significant AssA improvement (+5.53%) demonstrates that long-term temporal dependencies are crucial for tracking quality
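End-to-end advantages: TempRMOT outperforms the two-stage iKUN on both Refer-KITTI and Refer-KITTI-V2, supporting joint optimization, although iKUN still exceeds the end-to-end TransRMOT on Refer-KITTI (48.84 vs. 46.56 HOTA)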
Language complexity impact: Performance decrease on Refer-KITTI-V2 reflects challenges posed by richer semantics
Memory mechanism generalization: Inference can use longer historical windows than training
Query representation efficiency: Query representations are more compact than raw features while preserving key information

RMOT Benchmark Datasets

Limitations of existing datasets:

RefCOCO series: Images only, single target
Talk2Car, VID-Sentence: Videos but single target
Refer-DAVIS17, Refer-YV: Pixel-level segmentation, single target

RMOT dataset comparison:

Dataset	Videos	Vocabulary	Expressions	Unique Expressions	Implicit Expressions
Refer-KITTI	18	49	895	215	✗
GroOT*	14	260	1547	1161	✗
Refer-Dance	65	25	1985	48	✗
Refer-KITTI-V2	21	617	9758	7193	✓

RMOT Methods

Two-stage methods:

Extract trajectories first, then match expressions
Advantages: Fine-grained processing
Disadvantages: High complexity, large computational overhead

Single-stage methods:

End-to-end Transformer framework
TransRMOT: First RMOT model
Limitations: Primarily focus on adjacent frames, lack long-term modeling

Query-Driven Temporal Modeling

Related work:

MeMOT: Memory module storing historical queries
MeMOTR: Temporal context-enhanced tracking queries
BEVFormer: Spatiotemporal Transformer for BEV representation

Innovations in this paper:

Focus on language-conditioned video understanding
Separate temporal and spatial attention
Joint reasoning combining current frame spatial features

Conclusions and Discussion

Main Conclusions

RMOT task is more general: Overcomes single-target limitations, considers temporal dynamics, better aligns with real-world requirements
Refer-KITTI-V2 is high-quality: Through semi-automatic pipeline and LLM, achieves balance between scale and diversity
TempRMOT is effective: Temporal enhancement module significantly improves performance, achieving SOTA on both benchmarks
Long-term dependencies are key: Explicit modeling of spatiotemporal interaction is crucial for accurate tracking and semantic alignment

Limitations

Dataset scale: Although expressions are diverse, video count (21) is relatively limited, limiting scene diversity
Computational complexity: While query representations reduce overhead, multi-frame memory still requires additional computation
Language understanding depth: Challenges remain for extremely complex logical reasoning (e.g., multiple negations, complex causal relationships)
Occlusion handling: Paper lacks detailed discussion of strategies for severe occlusion scenarios
Real-time performance: FPS and other real-time metrics not reported; practical deployment feasibility unclear
Generalization capability: Validation only on KITTI scenarios (driving scenes); generalization to other domains (e.g., pedestrians, sports) unknown

Future Directions

Extend to more scenarios: Construct RMOT datasets covering more domains
Improve real-time performance: Optimize model structure for real-time tracking
Enhance language understanding: Incorporate stronger language models (e.g., GPT-4)
3D extension: Combine with point cloud data, extend to 3D RMOT
Interactive tracking: Support real-time user corrections and feedback

In-Depth Evaluation

Strengths

1. Task definition is forward-looking

RMOT task fills the gap of multi-target + temporal dynamics
Temporal consistency modeling (e.g., instantaneous state of "turning") is highly practical
Provides new paradigm for language-guided autonomous driving

2. Dataset construction is scientifically efficient

Three-step semi-automatic pipeline balances quality and efficiency
LLM-assisted generation significantly improves diversity (7,193 unique expressions)
Introduction of implicit expressions increases challenge and realism

3. Method design is reasonable

Early fusion strategy reduces computational complexity
Dual-query decoupling design balances historical association and new target detection
Separation of temporal and spatial attention is clear and effective

4. Experiments are comprehensive

Validation on three datasets
Detailed ablation studies quantify module contributions
Rich visualization cases demonstrate model capabilities

5. Writing is clear

Logical progression from motivation to method to experiments
Rich figures (10 figures, 5 tables), high information density
Complete technical details enable reproducibility

Weaknesses

1. Dataset limitations

Few videos (21), single scene (driving only)
Although many expressions, based on limited language item combinations; limited deep semantic diversity
Lacks challenging scenarios (extreme weather, nighttime, etc.)

2. Method limitations

Memory length is a hand-set hyperparameter (N=4 on Refer-KITTI, N=5 on Refer-KITTI-V2) and cannot adapt dynamically
Does not handle expression ambiguity (e.g., "left car" ambiguous from different viewpoints)
Lacks uncertainty estimation; cannot quantify prediction confidence

3. Experimental insufficiencies

Inference speed (FPS) not reported; real-time performance unclear
Lacks cross-dataset generalization experiments (e.g., testing on Refer-Dance)
No comparison with latest vision-language models (e.g., CLIP, BLIP-2)
Insufficient error analysis; main failure modes not documented

4. Missing theoretical analysis

No theoretical explanation for why temporal modeling is effective
Lacks visualization of attention weights
No discussion of model learning dynamics and convergence

5. Insufficient discussion of societal impact

Privacy concerns not discussed (ethical issues of pedestrian tracking)
Potential biases not analyzed (recognition bias for specific populations)

Impact

Contributions to the field:

Task level: RMOT task will become important direction in video understanding; multiple follow-up works already cite this
Data level: Refer-KITTI-V2 provides high-quality benchmark for community; open-sourced code and data promote research
Method level: Temporal enhancement module design is transferable to other video tasks

Practical value:

Autonomous driving: Supports language-instructed vehicle control ("follow the red car ahead")
Intelligent surveillance: Multi-target retrieval based on descriptions ("pedestrian in red clothes")
Human-computer interaction: Natural language-guided video editing

Reproducibility:

Code and dataset open-sourced (https://github.com/zyn213/TempRMOT)
Complete implementation details (hyperparameters, training strategies)
Based on mature framework (Deformable DETR), easy to reproduce

Expected impact:

Short-term (1-2 years): Inspire more RMOT datasets and methods
Medium-term (3-5 years): Integration with large language models for stronger semantic understanding
Long-term (5+ years): Become standard component of multimodal autonomous driving systems

Applicable Scenarios

Most suitable scenarios:

Autonomous driving: Language-instructed vehicle tracking and path planning
Intelligent transportation: Traffic participant detection based on descriptions ("illegally parked vehicles")
Video surveillance: Natural language query-based target retrieval
Robot navigation: Language-guided target following

Less suitable scenarios:

High-speed scenarios: Current method may not meet real-time requirements
Severe occlusion: Tracking under heavy occlusion remains challenging
Open-domain scenarios: Training data limited to driving scenes; generalization unverified
Fine-grained descriptions: May struggle with extremely detailed appearance descriptions (e.g., "person in blue striped shirt")

Improvement suggestions:

Extend to more scenarios (indoor, sports, social activities)
Optimize model for real-time performance
Introduce active learning for few-shot adaptation to new scenes

References

Key Citations

RMOT-related:

Wu et al. (2023) - TransRMOT: First RMOT method and Refer-KITTI dataset
Du et al. (2024) - iKUN: Training-free tracker
Ma et al. (2024) - MLS-Track: Multi-level semantic interaction

Transformer tracking:

Zeng et al. (2022) - MOTR: End-to-end multi-object tracking
Zhu et al. (2020) - Deformable DETR: Deformable attention
Gao & Wang (2023) - MeMOTR: Long-term memory-enhanced tracking

Referring understanding:

Yu et al. (2016) - RefCOCO series datasets
Kamath et al. (2021) - MDETR: Multimodal detection

Evaluation metrics:

Luiten et al. (2020) - HOTA: Higher order tracking accuracy

Overall Assessment: This is a high-quality computer vision paper with substantial innovations in task definition, dataset construction, and method design. The RMOT task has important theoretical significance and practical value; Refer-KITTI-V2 provides valuable resources for the community; TempRMOT framework is well-designed and effective. Main limitations lie in scene constraints and unknown real-time performance. Recommended future work includes extending to more domains and conducting deeper theoretical analysis. This paper is expected to become an important reference in language-guided video understanding.