
Bootstrapping Referring Multi-Object Tracking

Zhang, Wu, Han et al.
Referring understanding is a fundamental task that bridges natural language and visual content by localizing objects described in free-form expressions. However, existing works are constrained by limited language expressiveness, lacking the capacity to model object dynamics in spatial numbers and temporal states. To address these limitations, we introduce a new and general referring understanding task, termed referring multi-object tracking (RMOT). Its core idea is to employ a language expression as a semantic cue to guide the prediction of multi-object tracking, comprehensively accounting for variations in object quantity and temporal semantics. Along with RMOT, we introduce an RMOT benchmark named Refer-KITTI-V2, featuring scalable and diverse language expressions. To efficiently generate high-quality annotations covering object dynamics with minimal manual effort, we propose a semi-automatic labeling pipeline that formulates a total of 9,758 language prompts. In addition, we propose TempRMOT, an elegant end-to-end Transformer-based framework for RMOT. At its core is a query-driven Temporal Enhancement Module that represents each object as a Transformer query, enabling long-term spatial-temporal interactions with other objects and past frames to efficiently refine these queries. TempRMOT achieves state-of-the-art performance on both Refer-KITTI and Refer-KITTI-V2, demonstrating the effectiveness of our approach. The source code and dataset are available at https://github.com/zyn213/TempRMOT.

Basic Information

  • Paper ID: 2406.05039
  • Title: Referring Multi-Object Tracking with Comprehensive Dynamic Expressions
  • Authors: Yani Zhang, Dongming Wu, Wencheng Han, Xingping Dong, Shengcai Liao, Bo Du
  • Classification: cs.CV cs.CL
  • Publication Date: October 27, 2025 (arXiv v2)
  • Paper Link: https://arxiv.org/abs/2406.05039
  • Code and Dataset: https://github.com/zyn213/TempRMOT

Abstract

This paper proposes a novel video understanding task, Referring Multi-Object Tracking (RMOT), which uses natural language expressions as semantic cues to guide multi-object tracking predictions while accounting for variations in target quantity and temporal semantics. The paper constructs the Refer-KITTI-V2 benchmark dataset containing 9,758 diverse language expressions and proposes the TempRMOT framework, which achieves long-term spatiotemporal interaction through a query-driven temporal enhancement module. TempRMOT achieves state-of-the-art performance on both Refer-KITTI and Refer-KITTI-V2.

Research Background and Motivation

Problems to Address

Existing referring understanding tasks have two core limitations:

  1. Single-target constraint: Existing datasets (e.g., RefCOCO series, Refer-DAVIS17) annotate only a single target per expression, while in real scenarios one expression may refer to zero, one, or multiple targets
  2. Temporal consistency absence: Existing methods cannot model temporal consistency between language expressions and target evolution states. For example, the expression "a car turning" describes an instantaneous state, but annotations would continue tracking the target even after the turning action is completed

Problem Significance

  • Language-guided video understanding is a key task connecting natural language with visual content
  • In practical applications such as autonomous driving, it is necessary to simultaneously track multiple dynamic targets through natural language instructions
  • Accurately modeling temporal dynamics is crucial for understanding motion-related semantics

Limitations of Existing Methods

  1. Dataset level:
    • Manual annotation combined with fixed templates, limited language diversity
    • Severe semantic redundancy (e.g., Refer-Dance has only 48 unique expressions)
    • Lack of implicit expressions and complex semantics (e.g., negation descriptions)
  2. Method level:
    • Two-stage methods have high complexity and computational overhead
    • Single-stage methods primarily focus on adjacent frames, lacking long-term temporal modeling capability

Core Contributions

  1. Proposes RMOT task: First systematically extends referring understanding to multi-target dynamic scenarios while considering temporal state changes
  2. Constructs Refer-KITTI-V2 dataset:
    • Contains 9,758 expressions, of which 7,193 are unique, over a vocabulary of 617 distinct words
    • Designs a three-step semi-automatic annotation pipeline combining LLM-generated diverse expressions
    • Includes implicit expressions (e.g., "the ego vehicle is positioned behind the black car")
  3. Proposes TempRMOT framework:
    • End-to-end Transformer architecture without post-processing
    • Query-driven temporal enhancement module enabling long-term spatiotemporal interaction
    • Decouples tracking queries and detection queries to handle variable numbers of targets
  4. Achieves SOTA performance:
    • Approximately 4% HOTA improvement over prior work on Refer-KITTI-V2
    • Achieves 52.21% HOTA on Refer-KITTI
  5. Designs efficient annotation pipeline: Three-step semi-automatic annotation method significantly reduces manual labor

Method Details

Task Definition

Input: Video sequence (T frames) + natural language expression

Output: Bounding boxes and IDs for all targets matching the expression description in each frame

Constraints:

  • Variable number of targets (0 to multiple)
  • Annotation only during time periods when targets satisfy the expression description
  • Maintain temporally consistent ID associations
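
The constraints above can be illustrated on a toy video. This is a minimal sketch, not the paper's data format: the `TrackState` record, its `action` field, and the `referred_targets` helper are hypothetical names introduced only for illustration, and a Python predicate stands in for the language expression.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrackState:
    """One tracked object in one frame (hypothetical toy record)."""
    track_id: int
    category: str
    action: str
    box: tuple  # (x, y, w, h)


def referred_targets(frames, predicate):
    """Return, per frame, a {track_id: box} map of objects matching the expression.

    An object is kept only in the frames where it satisfies the predicate,
    so a frame can hold zero, one, or many targets, and IDs stay consistent.
    """
    return [
        {s.track_id: s.box for s in frame if predicate(s)}
        for frame in frames
    ]


def is_turning_car(state):
    """Stand-in for the expression "a car turning"."""
    return state.category == "car" and state.action == "turning"


# Toy 3-frame video: car 1 turns in frames 0-1, then drives straight.
video = [
    [TrackState(1, "car", "turning", (10, 20, 50, 30)),
     TrackState(2, "car", "parking", (200, 20, 50, 30))],
    [TrackState(1, "car", "turning", (14, 20, 50, 30)),
     TrackState(2, "car", "parking", (200, 20, 50, 30))],
    [TrackState(1, "car", "moving", (18, 20, 50, 30)),
     TrackState(2, "car", "parking", (200, 20, 50, 30))],
]

result = referred_targets(video, is_turning_car)
```

In this toy case car 1 is reported in the first two frames under the same ID, and the last frame has no referred target because the turn has finished.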

Model Architecture

TempRMOT consists of two core components:

1. Transformer-Based RMOT Module

Feature Extractors:

  • Visual encoding: CNN backbone extracts multi-scale features $I^l_t \in \mathbb{R}^{C_l \times H_l \times W_l}$
  • Language encoding: RoBERTa encodes text as word embeddings $S \in \mathbb{R}^{L \times D}$

Cross-modal Encoder (early fusion strategy):

$$Q = W_q(I^l_t + P_V), \quad K = W_k(S + P_L), \quad V = W_v S$$

$$\hat{I}^l_t = \frac{QK^T}{\sqrt{d}}V + I^l_t$$

where $P_V$ and $P_L$ are positional encodings for vision and language respectively. After fusion, the deformable encoder layer processes:

$$E^l_t = \text{DeformEnc}(\hat{I}^l_t)$$
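
The fusion formula above can be traced with a small pure-Python sketch. It follows the formula exactly as written in this summary (scaled dot products, no extra normalization) and is not the authors' implementation. It assumes the visual feature map is flattened to one token per pixel, that visual and word features share one width `d`, and that the projection matrices are random placeholders; all helper names are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def early_fusion(visual, words, pos_v, pos_l, w_q, w_k, w_v):
    """Compute I_hat = (Q K^T / sqrt(d)) V + I for flattened visual tokens."""
    d = len(w_q[0])
    q = matmul(add(visual, pos_v), w_q)   # (num_pixels, d)
    k = matmul(add(words, pos_l), w_k)    # (num_words, d)
    v = matmul(words, w_v)                # (num_words, d)
    scores = matmul(q, transpose(k))      # (num_pixels, num_words)
    scaled = [[s / math.sqrt(d) for s in row] for row in scores]
    return add(matmul(scaled, v), visual)  # residual keeps the visual signal


rng = random.Random(0)
num_pixels, num_words, d = 6, 3, 4
visual = random_matrix(num_pixels, d, rng)
words = random_matrix(num_words, d, rng)
pos_v = random_matrix(num_pixels, d, rng)
pos_l = random_matrix(num_words, d, rng)
w_q, w_k, w_v = (random_matrix(d, d, rng) for _ in range(3))

fused = early_fusion(visual, words, pos_v, pos_l, w_q, w_k, w_v)
```

The output keeps the shape of the visual input, which is what lets the deformable encoder consume it unchanged.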

Decoder (dual-query mechanism):

  • Tracking queries $Q^{tra}_t$: Transformed from previous frame decoder embeddings $D_{t-1}$, used for associating tracked instances
  • Detection queries $Q^{det}$: Randomly initialized, used for detecting newly appeared targets

$$Q_t = \text{Decoder}(E^l_t, \text{concat}(Q^{det}, Q^{tra}_t))$$
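
One way to picture the dual-query mechanism is as identity bookkeeping across frames. The sketch below is an illustration under assumptions, not the paper's code: it assumes decoder outputs are ordered as concat(detection, tracking), that a confident detection query starts a new track, and that a tracking query below the score threshold is dropped. The function name and the reuse of 0.6 as the keep threshold are hypothetical choices.

```python
def propagate_queries(outputs, num_det, prev_ids, next_id, keep_score=0.6):
    """Assign identities to one frame of decoder outputs.

    `outputs` is a list of (embedding, score) pairs ordered as
    concat(detection queries, tracking queries). The first `num_det`
    entries come from detection queries; the rest line up with `prev_ids`.
    Returns the IDs and embeddings carried to the next frame as tracking
    queries, plus the updated ID counter.
    """
    kept_ids, kept_embeddings = [], []
    # Tracking queries keep the identity they carried in.
    for (emb, score), track_id in zip(outputs[num_det:], prev_ids):
        if score >= keep_score:
            kept_ids.append(track_id)
            kept_embeddings.append(emb)
    # Confident detection queries start new tracks.
    for emb, score in outputs[:num_det]:
        if score >= keep_score:
            kept_ids.append(next_id)
            kept_embeddings.append(emb)
            next_id += 1
    return kept_ids, kept_embeddings, next_id


# Frame 1: two detection queries, no tracks yet.
ids1, embs1, next_id = propagate_queries([("a", 0.9), ("b", 0.2)], 2, [], 0)
# Frame 2: two detection queries followed by the tracking query for ID 0.
ids2, embs2, next_id = propagate_queries(
    [("c", 0.8), ("d", 0.1), ("a2", 0.95)], 2, ids1, next_id
)
```

Here the object found in frame 1 keeps ID 0 in frame 2, while the newly appeared object receives ID 1, so the number of targets can grow or shrink from frame to frame.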

Referring head: Contains three branches

  • Classification branch: Binary classification (real target/null object)
  • Bounding box branch: 3-layer FFN for coordinate regression
  • Referring branch: Outputs matching probability with expression

2. Temporal Enhancement Module

Query Memory Mechanism:

  • Maintains an $N \times K$ memory queue (N frames, K objects per frame)
  • Updates following FIFO principle, maintaining constant memory consumption
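
The FIFO behavior described above maps naturally onto a bounded double-ended queue. This is a minimal sketch assuming each frame's queries are stored as one entry; the `QueryMemory` class and the string placeholders for queries are hypothetical.

```python
from collections import deque


class QueryMemory:
    """Fixed-length FIFO buffer of per-frame query sets (toy sketch)."""

    def __init__(self, num_frames):
        # A deque with maxlen drops the oldest frame automatically,
        # so memory use stays constant over the video.
        self._frames = deque(maxlen=num_frames)

    def push(self, frame_queries):
        self._frames.append(frame_queries)

    def history(self):
        """Stored frames, oldest first."""
        return list(self._frames)

    def __len__(self):
        return len(self._frames)


memory = QueryMemory(num_frames=4)
for t in range(6):
    memory.push({"frame": t, "queries": [f"q{t}_{k}" for k in range(2)]})

stored = [entry["frame"] for entry in memory.history()]
```

After six pushes into a four-frame memory, only frames 2 to 5 remain.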

Temporal Decoder (4 layers): Aggregates historical information through cross-frame attention:

$$Q_t = \text{CrossFrameAttn}(Q=Q_t, K=Q_{t-\tau_h:t}, V=Q_{t-\tau_h:t}, PE=\text{Pos}(t-\tau_h:t))$$

where $\tau_h$ is the temporal window size and $\text{Pos}$ encodes temporal positions.
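
For intuition, cross-frame attention for a single object can be sketched as follows. The summary does not spell out the attention internals, so this assumes standard softmax-normalized scaled dot-product attention, a sinusoidal encoding for Pos, and a single head without learned projections; all function names are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_position(offset, dim):
    """Sinusoidal encoding of a frame offset (an assumed choice of Pos)."""
    return [
        math.sin(offset / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(offset / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]


def cross_frame_attention(current, history):
    """Attend from an object's current query to its queries in the window.

    `history` lists the object's queries from oldest to newest and, as in
    the formula, includes the current frame as its last entry.
    """
    dim = len(current)
    keys = [
        [h + p for h, p in zip(past, temporal_position(offset, dim))]
        for offset, past in enumerate(history)
    ]
    scores = [
        sum(c * k for c, k in zip(current, key)) / math.sqrt(dim)
        for key in keys
    ]
    weights = softmax(scores)
    enhanced = [
        sum(w * past[i] for w, past in zip(weights, history))
        for i in range(dim)
    ]
    return enhanced, weights


history = [[0.2, 0.1, 0.0, 0.3], [0.3, 0.1, 0.0, 0.2], [0.4, 0.2, 0.1, 0.2]]
enhanced, weights = cross_frame_attention(history[-1], history)
```

Because attention runs over compact per-object queries rather than full feature maps, the cost grows with the window length and object count, not with image resolution.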

Object Decoder (4 layers): Models spatial interaction through cross-object attention:

$$Q_t = \text{CrossObjectAttn}(Q, K, V=Q_t, PE=\text{Pos}(O_{1:N_t}))$$

Trajectory Refinement: Uses MLP to predict residual adjustments:

$$B_t = B_t + \text{MLP}(Q^S_t)$$

where $Q^S_t$ is the spatiotemporally enhanced query feature.

Technical Innovations

  1. Early cross-modal fusion: Compared to MDETR's dense connections, employs efficient attention-weighted strategy, reducing computational complexity
  2. Dual-query decoupling design:
    • Tracking queries inherit historical information, ensuring ID consistency
    • Detection queries handle new targets, improving flexibility
  3. Query-driven temporal modeling:
    • Uses compact query representations rather than raw features for temporal aggregation
    • Separates temporal and spatial attention mechanisms
    • Supports long-term dependencies (up to 8 frames of history)
  4. End-to-end differentiability: No need for NMS post-processing, directly outputs final results

Experimental Setup

Datasets

Refer-KITTI:

  • 18 videos, 895 expressions
  • Training set: 15 videos/660 expressions
  • Test set: 3 videos/158 expressions

Refer-KITTI-V2:

  • 21 videos, 9,758 expressions
  • Training set: 17 videos/8,873 expressions
  • Test set: 4 videos/897 expressions
  • Features: 7,193 unique expressions, a vocabulary of 617 distinct words, includes implicit expressions

KITTI: Used for evaluating general MOT capability

Dataset Construction Pipeline

Step 1: Language Item Collection

  • Annotate basic attributes: category (car/people), color (black/red), position (left/right), action (moving/turning)
  • Automatically propagate annotations using KITTI instance IDs

Step 2: Expression Generation

  • Combine language items using predefined templates
  • Example: "{color}-{action}-cars" → "black turning cars"
  • Associate bounding boxes through AND operations

Step 3: Expression Expansion

  • Use GPT-3.5 to generate 4 semantically equivalent paraphrases for each expression
  • Two-stage verification: LLM verification + manual review
  • Expand from 2,719 to 9,758 expressions

Evaluation Metrics

HOTA (Higher Order Tracking Accuracy): $\text{HOTA} = \sqrt{\text{DetA} \cdot \text{AssA}}$

  • DetA (Detection Accuracy): Frame-level detection IoU score
  • AssA (Association Accuracy): Temporal association IoU score
  • Other metrics: DetRe, DetPr, AssRe, AssPr, LocA
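
As a quick sanity check of the formula, the geometric mean can be applied to the TempRMOT scores reported in this summary. The helper name below is hypothetical. The official HOTA computes the geometric mean at each localization threshold and then averages over thresholds, so applying the formula to the already averaged DetA and AssA only approximates the reported HOTA.

```python
import math


def hota_from_components(det_a, ass_a):
    """Geometric mean of detection and association accuracy."""
    return math.sqrt(det_a * ass_a)


# (DetA, AssA, reported HOTA) for TempRMOT, taken from the result tables
# in this summary.
reported = {
    "Refer-KITTI": (40.95, 66.75, 52.21),
    "Refer-KITTI-V2": (22.97, 53.58, 35.04),
}

approximations = {
    name: hota_from_components(det_a, ass_a)
    for name, (det_a, ass_a, _) in reported.items()
}

for name, (_, _, hota) in reported.items():
    print(f"{name}: sqrt(DetA*AssA) = {approximations[name]:.2f}, reported HOTA = {hota:.2f}")
```

The approximations land within about 0.1 of the reported values, which also shows why a low DetA (as on Refer-KITTI-V2) pulls HOTA down even when AssA stays near 50.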

Comparison Methods

Two-stage methods:

  • FairMOT, DeepSORT, ByteTrack, CStrack
  • TransTrack, TrackFormer
  • iKUN

Single-stage methods:

  • EchoTrack, DeepRMOT
  • TransRMOT (prior work)
  • MLS-Track

Implementation Details

  • Backbone: ResNet-50 (vision) + RoBERTa (text)
  • Optimizer: Adam, learning rate 1e-5 (backbone 1e-5)
  • Training: 60 epochs, batch size=1, 4×RTX 4090
  • Data augmentation: Random cropping, multi-scale (800-1536)
  • Memory length: Refer-KITTI N=4, Refer-KITTI-V2 N=5
  • Inference threshold: Classification 0.6, referring 0.4
  • Loss weights: $\lambda^D_{cls}=5$, $\lambda^D_{L1}=2$, $\lambda^D_{giou}=2$, $\lambda^D_{ref}=2$
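
The inference thresholds listed above suggest a simple filtering step over the referring head outputs. This is a minimal sketch assuming each prediction carries a classification score and a referring score and that both must exceed their thresholds; the strict comparison, the dictionary keys, and the function name are assumptions, not details confirmed by the paper.

```python
def select_referred(predictions, cls_threshold=0.6, ref_threshold=0.4):
    """Keep predictions that are real objects AND match the expression."""
    return [
        p for p in predictions
        if p["cls_score"] > cls_threshold and p["ref_score"] > ref_threshold
    ]


predictions = [
    {"track_id": 1, "cls_score": 0.92, "ref_score": 0.81},  # referred object
    {"track_id": 2, "cls_score": 0.88, "ref_score": 0.12},  # visible, not referred
    {"track_id": 3, "cls_score": 0.30, "ref_score": 0.75},  # likely a null query
]

kept_ids = [p["track_id"] for p in select_referred(predictions)]
```

Only track 1 survives here: track 2 is a real object that does not match the expression, and track 3 fails the classification check.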

Experimental Results

Main Results

Refer-KITTI Performance:

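Method | E2E | HOTA | DetA | AssA | DetRe | DetPr
iKUN | no | 48.84 | 35.74 | 66.80 | 51.97 | 52.25
TransRMOT | yes | 46.56 | 37.97 | 57.33 | 49.69 | 60.10
MLS-Track | yes | 49.05 | 40.03 | 60.25 | 59.07 | 54.18
TempRMOT | yes | 52.21 | 40.95 | 66.75 | 55.65 | 59.25
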
  • 3.16% HOTA improvement over MLS-Track
  • Highest HOTA, DetA, and AssA among end-to-end methods; MLS-Track and TransRMOT remain ahead on DetRe and DetPr, respectively

Refer-KITTI-V2 Performance:

Method | HOTA | DetA | AssA
iKUN | 10.32 | 2.17 | 49.77
TransRMOT | 31.00 | 19.40 | 49.68
TempRMOT | 35.04 | 22.97 | 53.58

  • 4.04% HOTA improvement over TransRMOT
  • Validates effectiveness in more complex language scenarios

KITTI Performance:

Method | HOTA | AssA
TransRMOT | 61.52 | 66.51
TempRMOT | 63.47 | 72.04

  • 5.53% AssA improvement, demonstrating effectiveness of temporal modeling

Ablation Studies

Module Effectiveness (Refer-KITTI-V2):

Temp. | Refine | HOTA | DetA | AssA
no | no | 31.00 | 19.40 | 49.68
yes | no | 34.46 | 22.73 | 52.37
yes | yes | 35.04 | 22.97 | 53.58

  • Temporal enhancement module contributes most (+3.46% HOTA)
  • Trajectory refinement provides further improvement (+0.58% HOTA)
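
The gains quoted above are absolute HOTA points, and the arithmetic can be reproduced directly. In this small sketch the mapping of each configuration to its HOTA value follows the ablation description in this summary; the variable names are illustrative.

```python
# (temporal module enabled, refinement enabled) -> HOTA on Refer-KITTI-V2.
ablation_hota = {
    (False, False): 31.00,
    (True, False): 34.46,
    (True, True): 35.04,
}

baseline = ablation_hota[(False, False)]
temporal_gain = round(ablation_hota[(True, False)] - baseline, 2)
refine_gain = round(ablation_hota[(True, True)] - ablation_hota[(True, False)], 2)
total_gain = round(ablation_hota[(True, True)] - baseline, 2)

# The same total expressed relative to the baseline score, in percent.
relative_gain = round(100 * total_gain / baseline, 1)
```

Read this way, the 4.04-point total corresponds to roughly a 13% relative improvement over the baseline, with most of it coming from the temporal enhancement module.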

Training Memory Length:

$N_t$ | HOTA | DetA | AssA
3 | 33.64 | 21.96 | 51.66
4 | 34.41 | 22.43 | 52.90
5 | 34.72 | 22.59 | 53.49

  • Longer historical context brings continuous improvement

Inference Memory Length:

$N_i$ | HOTA | DetA | AssA
5 | 34.72 | 22.59 | 53.49
6 | 34.78 | 22.73 | 53.32
8 | 35.04 | 22.97 | 53.58

  • Using longer memory during inference provides further improvement
  • Demonstrates generalization capability of temporal module

Case Analysis

Motion Understanding Capability:

  • Instruction "left cars which are parking": TempRMOT correctly identifies stationary vehicles, TransRMOT mistakenly marks pedestrians as parking
  • Instruction "right persons who are walking": TempRMOT accurately understands motion state

Robust Tracking Capability:

  • Instruction "cars in front of ours": TransRMOT exhibits ID switches and tracking loss, TempRMOT maintains consistent ID association

Complex Semantic Understanding:

  • Handles implicit expressions "the ego car is positioned after the black cars"
  • Understands negation descriptions "pedestrians lacking hair"
  • Combines multiple attributes "the men are on the right side and they have t-shirts on"

Experimental Findings

  1. Importance of temporal modeling: Significant AssA improvement (+5.53%) demonstrates that long-term temporal dependencies are crucial for tracking quality
  2. End-to-end advantages: Single-stage methods generally outperform two-stage methods; joint optimization is more effective
  3. Language complexity impact: Performance decrease on Refer-KITTI-V2 reflects challenges posed by richer semantics
  4. Memory mechanism generalization: Inference can use longer historical windows than training
  5. Query representation efficiency: Query representations are more compact than raw features while preserving key information

RMOT Benchmark Datasets

Limitations of existing datasets:

  • RefCOCO series: Images only, single target
  • Talk2Car, VID-Sentence: Videos but single target
  • Refer-DAVIS17, Refer-YV: Pixel-level segmentation, single target

RMOT dataset comparison:

Dataset | Videos | Vocabulary | Expressions | Unique Expressions | Implicit Expressions
Refer-KITTI | 18 | 49 | 895 | 215 | no
GroOT* | 14 | 260 | 1,547 | 1,161 | no
Refer-Dance | 65 | 25 | 1,985 | 48 | no
Refer-KITTI-V2 | 21 | 617 | 9,758 | 7,193 | yes

RMOT Methods

Two-stage methods:

  • Extract trajectories first, then match expressions
  • Advantages: Fine-grained processing
  • Disadvantages: High complexity, large computational overhead

Single-stage methods:

  • End-to-end Transformer framework
  • TransRMOT: First RMOT model
  • Limitations: Primarily focus on adjacent frames, lack long-term modeling

Query-Driven Temporal Modeling

Related work:

  • MeMOT: Memory module storing historical queries
  • MeMOTR: Temporal context-enhanced tracking queries
  • BEVFormer: Spatiotemporal Transformer for BEV representation

Innovations in this paper:

  • Focus on language-conditioned video understanding
  • Separate temporal and spatial attention
  • Joint reasoning combining current frame spatial features

Conclusions and Discussion

Main Conclusions

  1. RMOT task is more general: Overcomes single-target limitations, considers temporal dynamics, better aligns with real-world requirements
  2. Refer-KITTI-V2 is high-quality: Through semi-automatic pipeline and LLM, achieves balance between scale and diversity
  3. TempRMOT is effective: Temporal enhancement module significantly improves performance, achieving SOTA on both benchmarks
  4. Long-term dependencies are key: Explicit modeling of spatiotemporal interaction is crucial for accurate tracking and semantic alignment

Limitations

  1. Dataset scale: Although expressions are diverse, the video count (21) is relatively small, which restricts scene diversity
  2. Computational complexity: While query representations reduce overhead, multi-frame memory still requires additional computation
  3. Language understanding depth: Challenges remain for extremely complex logical reasoning (e.g., multiple negations, complex causal relationships)
  4. Occlusion handling: Paper lacks detailed discussion of strategies for severe occlusion scenarios
  5. Real-time performance: FPS and other real-time metrics not reported; practical deployment feasibility unclear
  6. Generalization capability: Validation only on KITTI scenarios (driving scenes); generalization to other domains (e.g., pedestrians, sports) unknown

Future Directions

  1. Extend to more scenarios: Construct RMOT datasets covering more domains
  2. Improve real-time performance: Optimize model structure for real-time tracking
  3. Enhance language understanding: Incorporate stronger language models (e.g., GPT-4)
  4. 3D extension: Combine with point cloud data, extend to 3D RMOT
  5. Interactive tracking: Support real-time user corrections and feedback

In-Depth Evaluation

Strengths

1. Task definition is forward-looking

  • RMOT task fills the gap of multi-target + temporal dynamics
  • Temporal consistency modeling (e.g., instantaneous state of "turning") is highly practical
  • Provides new paradigm for language-guided autonomous driving

2. Dataset construction is scientifically efficient

  • Three-step semi-automatic pipeline balances quality and efficiency
  • LLM-assisted generation significantly improves diversity (7,193 unique expressions)
  • Introduction of implicit expressions increases challenge and realism

3. Method design is reasonable

  • Early fusion strategy reduces computational complexity
  • Dual-query decoupling design balances historical association and new target detection
  • Separating temporal and spatial attention keeps the mechanism clear and effective

4. Experiments are comprehensive

  • Validation on three datasets
  • Detailed ablation studies quantify module contributions
  • Rich visualization cases demonstrate model capabilities

5. Writing is clear

  • Logical progression from motivation to method to experiments
  • Rich figures (10 figures, 5 tables), high information density
  • Complete technical details enable reproducibility

Weaknesses

1. Dataset limitations

  • Few videos (21), single scene (driving only)
  • Although expressions are numerous, they derive from a limited set of language-item combinations, so deep semantic diversity remains limited
  • Lacks challenging scenarios (extreme weather, nighttime, etc.)

2. Method limitations

  • Fixed memory length (N=5), cannot adapt dynamically
  • Does not handle expression ambiguity (e.g., "left car" ambiguous from different viewpoints)
  • Lacks uncertainty estimation; cannot quantify prediction confidence

3. Experimental insufficiencies

  • Inference speed (FPS) not reported; real-time performance unclear
  • Lacks cross-dataset generalization experiments (e.g., testing on Refer-Dance)
  • No comparison with latest vision-language models (e.g., CLIP, BLIP-2)
  • Insufficient error analysis; main failure modes not documented

4. Missing theoretical analysis

  • No theoretical explanation for why temporal modeling is effective
  • Lacks visualization of attention weights
  • No discussion of model learning dynamics and convergence

5. Insufficient discussion of societal impact

  • Privacy concerns not discussed (ethical issues of pedestrian tracking)
  • Potential biases not analyzed (recognition bias for specific populations)

Impact

Contributions to the field:

  • Task level: The RMOT task is likely to become an important direction in video understanding; multiple follow-up works already cite it
  • Data level: Refer-KITTI-V2 provides high-quality benchmark for community; open-sourced code and data promote research
  • Method level: Temporal enhancement module design is transferable to other video tasks

Practical value:

  • Autonomous driving: Supports language-instructed vehicle control ("follow the red car ahead")
  • Intelligent surveillance: Multi-target retrieval based on descriptions ("pedestrian in red clothes")
  • Human-computer interaction: Natural language-guided video editing

Reproducibility:

  • Code and dataset open-sourced (https://github.com/zyn213/TempRMOT)
  • Complete implementation details (hyperparameters, training strategies)
  • Based on mature framework (Deformable DETR), easy to reproduce

Expected impact:

  • Short-term (1-2 years): Inspire more RMOT datasets and methods
  • Medium-term (3-5 years): Integration with large language models for stronger semantic understanding
  • Long-term (5+ years): Become standard component of multimodal autonomous driving systems

Applicable Scenarios

Most suitable scenarios:

  1. Autonomous driving: Language-instructed vehicle tracking and path planning
  2. Intelligent transportation: Traffic participant detection based on descriptions ("illegally parked vehicles")
  3. Video surveillance: Natural language query-based target retrieval
  4. Robot navigation: Language-guided target following

Less suitable scenarios:

  1. High-speed scenarios: Current method may not meet real-time requirements
  2. Severe occlusion: Tracking under heavy occlusion remains challenging
  3. Open-domain scenarios: Training data limited to driving scenes; generalization unverified
  4. Fine-grained descriptions: May struggle with extremely detailed appearance descriptions (e.g., "person in blue striped shirt")

Improvement suggestions:

  • Extend to more scenarios (indoor, sports, social activities)
  • Optimize model for real-time performance
  • Introduce active learning for few-shot adaptation to new scenes

References

Key Citations

RMOT-related:

  1. Wu et al. (2023) - TransRMOT: First RMOT method and Refer-KITTI dataset
  2. Du et al. (2024) - iKUN: Training-free tracker
  3. Ma et al. (2024) - MLS-Track: Multi-level semantic interaction

Transformer tracking:

  4. Zeng et al. (2022) - MOTR: End-to-end multi-object tracking
  5. Zhu et al. (2020) - Deformable DETR: Deformable attention
  6. Gao & Wang (2023) - MeMOTR: Long-term memory-enhanced tracking

Referring understanding:

  7. Yu et al. (2016) - RefCOCO series datasets
  8. Kamath et al. (2021) - MDETR: Multimodal detection

Evaluation metrics:

  9. Luiten et al. (2020) - HOTA: Higher order tracking accuracy


Overall Assessment: This is a high-quality computer vision paper with substantial innovations in task definition, dataset construction, and method design. The RMOT task has important theoretical significance and practical value; Refer-KITTI-V2 provides valuable resources for the community; TempRMOT framework is well-designed and effective. Main limitations lie in scene constraints and unknown real-time performance. Recommended future work includes extending to more domains and conducting deeper theoretical analysis. This paper is expected to become an important reference in language-guided video understanding.