Less is More: Token Context-aware Learning for Object Tracking

Xu, Zhong, Liang et al.

Recently, several studies have shown that utilizing contextual information to perceive target states is crucial for object tracking. They typically capture context by incorporating multiple video frames. However, these naive frame-context methods fail to consider the importance of each patch within a reference frame, making them susceptible to noise and redundant tokens, which deteriorates tracking performance. To address this challenge, we propose a new token context-aware tracking pipeline named LMTrack, designed to automatically learn high-quality reference tokens for efficient visual tracking. Embracing the principle of Less is More, the core idea of LMTrack is to analyze the importance distribution of all reference tokens, where important tokens are collected, continually attended to, and updated. Specifically, a novel Token Context Memory module is designed to dynamically collect high-quality spatio-temporal information of a target in an autoregressive manner, eliminating redundant background tokens from the reference frames. Furthermore, an effective Unidirectional Token Attention mechanism is designed to establish dependencies between reference tokens and search frame, enabling robust cross-frame association and target localization. Extensive experiments demonstrate the superiority of our tracker, achieving state-of-the-art results on tracking benchmarks such as GOT-10K, TrackingNet, and LaSOT.

Basic Information

Paper ID: 2501.00758
Title: Less is More: Token Context-aware Learning for Object Tracking
Authors: Chenlong Xu, Bineng Zhong, Qihua Liang, Yaozong Zheng, Guorong Li, Shuxiang Song
Category: cs.CV (Computer Vision)
Publication Time/Conference: AAAI 2025
Paper Link: https://arxiv.org/abs/2501.00758
Code Link: https://github.com/XuChenLong/LMTrack

Abstract

This paper proposes LMTrack, a novel token context-aware object tracking method. Existing context-aware tracking methods typically capture context through multi-frame information; however, these naive frame-level context approaches overlook the importance variance of different patches within reference frames and are susceptible to noise and redundant tokens. LMTrack follows the principle of "less is more" by analyzing the importance distribution of all reference tokens, collecting, continuously focusing on, and updating important tokens. The method comprises two core components: the Token Context Memory (TCM) module and a unidirectional token attention mechanism, achieving state-of-the-art performance on multiple tracking benchmarks.

Research Background and Motivation

Problem Definition

Object tracking aims to locate an arbitrary target throughout a video sequence, given only its position in the initial frame. Recent research shows that leveraging contextual information to perceive the target state is crucial for object tracking.

Limitations of Existing Methods

Coarse-grained nature of frame-level context: Existing methods treat frames as the minimal unit of context, overlooking the importance variance of different patches within reference frames for target localization in search frames
Interference from redundant information: Treating all reference tokens equally increases the model's perception and computational burden, particularly in complex scenarios
Lack of adaptability: Using hand-crafted strategies makes trackers passively accept reference frames rather than allowing trackers to autonomously decide on target reference information

Research Motivation

An analysis of a simple Transformer tracker shows that most background tokens are rarely referenced during tracking and have minimal impact on results, while target tokens are largely retained as long-term reference cues. This supports the hypothesis that a small number of high-quality tokens play the key role in the tracking process.

Core Contributions

Proposes LMTrack, a novel token context-aware tracking pipeline: Unlike existing frame-level context-based trackers, LMTrack uses the Token Context Memory module to automatically collect and update high-quality token context for visual tracking
Introduces an effective unidirectional attention mechanism: Establishes dependencies between reference tokens and search frames through unidirectional propagation, achieving robust cross-frame association and localization
Achieves state-of-the-art tracking performance: Sets new state-of-the-art results on five visual tracking benchmarks: LaSOT, TrackingNet, GOT-10K, LaSOText, and VOT2020

Method Details

Task Definition

Given an initial target position, continuously locate and track the target in a video sequence. Input consists of video frame sequences, and output comprises bounding boxes of the target in each frame.

Model Architecture

Overall Framework

LMTrack adopts an autoregressive token context-aware tracking framework comprising three main components:

Backbone network with unidirectional attention mechanism
Token Context Memory (TCM) module
Prediction head

Autoregressive Tracking Process

The tracking process is defined as:

R₀ = f(I₀, ∅), t = 0
Bₜ, Rₜ = f(Iₜ, Rₜ₋₁) = f(Iₜ, f(Iₜ₋₁, Rₜ₋₂)), t > 0

where R represents reference tokens, I represents image frames, and B represents predicted bounding boxes.
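The recurrence above can be sketched as a plain loop. This is a minimal illustration, not the authors' implementation: `track_sequence`, `toy_step`, and the `(x, y, w, h)` box layout are hypothetical names and conventions chosen here, and `step` stands in for the model f.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]   # (x, y, w, h); this layout is an assumption
Tokens = List[List[float]]                # one feature vector per reference token


def track_sequence(
    frames: Sequence[object],
    step: Callable[[object, Optional[Tokens]], Tuple[Optional[Box], Tokens]],
) -> List[Box]:
    """Apply B_t, R_t = f(I_t, R_{t-1}) frame by frame."""
    boxes: List[Box] = []
    reference: Optional[Tokens] = None    # plays the role of the empty set at t = 0
    for t, frame in enumerate(frames):
        box, reference = step(frame, reference)
        if t > 0 and box is not None:     # frame 0 only initialises R_0
            boxes.append(box)
    return boxes


def toy_step(frame, reference):
    """Hypothetical stand-in for f: appends one token per frame it has seen."""
    reference = (reference or []) + [[float(frame)]]
    return (float(frame), 0.0, 1.0, 1.0), reference


boxes = track_sequence([0, 1, 2], toy_step)
```

The point of the sketch is that the reference tokens returned for one frame are the only state carried into the next frame, so each prediction depends on the whole history through Rₜ₋₁.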

Token Context Memory (TCM) Module

The TCM module consists of three steps:

Step 1: Collecting important tokens from reference tokens

W = Σⱼ₌₁ᴸ Aⱼ × C
R' = Topk(Rank(R, W))

where A is the cross-attention matrix, C is the classification score map, and W represents the importance distribution.
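As a rough illustration of Step 1, the sketch below scores each reference token and keeps the k highest-scoring ones. It rests on assumptions the summary does not state: each Aⱼ is taken to be a search-by-reference attention matrix from layer j, C a per-search-token classification score, and the product a score-weighted sum over search tokens. The function names are hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def token_importance(
    attn_layers: Sequence[Sequence[Sequence[float]]],
    cls_scores: Sequence[float],
) -> List[float]:
    """W[r] = sum over layers j and search tokens s of C[s] * A_j[s][r]."""
    num_ref = len(attn_layers[0][0])
    importance = [0.0] * num_ref
    for attn in attn_layers:                  # one matrix per layer
        for s, row in enumerate(attn):        # one row per search token
            for r, weight in enumerate(row):  # one column per reference token
                importance[r] += cls_scores[s] * weight
    return importance


def select_top_k(tokens: Sequence[T], importance: Sequence[float], k: int) -> List[T]:
    """Keep the k most important tokens, preserving their original order."""
    ranked = sorted(range(len(tokens)), key=lambda i: importance[i], reverse=True)
    keep = sorted(ranked[:k])
    return [tokens[i] for i in keep]
```

Under this reading, a reference token scores highly only when search positions that the classifier believes to be the target attend to it, which is how background tokens end up being dropped.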

Step 2: Integrating classification map and search tokens

S' = S + C_{bin} E_{target} + (1 - C_{bin}) E_{background}
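A minimal sketch of Step 2 follows. It assumes that C_bin is a per-token 0/1 mask obtained by thresholding the classification scores, and that E_target and E_background are single embedding vectors added to every token of the matching class. The threshold value and the function name are illustrative assumptions, not taken from the paper.

```python
from typing import List, Sequence


def tag_search_tokens(
    search: Sequence[Sequence[float]],
    cls_scores: Sequence[float],
    e_target: Sequence[float],
    e_background: Sequence[float],
    threshold: float = 0.5,  # assumed binarization threshold
) -> List[List[float]]:
    """Compute S' = S + C_bin * E_target + (1 - C_bin) * E_background per token."""
    tagged = []
    for token, score in zip(search, cls_scores):
        c_bin = 1.0 if score >= threshold else 0.0
        tagged.append([
            x + c_bin * et + (1.0 - c_bin) * eb
            for x, et, eb in zip(token, e_target, e_background)
        ])
    return tagged
```

Because C_bin is either 0 or 1, each search token receives exactly one of the two embeddings, which marks it as target or background before it is stored as a reference.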

Step 3: Updating reference tokens

The results of Steps 1 and 2 are merged to form the new reference tokens Rₜ.
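One way to picture the merge is sketched below. It assumes the new reference is simply the retained old tokens followed by the tagged search tokens, and that the number of retained tokens is whatever fits under a fixed length limit. Both the concatenation order and the budget rule are assumptions made for illustration, and `merge_reference` is a hypothetical name.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def merge_reference(
    reference: Sequence[T],
    importance: Sequence[float],
    tagged_search: Sequence[T],
    max_len: int,
) -> List[T]:
    """Keep the most important old tokens that fit, then append the new ones."""
    budget = max_len - len(tagged_search)
    if budget < 0:
        raise ValueError("search tokens alone exceed the reference length limit")
    ranked = sorted(range(len(reference)), key=lambda i: importance[i], reverse=True)
    keep = sorted(ranked[:budget])  # preserve the original token order
    return [reference[i] for i in keep] + list(tagged_search)
```

For example, with three old tokens, two new search tokens, and a limit of four, only the two most important old tokens survive the update.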

Unidirectional Attention Mechanism

S = Softmax([QₛKᵣᵀ; QₛKₛᵀ]/√dₖ)[Vᵣ; Vₛ]

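Search queries attend to the keys and values of both reference and search tokens, with the two score blocks concatenated along the key axis. Reference tokens never attend to search tokens, so information flows only from reference to search and the reference token representations stay consistent.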
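The search-side update in the formula above can be sketched in plain Python for a single attention head. This is an illustrative reading only: the projections that produce Q, K, and V are omitted, and `search_attention` is a hypothetical name.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def search_attention(
    q_s: Sequence[Vector],
    k_r: Sequence[Vector], v_r: Sequence[Vector],
    k_s: Sequence[Vector], v_s: Sequence[Vector],
) -> List[List[float]]:
    """Search queries attend over reference and search keys jointly."""
    d_k = len(q_s[0])
    keys = list(k_r) + list(k_s)      # concatenated along the key axis
    values = list(v_r) + list(v_s)    # stacked in the same order as the keys
    out = []
    for q in q_s:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ])
    return out
```

No reference query appears anywhere in the function, which is the unidirectional property: the search output depends on the reference tokens, but nothing here can modify them.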

Technical Innovations

From frame-level to token-level context: Abandons traditional frame-level context in favor of fine-grained token-level context representation of important reference cues
Adaptive importance analysis: Combines attention matrices and classification results to analyze token importance rather than using fixed strategies
Unidirectional information flow: Prevents search tokens from contaminating reference token representations, improving fusion efficiency

Experimental Setup

Datasets

Training data: LaSOT, GOT-10K, TrackingNet, COCO
Testing benchmarks: GOT-10K (180 test sequences), TrackingNet (511 videos), LaSOT (280 test videos), LaSOText (150 videos), VOT2020 (60 challenge sequences)

Evaluation Metrics

GOT-10K: Average Overlap (AO), Success Rate (SR)
LaSOT/LaSOText: Area Under Curve (AUC), Precision (P), Normalized Precision (PNorm)
TrackingNet: AUC, P, PNorm
VOT2020: Expected Average Overlap (EAO), Accuracy, Robustness
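To make the overlap-based metrics concrete, here is a simplified sketch of IoU, Average Overlap, and Success Rate. It assumes `(x, y, w, h)` boxes and averages over a flat list of frames; official benchmark toolkits apply their own per-sequence or per-class averaging, which this sketch does not reproduce.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h); this layout is an assumption


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def average_overlap(ious: Sequence[float]) -> float:
    """AO: mean IoU over all evaluated frames."""
    return sum(ious) / len(ious)


def success_rate(ious: Sequence[float], threshold: float) -> float:
    """SR: fraction of frames whose IoU exceeds the threshold."""
    return sum(1 for value in ious if value > threshold) / len(ious)
```

GOT-10K conventionally reports SR at thresholds 0.5 and 0.75, so `success_rate(ious, 0.5)` and `success_rate(ious, 0.75)` would correspond to those two columns.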

Implementation Details

Backbone network: ViT-base
Optimizer: AdamW, learning rate 4×10⁻⁵ (backbone), 4×10⁻⁴ (others)
Training: 300 epochs, batch size 16, Tesla A100 GPU
Inference: Reference update checked every 400 frames by default, maximum reference token length is twice the search token length
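The settings listed above could be captured in a small configuration sketch like the following. The parameter-name prefix `backbone.` and all function names are hypothetical. The list-of-dicts layout mirrors the per-parameter-group format that optimizers such as PyTorch's AdamW accept, but parameter names are used here in place of tensors so the example stays dependency-free.

```python
from typing import Dict, List, Sequence

BACKBONE_LR = 4e-5     # learning rate for the ViT-base backbone
OTHER_LR = 4e-4        # learning rate for all other parameters
UPDATE_INTERVAL = 400  # frames between reference update checks


def build_param_groups(
    param_names: Sequence[str], backbone_prefix: str = "backbone."
) -> List[Dict[str, object]]:
    """Split parameters into backbone and non-backbone learning-rate groups."""
    backbone = [n for n in param_names if n.startswith(backbone_prefix)]
    others = [n for n in param_names if not n.startswith(backbone_prefix)]
    return [
        {"params": backbone, "lr": BACKBONE_LR},
        {"params": others, "lr": OTHER_LR},
    ]


def should_check_update(frame_index: int, interval: int = UPDATE_INTERVAL) -> bool:
    """True on every interval-th frame after the first."""
    return frame_index > 0 and frame_index % interval == 0


def max_reference_tokens(num_search_tokens: int) -> int:
    """Reference length is capped at twice the search token length."""
    return 2 * num_search_tokens
```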

Experimental Results

Main Results

GOT-10K Benchmark

LMTrack384 achieves 80.1% AO on GOT-10K, 2.6 percentage points above the previous best method, ARTrackV2 (77.5% AO).

Performance on Other Benchmarks

TrackingNet: 85.7% AUC
LaSOT: 73.2% AUC
LaSOText: 53.6% AUC, 0.7 percentage points above ARTrackV2
VOT2020: 58.6% EAO (LMTrack384), 55.0% EAO (LMTrack256)

Efficiency Comparison

LMTrack vs. SeqTrack at the same resolution (LMTrack figures listed first):

Parameters: 92M vs 89M
Computation: 69G vs 148G FLOPs
Inference speed: 47fps vs 21fps
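The relative differences implied by these figures can be worked out directly. The sketch assumes the first number in each pair belongs to LMTrack and the second to SeqTrack, which matches the later statement that LMTrack is the more efficient of the two.

```python
# Figures copied from the comparison above; LMTrack-first ordering is an assumption.
LMTRACK = {"params_m": 92, "flops_g": 69, "fps": 47}
SEQTRACK = {"params_m": 89, "flops_g": 148, "fps": 21}

flops_ratio = LMTRACK["flops_g"] / SEQTRACK["flops_g"]
speedup = LMTRACK["fps"] / SEQTRACK["fps"]
param_overhead = (LMTRACK["params_m"] - SEQTRACK["params_m"]) / SEQTRACK["params_m"]

print(f"FLOPs ratio:     {flops_ratio:.2f}")     # 0.47
print(f"Speedup:         {speedup:.2f}x")        # 2.24x
print(f"Extra params:    {param_overhead:.1%}")  # 3.4%
```

Under that ordering, LMTrack uses a little under half the computation and runs a little over twice as fast, for roughly 3% more parameters.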

Ablation Study

#	Attention	Autoregressive	Update	AO(%)
1	bidirectional	×	-	73.0
2	unidirectional	×	-	73.9
3	unidirectional	×	update template	74.1
4	unidirectional	×	TCM	75.0
5	unidirectional	✓	update template	75.6
6	unidirectional	✓	TCM	76.3

Key Findings:

Unidirectional attention: Improves AO by 0.9 points over bidirectional attention (row 2 vs. row 1), preventing noise propagation from search to reference
Autoregressive tracking: Improves AO by 1.3-1.5 points over the non-autoregressive variants (rows 5 and 6 vs. rows 3 and 4)
TCM module: Improves AO by 0.7-0.9 points over the template update strategy (rows 4 and 6 vs. rows 3 and 5)
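The gains quoted in the key findings can be checked against the ablation table with a few subtractions. The row pairings below are an interpretation of which configurations each finding compares.

```python
# AO (%) per ablation row, copied from the table above.
ABLATION_AO = {1: 73.0, 2: 73.9, 3: 74.1, 4: 75.0, 5: 75.6, 6: 76.3}


def gain(row_new: int, row_base: int) -> float:
    """AO difference in percentage points, rounded to one decimal place."""
    return round(ABLATION_AO[row_new] - ABLATION_AO[row_base], 1)


unidirectional_gain = gain(2, 1)                # 0.9
autoregressive_gains = (gain(6, 4), gain(5, 3))  # (1.3, 1.5)
tcm_gains = (gain(6, 5), gain(4, 3))             # (0.7, 0.9)
total_gain = gain(6, 1)                          # 3.3
```

Each pair holds everything fixed except the component being tested, so the differences isolate that component's contribution. The full configuration ends 3.3 points above the bidirectional baseline.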

Visualization Analysis

TCM Module Visualization

Demonstrates the process of TCM module extracting important reference tokens over time, with most background tokens becoming unimportant while primarily retaining tokens describing target appearance.

Attention Comparison

Comparison with OSTrack shows that LMTrack using reference tokens better resists appearance changes and distractors, maintaining focus on the target.

Related Work

Traditional Tracking Frameworks

Early methods relied primarily on the initial template, for example Siamese networks that match the initial target template against candidate regions, and so struggled to adapt to significant changes in target appearance.

Temporal Context-aware Tracking

To handle appearance changes, many trackers model visual tracking as an online learning problem:

UpdateNet: Uses custom networks to fuse accumulated templates
ATOM: Adds IoU prediction branch to constrain template selection
STMTrack: Updates dynamic templates at fixed intervals
SeqTrack: Uses likelihood-based strategies to select dynamic templates

Limitations of these methods:

Updating templates based on bounding box cropping easily introduces noise
Using manual methods or additional discriminative models to update templates fails to distinguish which context is important for tracking

Conclusions and Discussion

Main Conclusions

LMTrack significantly improves tracking performance through token-level context awareness
The TCM module effectively collects and updates important reference tokens
The unidirectional attention mechanism improves feature fusion efficiency and accuracy
Achieves state-of-the-art performance on multiple benchmarks while improving computational efficiency

Limitations

Computational complexity: Although more efficient than SeqTrack, the method still requires maintaining and updating reference tokens
Hyperparameter sensitivity: Selection of k value and update frequency may affect performance
Long-term tracking: Reference token management strategies in extremely long sequences require further optimization

Future Directions

Explore more efficient token importance assessment methods
Research adaptive reference token length control strategies
Extend to multi-object tracking scenarios

In-Depth Evaluation

Strengths

Strong innovation: The transition from frame-level to token-level context is an important innovation
Solid theoretical foundation: Experimentally validates the token importance distribution hypothesis
Comprehensive experiments: Thorough evaluation on multiple benchmarks with detailed ablation studies
High practical value: Improves performance while enhancing computational efficiency
Clear visualization: Effectively demonstrates the working principles of the method

Weaknesses

Method complexity: The TCM module design is relatively complex, potentially affecting implementation and tuning
Parameter sensitivity: Multiple hyperparameters (k value, update frequency, etc.) require careful tuning
Insufficient theoretical analysis: Lacks theoretical analysis of method convergence and stability
Limited scope: Primarily targets single-object tracking; applicability to multi-object scenarios remains unverified

Impact

Academic contribution: Provides new research directions for context-aware tracking
Practical value: Maintains high performance while improving efficiency
Reproducibility: Provides complete implementation details and code

Applicable Scenarios

Real-time tracking applications: High inference speed suitable for real-time scenarios
Long-term tracking tasks: Adaptive token management suitable for long-sequence tracking
Complex environment tracking: Effectively handles appearance changes and distractors

References

This paper cites important works in the object tracking field, including:

Siamese network series (SiamRPN++, SiamFC++)
Transformer trackers (TransT, STARK, MixFormer)
Context-aware methods (STMTrack, SeqTrack, OSTrack)
Attention mechanisms (Transformer, ViT)

Overall Assessment: This is a high-quality computer vision paper. The proposed LMTrack method demonstrates excellence in both theoretical innovation and experimental validation. The design philosophy of "less is more" and token-level context awareness provide new research directions for the object tracking field, possessing significant academic value and practical significance.