Less is More: Token Context-aware Learning for Object Tracking
Xu, Zhong, Liang et al.
Recently, several studies have shown that utilizing contextual information to perceive target states is crucial for object tracking. They typically capture context by incorporating multiple video frames. However, these naive frame-context methods fail to consider the importance of each patch within a reference frame, making them susceptible to noise and redundant tokens, which deteriorates tracking performance. To address this challenge, we propose a new token context-aware tracking pipeline named LMTrack, designed to automatically learn high-quality reference tokens for efficient visual tracking. Embracing the principle of Less is More, the core idea of LMTrack is to analyze the importance distribution of all reference tokens, where important tokens are collected, continually attended to, and updated. Specifically, a novel Token Context Memory module is designed to dynamically collect high-quality spatio-temporal information of a target in an autoregressive manner, eliminating redundant background tokens from the reference frames. Furthermore, an effective Unidirectional Token Attention mechanism is designed to establish dependencies between the reference tokens and the search frame, enabling robust cross-frame association and target localization. Extensive experiments demonstrate the superiority of our tracker, which achieves state-of-the-art results on tracking benchmarks such as GOT-10K, TrackingNet, and LaSOT.
This paper proposes LMTrack, a token context-aware object tracking method. Existing context-aware trackers typically capture context from multiple frames. These frame-level approaches ignore how much the importance of individual patches varies within a reference frame, which leaves them susceptible to noise and redundant tokens. LMTrack follows the principle of "less is more": it analyzes the importance distribution of all reference tokens, then collects, continually attends to, and updates the important ones. The method has two core components, the Token Context Memory (TCM) module and the Unidirectional Token Attention mechanism, and it achieves state-of-the-art performance on multiple tracking benchmarks.
Object tracking aims to locate and follow an arbitrary target through a video sequence, given its initial position. Recent research shows that leveraging contextual information to perceive the target state is crucial for object tracking.
Coarse-grained frame-level context: Existing methods treat the frame as the smallest unit of context, ignoring that different patches within a reference frame contribute unequally to localizing the target in the search frame.
Interference from redundant information: Treating all reference tokens equally increases the model's perception and computational burden, particularly in complex scenarios.
Lack of adaptability: Hand-crafted selection strategies force the tracker to passively accept whole reference frames instead of letting it decide autonomously which target reference information to keep.
By analyzing a simple Transformer tracker, the authors found that most background tokens are rarely referenced during tracking and have minimal impact on the results, while target tokens are largely retained as long-term reference cues. This supports the hypothesis that a small number of high-quality tokens play the key role in the tracking process.
A token context-aware tracking pipeline, LMTrack: Built on the Token Context Memory module, LMTrack automatically collects and updates high-quality token context for visual tracking, unlike existing trackers that rely on frame-level context.
An effective unidirectional attention mechanism: Dependencies between the reference tokens and the search frame are established through unidirectional propagation, which gives robust cross-frame association and localization.
State-of-the-art tracking performance: LMTrack sets new state-of-the-art results on five visual tracking benchmarks: LaSOT, TrackingNet, GOT-10K, LaSOT_ext, and VOT2020.
The task is defined as follows: given the initial position of a target, locate that target in every subsequent frame of a video sequence. The input is a sequence of video frames, and the output is one bounding box for the target per frame.
From frame-level to token-level context: Replaces traditional frame-level context with a fine-grained, token-level representation of the important reference cues.
Adaptive importance analysis: Combines attention matrices and classification results to analyze token importance rather than using fixed strategies.
Unidirectional information flow: Prevents search tokens from contaminating reference token representations, improving fusion efficiency.
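The unidirectional information flow can be illustrated with an attention mask. The following is a minimal sketch, not the paper's implementation: it assumes a single attention head, a token order of reference tokens followed by search tokens, and a masking pattern in which reference queries see only reference keys while search queries see everything. The function names `unidirectional_mask` and `masked_attention` are hypothetical.

```python
import math


def unidirectional_mask(num_ref, num_search):
    """Return mask[i][j], True when query token i may attend to key token j.

    Assumed token order: reference tokens first, then search tokens.
    """
    n = num_ref + num_search
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            query_is_ref = i < num_ref
            key_is_ref = j < num_ref
            # Reference queries see only reference keys;
            # search queries see both reference and search keys.
            mask[i][j] = key_is_ref or not query_is_ref
    return mask


def masked_attention(q, k, v, mask):
    """Single-head scaled dot-product attention over lists of vectors."""
    dim = len(q[0])
    out = []
    for i, qi in enumerate(q):
        # Blocked positions get -inf so that their softmax weight is exactly 0.
        scores = [
            sum(a * b for a, b in zip(qi, kj)) / math.sqrt(dim)
            if mask[i][j] else -math.inf
            for j, kj in enumerate(k)
        ]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        out.append([
            sum(w * vj[c] for w, vj in zip(weights, v))
            for c in range(len(v[0]))
        ])
    return out
```

With this mask, changing a search token leaves the outputs of the reference tokens unchanged, which is the property described above as preventing contamination of the reference representations.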
The paper illustrates how the TCM module extracts important reference tokens over time: most background tokens become unimportant, and the tokens that are retained are primarily those describing the target's appearance.
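The token selection described here can be sketched as a score-and-keep-top-k update. This is only an illustration under stated assumptions: the summary says importance combines attention matrices with classification results but does not give the formula, so the scoring rule below (attention received from search tokens, weighted by each search token's foreground confidence) is a guess made for illustration. The names `token_importance`, `update_memory`, and `capacity` are hypothetical.

```python
def token_importance(attn_search_to_ref, cls_scores):
    """Score each reference token by classification-weighted attention.

    attn_search_to_ref[i][j]: attention from search token i to reference token j.
    cls_scores[i]: foreground confidence of search token i.
    The weighting rule is an assumption of this sketch, not the paper's formula.
    """
    num_ref = len(attn_search_to_ref[0])
    total = sum(cls_scores) or 1.0  # avoid division by zero
    return [
        sum(c * row[j] for c, row in zip(cls_scores, attn_search_to_ref)) / total
        for j in range(num_ref)
    ]


def update_memory(memory, candidates, scores, capacity):
    """Merge stored (token_id, score) pairs with new candidates, keep the best.

    Low-scoring tokens, typically background, fall out of the memory.
    """
    pool = dict(memory)
    for token_id, score in zip(candidates, scores):
        pool[token_id] = max(score, pool.get(token_id, float("-inf")))
    ranked = sorted(pool.items(), key=lambda item: item[1], reverse=True)
    return ranked[:capacity]


# Usage: two search tokens attend to three candidate reference tokens.
attn = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]
scores = token_importance(attn, cls_scores=[0.9, 0.1])
memory = update_memory([("t0", 0.5)], ["a", "b", "c"], scores, capacity=2)
```

Calling `update_memory` once per frame with the previous memory as input gives the autoregressive flavor of the collection process, with the memory size bounded by `capacity`.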
Early methods relied primarily on the initial template. Siamese networks, for example, match the initial target template against candidate regions, but they struggle to adapt to significant changes in target appearance.
Overall Assessment: This is a high-quality computer vision paper. The proposed LMTrack method is strong in both its conceptual contribution and its experimental validation. The "less is more" design philosophy and token-level context awareness open new research directions for object tracking and have both academic and practical value.