Less is More: Token Context-aware Learning for Object Tracking

Xu, Zhong, Liang et al.
Recently, several studies have shown that utilizing contextual information to perceive target states is crucial for object tracking. They typically capture context by incorporating multiple video frames. However, these naive frame-context methods fail to consider the importance of each patch within a reference frame, making them susceptible to noise and redundant tokens, which deteriorates tracking performance. To address this challenge, we propose a new token context-aware tracking pipeline named LMTrack, designed to automatically learn high-quality reference tokens for efficient visual tracking. Embracing the principle of Less is More, the core idea of LMTrack is to analyze the importance distribution of all reference tokens, where important tokens are collected, continually attended to, and updated. Specifically, a novel Token Context Memory module is designed to dynamically collect high-quality spatio-temporal information of a target in an autoregressive manner, eliminating redundant background tokens from the reference frames. Furthermore, an effective Unidirectional Token Attention mechanism is designed to establish dependencies between reference tokens and search frame, enabling robust cross-frame association and target localization. Extensive experiments demonstrate the superiority of our tracker, achieving state-of-the-art results on tracking benchmarks such as GOT-10K, TrackingNet, and LaSOT.

Abstract

This paper proposes LMTrack, a novel token context-aware object tracking method. Existing context-aware tracking methods typically capture context through multi-frame information; however, these naive frame-level context approaches overlook the importance variance of different patches within reference frames and are susceptible to noise and redundant tokens. LMTrack follows the principle of "less is more" by analyzing the importance distribution of all reference tokens, collecting, continuously focusing on, and updating important tokens. The method comprises two core components: the Token Context Memory (TCM) module and a unidirectional token attention mechanism, achieving state-of-the-art performance on multiple tracking benchmarks.

Research Background and Motivation

Problem Definition

Object tracking aims to locate and track an arbitrary target throughout a video sequence, given its initial position. Recent research shows that leveraging contextual information to perceive the target state is crucial for object tracking.

Limitations of Existing Methods

  1. Coarse-grained frame-level context: Existing methods treat the frame as the smallest unit of context and overlook how much individual patches within a reference frame differ in importance for localizing the target in the search frame
  2. Interference from redundant information: Treating all reference tokens equally increases the model's perceptual and computational burden, particularly in complex scenarios
  3. Lack of adaptability: Hand-crafted update strategies force the tracker to passively accept reference frames instead of deciding for itself which reference information about the target to keep

Research Motivation

Through analysis of a simple Transformer tracker, it was discovered that most background tokens are rarely referenced during tracking and have minimal impact on results, while target tokens are largely retained as long-term reference cues. This validates the hypothesis that a small number of high-quality tokens play key roles in the tracking process.

Core Contributions

  1. Proposes LMTrack, a novel token context-aware tracking pipeline: Unlike existing frame-level context-based trackers, LMTrack uses the Token Context Memory module to automatically collect and update high-quality token context for visual tracking
  2. Introduces an effective unidirectional attention mechanism: Establishes dependencies between reference tokens and the search frame through unidirectional propagation, achieving robust cross-frame association and localization
  3. Achieves state-of-the-art tracking performance: Sets new best results on five visual tracking benchmarks: LaSOT, TrackingNet, GOT-10K, LaSOT_ext, and VOT2020

Method Details

Task Definition

Given an initial target position, continuously locate and track the target in a video sequence. Input consists of video frame sequences, and output comprises bounding boxes of the target in each frame.

Model Architecture

Overall Framework

LMTrack adopts an autoregressive token context-aware tracking framework comprising three main components:

  • Backbone network with unidirectional attention mechanism
  • Token Context Memory (TCM) module
  • Prediction head

Autoregressive Tracking Process

The tracking process is defined as:

R₀ = f(I₀, ∅), t = 0
Bₜ, Rₜ = f(Iₜ, Rₜ₋₁) = f(Iₜ, f(Iₜ₋₁, Rₜ₋₂)), t > 0

where R represents reference tokens, I represents image frames, and B represents predicted bounding boxes.
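
The recursion above can be sketched as a plain loop. This is a minimal illustration under stated assumptions, not the authors' code: `run_tracker` and `toy_step` are hypothetical names, and `toy_step` is a dummy stand-in for f that returns a fixed box and simply remembers which frames it has seen.

```python
from typing import Callable, List, Optional, Tuple

Box = Tuple[float, float, float, float]
Tokens = List[str]  # placeholder type; real reference tokens are feature vectors


def run_tracker(
    frames: List[str],
    step: Callable[[str, Optional[Tokens]], Tuple[Optional[Box], Tokens]],
) -> List[Box]:
    """Autoregressive loop: the references produced at t-1 feed the step at t."""
    boxes: List[Box] = []
    refs: Optional[Tokens] = None  # plays the role of the empty set at t = 0
    for t, frame in enumerate(frames):
        box, refs = step(frame, refs)
        if t > 0:  # frame 0 only initialises the reference tokens
            assert box is not None
            boxes.append(box)
    return boxes


def toy_step(frame: str, refs: Optional[Tokens]) -> Tuple[Optional[Box], Tokens]:
    """Hypothetical stand-in for f (assumption, for illustration only)."""
    if refs is None:
        return None, [frame]
    return (0.0, 0.0, 1.0, 1.0), refs + [frame]
```

Calling `run_tracker(["I0", "I1", "I2"], toy_step)` yields one box per frame after the first, which mirrors the t = 0 and t > 0 cases of the definition.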

Token Context Memory (TCM) Module

The TCM module consists of three steps:

Step 1: Collecting important tokens from reference tokens

W = Σⱼ₌₁ᴸ Aⱼ × C
R' = Topk(Rank(R, W))

where A is the cross-attention matrix, C is the classification score map, and W represents the importance distribution.
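
A small NumPy sketch may help make Step 1 concrete. The tensor shapes are assumptions, since the summary does not state them: each A_j is taken to be an (N_s, N_r) attention map from search queries to reference keys, C an (N_s,) score per search token, and "×" is read as weighting each search row by its score before summing. The function name `select_reference_tokens` is hypothetical.

```python
import numpy as np


def select_reference_tokens(attn, cls_score, ref_tokens, k):
    """Keep the k reference tokens with the highest importance W.

    attn:       (L, N_s, N_r) attention from search queries to reference keys
    cls_score:  (N_s,) classification score per search token
    ref_tokens: (N_r, D) reference token features
    """
    # Assumed reading of W = sum_j A_j x C: weight each search row by its score,
    # then sum over the L attention maps and over the search positions.
    importance = np.einsum("lsr,s->r", attn, cls_score)
    k = min(k, ref_tokens.shape[0])
    # Rank by importance, highest first, then keep the top k.
    keep = np.argsort(-importance, kind="stable")[:k]
    return ref_tokens[keep], keep, importance
```

Under this reading, a reference token scores highly only when it is attended to by search positions that the classifier believes belong to the target.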

Step 2: Integrating classification map and search tokens

S' = S + C_{bin} E_{target} + (1 - C_{bin}) E_{background}
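
The formula above can be illustrated with a short NumPy sketch. It assumes, without confirmation from the source, that the binary classification map holds one 0/1 value per search token and that the target and background embeddings are single D-dimensional vectors broadcast over tokens. The name `embed_labels` is hypothetical.

```python
import numpy as np


def embed_labels(search_tokens, cls_bin, e_target, e_background):
    """Add a target or background embedding to every search token.

    search_tokens: (N_s, D) search token features S
    cls_bin:       (N_s,) binary map, 1 for target and 0 for background (assumed)
    e_target, e_background: (D,) embedding vectors (assumed shape)
    """
    c = cls_bin[:, None].astype(search_tokens.dtype)  # (N_s, 1) for broadcasting
    return search_tokens + c * e_target + (1.0 - c) * e_background
```

Each token receives exactly one of the two embeddings, so the stored tokens carry an explicit target or background label.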

Step 3: Updating reference tokens

The selected tokens R' from Step 1 and the label-embedded search tokens S' from Step 2 are merged to form the new reference tokens Rₜ.

Unidirectional Attention Mechanism

S = Softmax([QₛKᵣᵀ; QₛKₛᵀ]/√dₖ)[Vᵣ; Vₛ]

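Here the subscripts s and r denote search and reference tokens. Queries come only from the search tokens, while keys and values come from both reference and search tokens, so information flows from reference to search but not back. The reference token representations therefore stay consistent.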
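
The attention formula above can be sketched in NumPy as follows. This is a single-head illustration under assumed shapes, not the authors' implementation: the function names are hypothetical, and the bracketed terms are read as concatenating reference and search keys (and values) along the token axis.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def unidirectional_attention(q_s, k_r, v_r, k_s, v_s):
    """Search queries attend to reference and search keys; references get no update.

    q_s, k_s, v_s: (N_s, d_k) search projections
    k_r, v_r:      (N_r, d_k) reference projections
    """
    d_k = q_s.shape[-1]
    keys = np.concatenate([k_r, k_s], axis=0)    # (N_r + N_s, d_k)
    values = np.concatenate([v_r, v_s], axis=0)  # (N_r + N_s, d_k)
    scores = q_s @ keys.T / np.sqrt(d_k)         # [Q_s K_r^T, Q_s K_s^T] / sqrt(d_k)
    weights = softmax(scores, axis=-1)           # each search row sums to 1
    return weights @ values, weights
```

No query is ever built from the reference tokens, which is what makes the flow one-directional: the function returns updated search tokens only.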

Technical Innovations

  1. From frame-level to token-level context: Abandons traditional frame-level context in favor of fine-grained token-level context representation of important reference cues
  2. Adaptive importance analysis: Combines attention matrices and classification results to analyze token importance rather than using fixed strategies
  3. Unidirectional information flow: Prevents search tokens from contaminating reference token representations, improving fusion efficiency

Experimental Setup

Datasets

  • Training data: LaSOT, GOT-10K, TrackingNet, COCO
  • Testing benchmarks: GOT-10K (180 test sequences), TrackingNet (511 videos), LaSOT (280 test videos), LaSOT_ext (150 videos), VOT2020 (60 challenge sequences)

Evaluation Metrics

  • GOT-10K: Average Overlap (AO), Success Rate (SR)
  • LaSOT/LaSOT_ext: Area Under Curve (AUC), Precision (P), Normalized Precision (P_Norm)
  • TrackingNet: AUC, P, P_Norm
  • VOT2020: Expected Average Overlap (EAO), Accuracy, Robustness
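
As a rough illustration of the overlap-based metrics listed above, the sketch below computes a per-frame Average Overlap and a Success Rate at a threshold from (x, y, w, h) boxes. It is a simplified version: the official benchmark toolkits aggregate over sequences and apply their own protocols, and the function names here are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def average_overlap(preds, gts):
    """Mean IoU between predicted and ground-truth boxes."""
    overlaps = [iou(p, g) for p, g in zip(preds, gts)]
    return sum(overlaps) / len(overlaps)


def success_rate(preds, gts, threshold=0.5):
    """Fraction of frames whose IoU exceeds the threshold."""
    overlaps = [iou(p, g) for p, g in zip(preds, gts)]
    return sum(o > threshold for o in overlaps) / len(overlaps)
```

For example, a perfect prediction on one frame and a box shifted by half its width on another gives overlaps of 1 and 1/3, so the average overlap is 2/3 and the success rate at 0.5 is one half.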

Implementation Details

  • Backbone network: ViT-base
  • Optimizer: AdamW, learning rate 4×10⁻⁵ (backbone), 4×10⁻⁴ (others)
  • Training: 300 epochs, batch size 16, Tesla A100 GPU
  • Inference: Reference update checked every 400 frames by default, maximum reference token length is twice the search token length
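
The two learning rates above are typically expressed as optimizer parameter groups. The helper below is a sketch in pure Python that builds the list-of-dicts format accepted by PyTorch optimizers such as `torch.optim.AdamW`. The function name and the `"backbone."` name prefix are assumptions for illustration; the source does not describe how parameters are named.

```python
def build_param_groups(named_params, backbone_lr=4e-5, other_lr=4e-4,
                       backbone_prefix="backbone."):
    """Split (name, parameter) pairs into backbone and non-backbone groups."""
    backbone, others = [], []
    for name, param in named_params:
        # Assumption: backbone parameters share a common name prefix.
        if name.startswith(backbone_prefix):
            backbone.append(param)
        else:
            others.append(param)
    return [
        {"params": backbone, "lr": backbone_lr},
        {"params": others, "lr": other_lr},
    ]
```

With a real model, the result of `build_param_groups(model.named_parameters())` could be passed as the first argument to the optimizer constructor, where `model` is a hypothetical module whose backbone parameters carry that prefix.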

Experimental Results

Main Results

GOT-10K Benchmark

LMTrack-384 achieves 80.1% AO on GOT-10K, 2.6 percentage points above the 77.5% AO of the previous best method, ARTrackV2.

Performance on Other Benchmarks

  • TrackingNet: 85.7% AUC
  • LaSOT: 73.2% AUC
  • LaSOT_ext: 53.6% AUC, 0.7 percentage points above ARTrackV2
  • VOT2020: 58.6% EAO (LMTrack-384), 55.0% EAO (LMTrack-256)

Efficiency Comparison

Compared to SeqTrack at the same resolution (values given as LMTrack vs SeqTrack):

  • Parameters: 92M vs 89M
  • Computation: 69G vs 148G FLOPs
  • Inference speed: 47 fps vs 21 fps
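
The relative differences follow directly from the figures above. The short calculation below assumes the first value in each pair belongs to LMTrack and the second to SeqTrack, which matches the order of the comparison; the variable names are illustrative.

```python
# Figures copied from the comparison above (first value assumed to be LMTrack).
lmtrack = {"params_m": 92, "gflops": 69, "fps": 47}
seqtrack = {"params_m": 89, "gflops": 148, "fps": 21}

flops_reduction = 1 - lmtrack["gflops"] / seqtrack["gflops"]
speedup = lmtrack["fps"] / seqtrack["fps"]
param_increase = lmtrack["params_m"] / seqtrack["params_m"] - 1

print(f"FLOPs reduction: {flops_reduction:.1%}")
print(f"Speedup: {speedup:.2f}x")
print(f"Parameter increase: {param_increase:.1%}")
```

In other words, LMTrack uses a little under half the FLOPs and runs a little over twice as fast, at the cost of a few percent more parameters.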

Ablation Study

#   Attention        Autoregressive   Update            AO (%)
1   bidirectional    ×                -                 73.0
2   unidirectional   ×                -                 73.9
3   unidirectional   ×                update template   74.1
4   unidirectional   ×                TCM               75.0
5   unidirectional   ✓                update template   75.6
6   unidirectional   ✓                TCM               76.3

Key Findings:

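  1. Unidirectional attention: Improves AO by 0.9 points over bidirectional attention (from 73.0 to 73.9), since it prevents noise from propagating from search tokens to reference tokens
  2. Autoregressive tracking: Improves AO by 1.3-1.5 points over the non-autoregressive variants with the same update strategy (from 74.1 to 75.6 with template update, from 75.0 to 76.3 with TCM)
  3. TCM module: Improves AO by 0.7-0.9 points over the template update strategy (from 74.1 to 75.0 without autoregression, from 75.6 to 76.3 with it)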
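
The gains quoted in the key findings can be re-derived from the AO column of the ablation table. The snippet below is only a consistency check on the reported numbers; the pairing of configurations is an interpretation of the table rather than something the source states explicitly.

```python
# AO (%) per ablation configuration, keyed by its number in the table.
ao = {1: 73.0, 2: 73.9, 3: 74.1, 4: 75.0, 5: 75.6, 6: 76.3}


def gain(new, base):
    """AO difference in percentage points between two configurations."""
    return round(ao[new] - ao[base], 1)


unidirectional_gain = gain(2, 1)               # bidirectional vs unidirectional
autoregressive_gains = (gain(6, 4), gain(5, 3))  # same update strategy, with vs without
tcm_gains = (gain(6, 5), gain(4, 3))           # TCM vs template update
```

These evaluate to 0.9, (1.3, 1.5), and (0.7, 0.9) respectively, in agreement with the ranges stated above.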

Visualization Analysis

TCM Module Visualization

The visualization shows how the TCM module extracts important reference tokens over time: most background tokens become unimportant and are dropped, while the retained tokens primarily describe the target's appearance.

Attention Comparison

A comparison with OSTrack shows that LMTrack, by using reference tokens, better resists appearance changes and distractors and keeps its focus on the target.

Related Work

Traditional Tracking Frameworks

Early methods relied primarily on the initial template: Siamese networks, for example, match the initial target template against candidate regions. Such methods struggle to adapt to significant changes in target appearance.

Temporal Context-aware Tracking

To handle appearance changes, many trackers model visual tracking as an online learning problem:

  • UpdateNet: Uses custom networks to fuse accumulated templates
  • ATOM: Adds IoU prediction branch to constrain template selection
  • STMTrack: Updates dynamic templates at fixed intervals
  • SeqTrack: Uses likelihood-based strategies to select dynamic templates

Limitations of these methods:

  1. Updating templates based on bounding box cropping easily introduces noise
  2. Using manual methods or additional discriminative models to update templates fails to distinguish which context is important for tracking

Conclusions and Discussion

Main Conclusions

  1. LMTrack significantly improves tracking performance through token-level context awareness
  2. The TCM module effectively collects and updates important reference tokens
  3. The unidirectional attention mechanism improves feature fusion efficiency and accuracy
  4. Achieves state-of-the-art performance on multiple benchmarks while improving computational efficiency

Limitations

  1. Computational complexity: Although more efficient than SeqTrack, still requires maintaining and updating reference tokens
  2. Hyperparameter sensitivity: Selection of k value and update frequency may affect performance
  3. Long-term tracking: Reference token management strategies in extremely long sequences require further optimization

Future Directions

  1. Explore more efficient token importance assessment methods
  2. Research adaptive reference token length control strategies
  3. Extend to multi-object tracking scenarios

In-Depth Evaluation

Strengths

  1. Strong innovation: The transition from frame-level to token-level context is an important innovation
  2. Solid theoretical foundation: Experimentally validates the hypothesis about the token importance distribution
  3. Comprehensive experiments: Thorough evaluation on multiple benchmarks with detailed ablation studies
  4. High practical value: Improves performance while enhancing computational efficiency
  5. Clear visualization: Effectively demonstrates the working principles of the method

Weaknesses

  1. Method complexity: The TCM module design is relatively complex, potentially affecting implementation and tuning
  2. Parameter sensitivity: Multiple hyperparameters (k value, update frequency, etc.) require careful tuning
  3. Insufficient theoretical analysis: Lacks theoretical analysis of method convergence and stability
  4. Limited scope: Primarily targets single-object tracking; applicability to multi-object scenarios remains unverified

Impact

  1. Academic contribution: Provides new research directions for context-aware tracking
  2. Practical value: Maintains high performance while improving efficiency
  3. Reproducibility: Provides complete implementation details and code

Applicable Scenarios

  1. Real-time tracking applications: High inference speed suitable for real-time scenarios
  2. Long-term tracking tasks: Adaptive token management suitable for long-sequence tracking
  3. Complex environment tracking: Effectively handles appearance changes and distractors

References

This paper cites important works in the object tracking field, including:

  • Siamese network series (SiamRPN++, SiamFC++)
  • Transformer trackers (TransT, STARK, Mixformer)
  • Context-aware methods (STMTrack, SeqTrack, OSTrack)
  • Attention mechanisms (Transformer, ViT)

Overall Assessment: This is a high-quality computer vision paper. The proposed LMTrack method demonstrates excellence in both theoretical innovation and experimental validation. The design philosophy of "less is more" and token-level context awareness provide new research directions for the object tracking field, possessing significant academic value and practical significance.