2025-11-17T17:07:12.969103

Less is More: Token Context-aware Learning for Object Tracking

Xu, Zhong, Liang et al.

Recently, several studies have shown that utilizing contextual information to perceive target states is crucial for object tracking. They typically capture context by incorporating multiple video frames. However, these naive frame-context methods fail to consider the importance of each patch within a reference frame, making them susceptible to noise and redundant tokens, which deteriorates tracking performance. To address this challenge, we propose a new token context-aware tracking pipeline named LMTrack, designed to automatically learn high-quality reference tokens for efficient visual tracking. Embracing the principle of Less is More, the core idea of LMTrack is to analyze the importance distribution of all reference tokens, where important tokens are collected, continually attended to, and updated. Specifically, a novel Token Context Memory module is designed to dynamically collect high-quality spatio-temporal information of a target in an autoregressive manner, eliminating redundant background tokens from the reference frames. Furthermore, an effective Unidirectional Token Attention mechanism is designed to establish dependencies between reference tokens and the search frame, enabling robust cross-frame association and target localization. Extensive experiments demonstrate the superiority of our tracker, achieving state-of-the-art results on tracking benchmarks such as GOT-10K, TrackingNet, and LaSOT.

Basic Information

Paper ID: 2501.00758
Title: Less is More: Token Context-aware Learning for Object Tracking
Authors: Chenlong Xu, Bineng Zhong, Qihua Liang, Yaozong Zheng, Guorong Li, Shuxiang Song
Category: cs.CV (Computer Vision)
Published/Venue: AAAI 2025
Paper link: https://arxiv.org/abs/2501.00758
Code link: https://github.com/XuChenLong/LMTrack

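Summary

This paper proposes LMTrack, a new object tracking method based on token context awareness. Existing context-aware trackers usually capture context from multiple frames, but these naive frame-level methods ignore how much the patches within a reference frame differ in importance, which leaves them vulnerable to noise and redundant tokens. Following the "Less is More" principle, LMTrack analyzes the importance distribution of all reference tokens, then collects, continually attends to, and updates the important ones. The method has two core components, the Token Context Memory (TCM) module and a unidirectional token attention mechanism, and it achieves state-of-the-art performance on several tracking benchmarks.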

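Background and Motivation

Problem Definition

Object tracking aims to locate and follow an arbitrary target throughout a video sequence, given its initial position. Recent studies show that using contextual information to perceive the target state is crucial for object tracking.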

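Limitations of Existing Methods

Coarse frame-level context: existing methods treat the frame as the smallest unit of context and ignore how much each patch in a reference frame matters for locating the target in the search frame.
Interference from redundant information: treating all reference tokens equally increases the model's perceptual and computational burden, especially in complex scenes.
Lack of adaptivity: hand-crafted strategies make the tracker passively accept reference frames instead of letting it decide for itself which target reference information to use.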

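Motivation

An analysis with a simple Transformer tracker shows that most background tokens are rarely referenced during tracking and have little effect on the result, whereas target tokens are largely retained as long-term reference cues. This supports the hypothesis that a small number of high-quality tokens play the key role in tracking.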

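Core Contributions

A new token context-aware tracking pipeline, LMTrack: built on the Token Context Memory module and unlike existing frame-level context trackers, LMTrack automatically collects and updates high-quality token context for visual tracking.
An effective unidirectional attention mechanism: it establishes dependencies between reference tokens and the search frame through one-way propagation, enabling robust cross-frame association and localization.
State-of-the-art tracking performance: new best results on five visual tracking benchmarks, namely LaSOT, TrackingNet, GOT-10K, LaSOText, and VOT2020.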

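Method

Task Definition

Given the initial target position, the tracker must keep locating and following the target through the video sequence. The input is a sequence of video frames, and the output is the target's bounding box in each frame.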

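Model Architecture

Overall Framework

LMTrack adopts an autoregressive, token context-aware tracking framework with three main components:

Backbone network with a unidirectional attention mechanism
Token Context Memory (TCM) module
Prediction head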

Autoregressive Tracking Process

The tracking process is defined as:

$R_0 = f(I_0, \varnothing), \quad t = 0$
$B_t, R_t = f(I_t, R_{t-1}) = f(I_t, f(I_{t-1}, R_{t-2})), \quad t > 0$

where R denotes the reference tokens, I the image frame, and B the predicted bounding box.
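The recursion above can be read as a loop that carries the reference tokens from one frame to the next. The following is a minimal sketch under stated assumptions, not the authors' implementation: `run_tracker` and `toy_step` are hypothetical names, `toy_step` is a dummy stand-in for f, and the (x, y, w, h) box format is assumed for illustration.

```python
from typing import Callable, List, Optional, Tuple

Box = Tuple[float, float, float, float]   # (x, y, w, h); the box format is an assumption
Tokens = List[str]                        # stand-in for reference token features


def run_tracker(
    frames: List[str],
    step: Callable[[str, Optional[Tokens]], Tuple[Optional[Box], Tokens]],
) -> List[Optional[Box]]:
    """Apply B_t, R_t = f(I_t, R_{t-1}) over a sequence, starting from R_0 = f(I_0, None)."""
    _, reference = step(frames[0], None)      # t = 0: build the initial reference tokens
    boxes: List[Optional[Box]] = []
    for frame in frames[1:]:                  # t > 0: predict, then carry the tokens forward
        box, reference = step(frame, reference)
        boxes.append(box)
    return boxes


def toy_step(frame: str, reference: Optional[Tokens]):
    """Hypothetical f: records which frames have contributed reference tokens."""
    if reference is None:
        return None, [frame]                  # first frame: no box is predicted
    return (0.0, 0.0, 1.0, 1.0), reference + [frame]
```

The point of the loop form is that each prediction depends on the previous frame only through the reference tokens, which is what makes the process autoregressive.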

Token Context Memory (TCM) Module

The TCM module works in three steps:

Step 1: Collect the important tokens from the reference tokens

$W = \sum_{j=1}^{L} A_j \times C$
$R' = \mathrm{Topk}(\mathrm{Rank}(R, W))$

where A is the cross-attention matrix, C is the classification score map, and W is the importance distribution.
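Step 1 can be illustrated with a small NumPy sketch. The tensor layout is an assumption, since the summary does not give shapes: each A_j is taken to be a search-to-reference attention matrix of shape (N_s, N_r), the sum over j is taken to run over L attention layers, and C is a vector of N_s scores, so the product is written as the transpose of A_j times C. The function name `select_reference_tokens` is hypothetical.

```python
import numpy as np


def select_reference_tokens(R, attn_layers, C, k):
    """Keep the k reference tokens with the highest importance W.

    R:           (N_r, D) reference tokens
    attn_layers: list of L arrays, each (N_s, N_r), search-to-reference attention (assumed layout)
    C:           (N_s,) classification scores over search tokens
    """
    # W[i] = sum_j sum_s C[s] * A_j[s, i]: attention received by reference token i,
    # weighted by how target-like each attending search token is.
    W = sum(A.T @ C for A in attn_layers)
    keep = np.argsort(-W, kind="stable")[:k]   # indices of the k largest scores
    keep.sort()                                # restore original token order (a choice of this sketch)
    return R[keep], W
```

Under this reading, a reference token scores highly only if it is attended to by search tokens that the classifier considers part of the target, which is how background tokens get filtered out.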

Step 2: Integrate the classification map with the search tokens

$S' = S + C_{bin} E_{target} + (1 - C_{bin}) E_{background}$
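The formula above adds a target embedding to search tokens flagged as target and a background embedding to the rest. A minimal NumPy sketch follows. How the binary map is obtained is not stated in this summary, so thresholding the scores at 0.5 is an assumption, as are the function name `embed_classification` and the use of plain arrays in place of learned embeddings.

```python
import numpy as np


def embed_classification(S, C, E_target, E_background, threshold=0.5):
    """S' = S + C_bin * E_target + (1 - C_bin) * E_background.

    S: (N_s, D) search tokens; C: (N_s,) classification scores;
    E_target, E_background: (D,) embeddings (learned in a real model, plain arrays here).
    """
    # Binarizing C with a fixed threshold is an assumption of this sketch.
    C_bin = (C > threshold).astype(S.dtype)[:, None]      # (N_s, 1), broadcasts over D
    return S + C_bin * E_target + (1.0 - C_bin) * E_background
```

Each search token receives exactly one of the two embeddings, so the tokens carry an explicit target or background label when they later serve as reference tokens.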

Step 3: Update the reference tokens

The results of Steps 1 and 2 are merged to form the new reference tokens R_t.

Unidirectional Attention Mechanism

$S = \mathrm{Softmax}\left(\frac{[Q_s K_r^{\top};\; Q_s K_s^{\top}]}{\sqrt{d_k}}\right)[V_r;\; V_s]$

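Information flows in one direction only: reference tokens may influence search tokens, but not the reverse, which keeps the reference token representations consistent.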
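The attention formula above can be sketched for a single head in NumPy. This is an illustration under assumptions rather than the paper's code: the projections are taken as given inputs, multi-head splitting is omitted, and the function names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def unidirectional_attention(Q_s, K_r, V_r, K_s, V_s):
    """Search queries attend to reference and search keys; reference tokens are not updated.

    Q_s, K_s, V_s: (N_s, d_k) search projections; K_r, V_r: (N_r, d_k) reference projections.
    """
    d_k = Q_s.shape[-1]
    K = np.concatenate([K_r, K_s], axis=0)    # (N_r + N_s, d_k)
    V = np.concatenate([V_r, V_s], axis=0)    # (N_r + N_s, d_k)
    scores = Q_s @ K.T / np.sqrt(d_k)         # (N_s, N_r + N_s): [Q_s K_r^T, Q_s K_s^T] / sqrt(d_k)
    return softmax(scores) @ V                # (N_s, d_k): only search tokens get new values
```

Because no reference query appears in the computation, the reference tokens leave the layer unchanged, and the cost scales with N_s rather than with the full N_r + N_s set of queries.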

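Technical Innovations

From frame-level to token-level context: conventional frame-level context is replaced by fine-grained token-level context that represents the important reference cues.
Adaptive importance analysis: token importance is derived from the attention matrices together with the classification results rather than from a fixed strategy.
Unidirectional information flow: search tokens are prevented from contaminating the reference token representations, which improves fusion efficiency.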

Experimental Setup

Datasets

Training data: LaSOT, GOT-10K, TrackingNet, COCO
Test benchmarks: GOT-10K (180 test sequences), TrackingNet (511 videos), LaSOT (280 test videos), LaSOText (150 videos), VOT2020 (60 challenging sequences)

Evaluation Metrics

GOT-10K: Average Overlap (AO), Success Rate (SR)
LaSOT/LaSOText: Area Under Curve (AUC), Precision (P), Normalized Precision (PNorm)
TrackingNet: AUC, P, PNorm
VOT2020: Expected Average Overlap (EAO), Accuracy, Robustness
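To make the overlap-based metrics concrete, here is a minimal sketch of AO and SR for a single sequence. It assumes boxes in (x, y, w, h) format with a top-left origin and uses hypothetical function names; official benchmark toolkits handle aggregation across sequences and other details that are not reproduced here.

```python
def iou(a, b):
    """IoU of two boxes given as (x, y, w, h) with (x, y) the top-left corner."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))   # width of the intersection
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))   # height of the intersection
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def average_overlap(preds, gts):
    """AO: mean IoU between predicted and ground-truth boxes over all frames."""
    overlaps = [iou(p, g) for p, g in zip(preds, gts)]
    return sum(overlaps) / len(overlaps)


def success_rate(preds, gts, threshold=0.5):
    """SR: fraction of frames whose IoU exceeds the threshold."""
    overlaps = [iou(p, g) for p, g in zip(preds, gts)]
    return sum(o > threshold for o in overlaps) / len(overlaps)
```

For example, one perfect prediction and one prediction shifted by half a box width give IoUs of 1 and 1/3, so AO is 2/3 and SR at threshold 0.5 is 0.5.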

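Implementation Details

Backbone: ViT-Base
Optimizer: AdamW, learning rate 4×10⁻⁵ (backbone) and 4×10⁻⁴ (other parameters)
Training: 300 epochs, batch size 16, Tesla A100 GPU
Inference: by default, the reference update is checked every 400 frames; the maximum reference token length is 2× the search token length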
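The settings above can be expressed as a few small helpers. This is a hedged sketch: the `backbone.` name prefix, the helper names, and the reading of "every 400 frames" as "on frame indices that are multiples of 400" are all assumptions. The list-of-dicts layout matches the per-parameter-group format that `torch.optim.AdamW` accepts, but the code itself is pure Python.

```python
def build_param_groups(named_params, backbone_prefix="backbone.",
                       backbone_lr=4e-5, other_lr=4e-4):
    """Split (name, parameter) pairs into two groups with different learning rates."""
    backbone, others = [], []
    for name, param in named_params:
        # The "backbone." prefix is a hypothetical naming convention.
        (backbone if name.startswith(backbone_prefix) else others).append(param)
    return [
        {"params": backbone, "lr": backbone_lr},
        {"params": others, "lr": other_lr},
    ]


def should_check_update(frame_index, interval=400):
    """True on frames where the reference tokens are re-examined (assumed schedule)."""
    return frame_index > 0 and frame_index % interval == 0


def reference_budget(num_search_tokens, ratio=2):
    """Maximum number of reference tokens: ratio times the search token count."""
    return ratio * num_search_tokens
```

With PyTorch, the groups could be passed as `torch.optim.AdamW(build_param_groups(model.named_parameters()))`, where `model` is whatever tracker module is being trained.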

Experimental Results

Main Results

GOT-10K Benchmark

LMTrack384 reaches 80.1% AO on GOT-10K, 2.6 percentage points above the previous best method, ARTrackV2 (77.5% AO).

Performance on Other Benchmarks

TrackingNet: 85.7% AUC
LaSOT: 73.2% AUC
LaSOText: 53.6% AUC, 0.7 percentage points above ARTrackV2
VOT2020: 58.6% EAO (LMTrack384), 55.0% EAO (LMTrack256)

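Efficiency Comparison

Compared with SeqTrack at the same resolution (values given as LMTrack vs. SeqTrack):

Parameters: 92M vs. 89M
Computation: 69G vs. 148G FLOPs
Inference speed: 47 fps vs. 21 fps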
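As a quick worked comparison, and assuming each pair above lists LMTrack first and SeqTrack second, the ratios come out roughly as follows:

```latex
\frac{148\ \text{GFLOPs}}{69\ \text{GFLOPs}} \approx 2.14, \qquad
\frac{47\ \text{fps}}{21\ \text{fps}} \approx 2.24, \qquad
\frac{92\ \text{M}}{89\ \text{M}} \approx 1.03
```

Under that assumption, LMTrack needs less than half the computation and runs more than twice as fast, with about 3% more parameters.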

Ablation Study

#	Attention	Autoregressive	Update	AO(%)
1	bidirectional	×	-	73.0
2	unidirectional	×	-	73.9
3	unidirectional	×	update template	74.1
4	unidirectional	×	TCM	75.0
5	unidirectional	✓	update template	75.6
6	unidirectional	✓	TCM	76.3