2025-11-19T15:28:14.078632

Cross-Layer Cache Aggregation for Token Reduction in Ultra-Fine-Grained Image Recognition

Rios, Yuanda, Ghanz et al.

Ultra-fine-grained image recognition (UFGIR) is a challenging task that involves classifying images within a macro-category. While traditional fine-grained image recognition (FGIR) distinguishes between species, UFGIR goes further and classifies sub-categories within a species, such as cultivars of a plant. Recently, Vision Transformer-based backbones have enabled outstanding recognition performance on this task, but at a significant computational cost, especially since the task benefits considerably from higher-resolution images. Token reduction techniques have therefore emerged to reduce this cost. However, dropping tokens discards information that is essential for fine-grained categories, especially as the token keep rate is reduced. To counteract this loss of information, we propose a novel Cross-Layer Aggregation Classification Head and a Cross-Layer Cache mechanism that recover and access information from previous layers at later locations. Extensive experiments covering more than 2000 runs across diverse settings, including 5 datasets, 9 backbones, 7 token reduction methods, 5 keep rates, and 2 image sizes, demonstrate the effectiveness of the proposed plug-and-play modules and allow us to push the boundaries of accuracy vs. cost for UFGIR by reducing the kept tokens to extremely low ratios, as low as 10%, while maintaining accuracy competitive with state-of-the-art models. Code is available at: https://github.com/arkel23/CLCA


Basic Information

Paper ID: 2501.00243
Title: Cross-Layer Cache Aggregation for Token Reduction in Ultra-Fine-Grained Image Recognition
Authors: Edwin Arkel Rios, Jansen Christopher Yuanda, Vincent Leon Ghanz, Cheng-Wei Yu, Bo-Cheng Lai, Min-Chun Hu
Category: cs.CV
Published: December 31, 2024
Paper link: https://arxiv.org/abs/2501.00243
Code link: https://github.com/arkel23/CLCA

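Summary

This paper addresses the computational efficiency problem in ultra-fine-grained image recognition (UFGIR) and proposes a novel Cross-Layer Cache Aggregation (CLCA) method. UFGIR is a highly challenging task that requires classification within a macro-category, for example identifying plant cultivars. Although Vision Transformer-based methods achieve excellent performance on this task, they do so at a significantly higher computational cost. To address the information loss caused by token reduction, the paper proposes a Cross-Layer Aggregation (CLA) classification head and a Cross-Layer Cache (CLC) mechanism. More than 2000 experimental runs show that the method maintains accuracy comparable to existing state-of-the-art methods even in the extreme case where the token keep rate is reduced to 10%.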

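Background and Motivation

Problem Definition

Core problem: the computational efficiency of Vision Transformers in ultra-fine-grained image recognition (UFGIR)
Task characteristics: UFGIR is harder than conventional fine-grained recognition because it must distinguish sub-categories within a single species (e.g., plant cultivars)
Existing challenges:
- ViTs perform well on FGIR tasks, but their computational complexity is O(N²) or even O(N³)
- High-resolution images are crucial for fine-grained recognition but further increase the computational burden
- Token reduction lowers the computational cost but discards key discriminative information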

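Research Motivation

Existing token reduction methods lower the computational cost but inevitably discard information that is crucial for fine-grained classification. The loss becomes more severe as the token keep rate decreases, which degrades classification performance.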

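Core Contributions

A Cross-Layer Aggregation (CLA) classification head: feeds features from intermediate Transformer layers directly into the classification module, providing richer discriminative information
A Cross-Layer Cache (CLC) mechanism: stores and retrieves key information from earlier layers to compensate for the information lost during token reduction
A plug-and-play CLCA framework: a complete method combining CLA and CLC that is compatible with a variety of token reduction techniques
Large-scale experimental validation: more than 2000 runs across 5 datasets, 9 backbones, and 7 token reduction methods demonstrate the effectiveness and generality of the method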

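Split the image into patches of size P×P and flatten them into a sequence of length N = (S₁/P) × (S₂/P)
Add a learnable CLS token and positional encodings
Divide the L transformer encoder layers into g groups, each built from multi-head self-attention (MHSA) and position-wise feed-forward network (PWFFN) blocks
Apply the token reduction operation at the last layer of each group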
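
The steps above can be illustrated with a small token-count calculation. This is a minimal sketch, not the authors' code: the helper names are hypothetical, the CLS token is left out of the count, and it assumes that the keep rate is applied to the remaining patch tokens at every reduction location and that fractional counts are rounded up. The patch size of 16 and the reduction layers 4, 7, and 10 are taken from the implementation details listed later in this summary.

```python
import math


def num_patches(height, width, patch_size):
    """Sequence length N = (S1 / P) * (S2 / P) for non-overlapping P x P patches."""
    return (height // patch_size) * (width // patch_size)


def token_schedule(n_patches, num_layers, reduce_layers, keep_rate):
    """Patch tokens entering each layer (1-indexed layers).

    After every layer listed in `reduce_layers`, only a fraction `keep_rate`
    of the current patch tokens is kept (rounded up; an assumption of this sketch).
    """
    counts = []
    current = n_patches
    for layer in range(1, num_layers + 1):
        counts.append(current)
        if layer in reduce_layers:
            current = max(1, math.ceil(current * keep_rate))
    return counts


n = num_patches(448, 448, 16)
schedule = token_schedule(n, num_layers=12, reduce_layers=(4, 7, 10), keep_rate=0.5)
print(n)         # 784
print(schedule)  # [784, 784, 784, 784, 392, 392, 392, 196, 196, 196, 98, 98]
```

Under these assumptions, a 448×448 input starts with 784 patch tokens and ends with 98 at a 50% keep rate, which is where the savings in the quadratic attention cost come from.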

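2. Cross-Layer Aggregation (CLA) Classification Head

The core design of the CLA head:

Input: the CLS token outputs of the encoder groups
1. Feature concatenation and reshaping: CLS ∈ R^(D×g)
2. Batch normalization
3. Depthwise convolution aggregation: Agg = DWConv(BN([CLS_G1; CLS_G2; ...; CLS_Gg]))
4. Nonlinear activation: BatchNorm and GELU model complex relationships
5. Pointwise convolution classification: preds = PWConv(GELU(BN(Agg)))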
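
The five steps can be traced in a few lines of NumPy. This is a rough sketch under stated assumptions rather than the released implementation: it assumes the depthwise convolution uses one kernel per channel that spans all g group outputs (so it reduces to a per-channel weighted sum), that batch normalization uses batch statistics without learnable scale or shift, and that the pointwise convolution is a plain linear map without bias. The function names, the tanh approximation of GELU, and the random weights are illustrative only.

```python
import numpy as np


def batch_norm(x, eps=1e-5):
    """Normalize per channel (axis 1) using batch statistics, no learnable affine."""
    axes = tuple(i for i in range(x.ndim) if i != 1)
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))


def cla_head(cls_tokens, dw_weight, pw_weight):
    """cls_tokens: (B, D, g), dw_weight: (D, g), pw_weight: (num_classes, D)."""
    x = batch_norm(cls_tokens)                 # step 2: BN over the stacked CLS tokens
    agg = (x * dw_weight[None]).sum(axis=-1)   # step 3: depthwise conv, kernel size g -> (B, D)
    hidden = gelu(batch_norm(agg))             # step 4: BN + GELU
    return hidden @ pw_weight.T                # step 5: pointwise conv -> (B, num_classes)


rng = np.random.default_rng(0)
batch, dim, groups, num_classes = 8, 16, 4, 5
cls_tokens = rng.normal(size=(batch, dim, groups))   # step 1: CLS in R^(D x g) per sample
preds = cla_head(
    cls_tokens,
    dw_weight=rng.normal(size=(dim, groups)),
    pw_weight=rng.normal(size=(num_classes, dim)),
)
print(preds.shape)  # (8, 5)
```

The point of the sketch is the data flow: every group contributes its own CLS token, and each feature channel learns its own mixing weights across depth before the final classifier sees it.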

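3. Cross-Layer Cache (CLC) Mechanism

The CLC workflow:

Caching stage:

After each transformer encoder block, store the global average pooling (GAP) of the local features
Introduce a learnable cross-layer register (CLR) token that aggregates discriminative information across layers
Store the GAP features and the CLR token in the cache

Retrieval stage:

After a token reduction location or before the last layer, retrieve the stored information from the CLC
Append the retrieved tokens to the original sequence
Clear the cache to avoid reusing entries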
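
The caching and retrieval stages can be sketched as a toy forward loop. Everything here is illustrative and assumed rather than taken from the released code: `toy_block` and `toy_reduce` are hypothetical stand-ins for a real encoder block and a real token reduction method, the sequence layout (row 0 = CLS, row 1 = CLR, remaining rows = local tokens) is a choice made for this sketch, and restored tokens are simply treated as local tokens afterwards.

```python
import math

import numpy as np


class CrossLayerCache:
    """Holds per-block summary tokens until they are restored into the sequence."""

    def __init__(self):
        self.entries = []

    def store(self, local_tokens, clr_token):
        self.entries.append(local_tokens.mean(axis=0))  # GAP of the local features
        self.entries.append(clr_token.copy())           # snapshot of the CLR token

    def restore_and_clear(self):
        restored = np.stack(self.entries)
        self.entries = []                                # clear to avoid reuse
        return restored


def toy_block(tokens):
    """Stand-in for an encoder block: mixes every token with the sequence mean."""
    return 0.5 * (tokens + tokens.mean(axis=0, keepdims=True))


def toy_reduce(tokens, keep_rate):
    """Stand-in reduction: keep CLS, CLR, and the highest-norm local tokens."""
    special, local = tokens[:2], tokens[2:]
    keep = max(1, math.ceil(keep_rate * len(local)))
    top = np.sort(np.argsort(np.linalg.norm(local, axis=1))[-keep:])
    return np.concatenate([special, local[top]], axis=0)


def forward(tokens, num_blocks=12, reduce_at=(4, 7, 10), keep_rate=0.5):
    cache = CrossLayerCache()
    lengths = []
    for layer in range(1, num_blocks + 1):
        tokens = toy_block(tokens)
        cache.store(tokens[2:], tokens[1])               # caching stage
        if layer in reduce_at:
            tokens = toy_reduce(tokens, keep_rate)
        if layer in reduce_at or layer == num_blocks - 1:
            # retrieval stage: append cached tokens, then the cache is emptied
            tokens = np.concatenate([tokens, cache.restore_and_clear()], axis=0)
        lengths.append(tokens.shape[0])
    return tokens, lengths


rng = np.random.default_rng(0)
sequence = rng.normal(size=(2 + 196, 32))  # [CLS, CLR] + 196 patch tokens, D = 32
output, lengths = forward(sequence)
print(lengths)  # [198, 198, 198, 108, 108, 108, 61, 61, 61, 38, 40, 40]
```

In this toy run each retrieval adds only a handful of tokens (two per cached block), so the restored summaries cost far less than the tokens that were removed.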

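Technical Innovations

Information preservation strategy: a cache retains key information that would otherwise be lost during token reduction
Cross-layer feature fusion: features from different depths are integrated directly into the classification decision
Plug-and-play design: combines seamlessly with a variety of existing token reduction methods
Gradient optimization: a skip-connection-like design improves training stability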

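Experimental Setup

Datasets

Five ultra-fine-grained leaf datasets are used:

SoyGene: soybean genotype classification
SoyLocal: local soybean cultivars
SoyAgeing: soybean ageing stages
SoyGlobal: global soybean cultivars
Cotton: cotton cultivars

Each class corresponds to a confirmed cultivar name obtained from a genetic resource bank.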

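Evaluation Metrics

Primary metric: top-1 accuracy (%)
Efficiency metric: FLOPs (number of floating-point operations)
Statistics: results averaged over 3 random seeds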
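
As a small illustration of how such a number could be computed, the sketch below averages top-1 accuracy over three runs. The labels and predictions are made-up toy values, not results from the paper, and the function name is hypothetical.

```python
from statistics import mean


def top1_accuracy(predictions, labels):
    """Percentage of samples whose predicted class equals the ground-truth class."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


# Toy ground truth and one prediction list per random seed (hypothetical values).
labels = [0, 1, 2, 1]
predictions_per_seed = [
    [0, 1, 2, 1],
    [0, 1, 1, 1],
    [0, 2, 2, 0],
]

per_seed = [top1_accuracy(preds, labels) for preds in predictions_per_seed]
reported = mean(per_seed)
print(per_seed, reported)  # [100.0, 75.0, 50.0] 75.0
```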

Compared Methods

SOTA methods: ViT, DeiT, TransFG, SIM-Tr, CSDNet

Token reduction methods:

Static pruning: DynamicViT
Dynamic pruning: ATS
Soft merging: SiT, PatchMerger
Hard merging: DPCKNN, ToMe
Attention-driven: EViT

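Implementation Details

Optimizer: AdamW
Training length: 50 epochs
Weight decay: 0.05
Batch size: 32
Image sizes: 224×224, 448×448
Backbones: 9 pretrained models (ViT, DeiT3, MIIL, MoCov3, DINO, MAE, CLIP, etc.)
Keep rates: 100%, 70%, 50%, 25%, 10%
Token reduction locations: layers 4, 7, and 10 (12-layer ViT B-16)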

Experimental Results

Main Results

Method	Cotton	SoyAgeing	SoyGlobal	FLOPs (10⁹)
ViT	52.5	67.0	40.6	78.5
DeiT	54.2	69.5	45.3	78.5
TransFG	54.6	72.2	21.2	447.9
CSDNet	57.9	75.4	56.3	78.5
CLCA (10%)	55.6	87.4	61.1	25.2
CLCA (70%)	67.8	88.3	58.2	50.9
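
To make the accuracy-vs-cost trade-off in the table easier to read, the following sketch computes the relative FLOPs saving of each CLCA row against the ViT row. The numbers are copied from the table above; the helper name is hypothetical.

```python
def flops_saving(cost, baseline):
    """Relative FLOPs reduction versus a baseline, in percent."""
    return 100.0 * (1.0 - cost / baseline)


baseline_gflops = 78.5  # ViT row in the table
clca_gflops = {"CLCA (10%)": 25.2, "CLCA (70%)": 50.9}

savings = {
    name: round(flops_saving(cost, baseline_gflops), 1)
    for name, cost in clca_gflops.items()
}
print(savings)  # {'CLCA (10%)': 67.9, 'CLCA (70%)': 35.2}
```

Read together with the accuracy columns, the 10% keep-rate row uses roughly a third of the baseline FLOPs while scoring above the plain ViT row on all three listed datasets.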