Cross-Layer Cache Aggregation for Token Reduction in Ultra-Fine-Grained Image Recognition
Rios, Yuanda, Ghanz et al.
Ultra-fine-grained image recognition (UFGIR) is a challenging task that involves classifying images within a macro-category. While traditional fine-grained image recognition (FGIR) deals with classifying different species, UFGIR goes further by classifying sub-categories within a species, such as cultivars of a plant. Recently, Vision Transformer-based backbones have allowed methods to obtain outstanding recognition performance on this task, but at a significant computational cost, especially since the task benefits significantly from higher-resolution images. Techniques such as token reduction have emerged to reduce this cost. However, dropping tokens leads to a loss of information essential for fine-grained categories, especially as the token keep rate is reduced. To counteract this loss, we propose a novel Cross-Layer Aggregation Classification Head and a Cross-Layer Cache mechanism to recover and access information from previous layers at later locations. Extensive experiments covering more than 2000 runs across diverse settings, including 5 datasets, 9 backbones, 7 token reduction methods, 5 keep rates, and 2 image sizes, demonstrate the effectiveness of the proposed plug-and-play modules and allow us to push the boundaries of accuracy vs. cost for UFGIR by reducing the kept tokens to ratios as low as 10% while maintaining accuracy competitive with state-of-the-art models. Code is available at: https://github.com/arkel23/CLCA
This paper addresses the computational efficiency challenges in ultra-fine-grained image recognition (UFGIR) tasks by proposing a novel Cross-Layer Cache Aggregation (CLCA) method. UFGIR is an extremely challenging task requiring classification within macro-categories, such as plant variety identification. While Vision Transformer-based methods achieve excellent performance on this task, they incur significantly increased computational costs. To address information loss during token reduction, this paper proposes a Cross-Layer Aggregation (CLA) classification head and Cross-Layer Cache (CLC) mechanism. Validated through over 2000 experiments, the method maintains accuracy comparable to existing state-of-the-art approaches even in extreme cases where token retention drops to 10%.
Core Problem: Computational efficiency of Vision Transformers in ultra-fine-grained image recognition (UFGIR)
Task Characteristics: UFGIR is more challenging than traditional fine-grained recognition, requiring distinction of sub-categories within the same species (e.g., plant varieties)
Existing Challenges:
ViT demonstrates superior performance on FGIR tasks but has computational complexity of O(N²) or even O(N³) in the number of tokens N
High-resolution images are crucial for fine-grained recognition but further increase computational burden
Token reduction techniques reduce computational costs but inevitably lead to loss of critical discriminative information
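To make the resolution cost concrete, the sketch below counts patch tokens N for two square image sizes and the number of query-key pairs in a single self-attention map, which is the quadratic term. The image sizes 224 and 448 and the patch size 16 are illustrative assumptions, not settings taken from the paper, and the function names are hypothetical.

```python
def num_patches(height: int, width: int, patch: int) -> int:
    """Token count N for non-overlapping patch x patch tiles of an image."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch) * (width // patch)


def attention_pairs(n_tokens: int) -> int:
    """Query-key pairs in one self-attention map: the O(N^2) term."""
    return n_tokens * n_tokens


if __name__ == "__main__":
    # Assumed example sizes; doubling the side length quadruples N.
    for size in (224, 448):
        n = num_patches(size, size, patch=16)
        print(f"{size}x{size}: N={n}, attention pairs={attention_pairs(n)}")
```

Under these assumptions, doubling the image side multiplies N by 4 and the attention-pair count by 16, which is why reducing tokens matters most at higher resolutions.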
Existing token reduction methods discard information that is critical for fine-grained classification in exchange for lower computational cost. The loss grows more severe as the token keep rate decreases, degrading classification performance.
Proposes a Cross-Layer Aggregation (CLA) Classification Head: Directly integrates features from intermediate Transformer layers into the classification module, providing richer discriminative information
Designs a Cross-Layer Cache (CLC) Mechanism: Stores and restores critical information from previous layers, compensating for information loss during token reduction
Constructs a Plug-and-Play CLCA Framework: A complete method combining CLA and CLC that is compatible with multiple token reduction techniques
Comprehensive Experimental Validation: Over 2000 experiments across 5 datasets, 9 backbone networks, 7 token reduction methods, 5 keep rates, and 2 image sizes, demonstrating the method's effectiveness and generalizability
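The general idea of caching intermediate features and letting the classification head read all of them can be sketched as follows. This is not the paper's implementation: the summary does not say what is cached or how cached features are combined, so the one-vector-per-group cache and the element-wise mean below are stand-in assumptions, and every name (`LayerCache`, `aggregate_mean`, `linear_head`) is hypothetical.

```python
from dataclasses import dataclass, field

Vector = list[float]


@dataclass
class LayerCache:
    """Hypothetical cache holding one feature vector per layer group."""
    entries: list[Vector] = field(default_factory=list)

    def store(self, features: Vector) -> None:
        # Copy so later in-place changes to `features` do not alter the cache.
        self.entries.append(list(features))


def aggregate_mean(cache: LayerCache) -> Vector:
    """Assumed aggregation: element-wise mean over all cached vectors."""
    if not cache.entries:
        raise ValueError("cache is empty")
    count = len(cache.entries)
    dim = len(cache.entries[0])
    return [sum(vec[i] for vec in cache.entries) / count for i in range(dim)]


def linear_head(x: Vector, weights: list[Vector], bias: Vector) -> Vector:
    """Plain linear classifier producing one logit per class."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


if __name__ == "__main__":
    cache = LayerCache()
    # Made-up 2-dimensional features from three layer groups.
    for feats in ([1.0, 2.0], [3.0, 4.0], [5.0, 6.0]):
        cache.store(feats)
    pooled = aggregate_mean(cache)
    print(linear_head(pooled, weights=[[1.0, 0.0], [0.0, 1.0]], bias=[0.0, 0.0]))
```

The point of the sketch is only the data flow: features written early in the network remain available to the head even if the tokens they came from are dropped later.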
Divides an S₁×S₂ image into patches of size P×P, flattened into a sequence of length N=(S₁/P)×(S₂/P)
Adds a learnable CLS token and positional encoding
Partitions the L transformer encoder layers, each containing multi-head self-attention (MHSA) and a position-wise feed-forward network (PWFFN), into g groups
Applies token reduction operations at the final layer of each group
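The grouping and reduction schedule described above can be traced with a small bookkeeping sketch that tracks how many patch tokens survive each layer. It assumes equal-sized groups, a single keep rate applied at every reduction point, and rounding up to at least one token; none of these details are stated in the summary, and the function names are hypothetical.

```python
import math


def reduction_layers(num_layers: int, num_groups: int) -> list[int]:
    """0-based index of the final layer in each of g equal-sized groups."""
    if num_groups <= 0 or num_layers % num_groups:
        raise ValueError("num_layers must be a positive multiple of num_groups")
    group_size = num_layers // num_groups
    return [(k + 1) * group_size - 1 for k in range(num_groups)]


def token_schedule(n_patches: int, num_layers: int, num_groups: int,
                   keep_rate: float) -> list[int]:
    """Patch tokens remaining after each layer (the CLS token is not counted)."""
    drop_at = set(reduction_layers(num_layers, num_groups))
    remaining = n_patches
    schedule = []
    for layer in range(num_layers):
        if layer in drop_at:
            # Assumed rounding rule: keep ceil(keep_rate * tokens), at least one.
            remaining = max(1, math.ceil(keep_rate * remaining))
        schedule.append(remaining)
    return schedule


if __name__ == "__main__":
    # Illustrative values only: N=196 patches, L=12 layers, g=3 groups.
    print(reduction_layers(12, 3))
    print(token_schedule(196, 12, 3, keep_rate=0.5))
```

Because the reduction compounds across groups, the token count after the last group scales roughly with the keep rate raised to the power g, so low keep rates leave very few tokens for the final layers.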
This paper cites 32 important references covering classical works in fine-grained recognition, Vision Transformers, and token reduction, providing a solid theoretical foundation for the research.