Cross-Layer Cache Aggregation for Token Reduction in Ultra-Fine-Grained Image Recognition

Rios, Yuanda, Ghanz et al.
Ultra-fine-grained image recognition (UFGIR) is a challenging task that involves classifying images within a macro-category. While traditional FGIR deals with classifying different species, UFGIR goes further by classifying sub-categories within a species, such as cultivars of a plant. Recently, Vision Transformer-based backbones have allowed methods to achieve outstanding recognition performance on this task, but at a significant computational cost, especially since the task benefits significantly from higher-resolution images. Techniques such as token reduction have therefore emerged to reduce the computational cost. However, dropping tokens loses information that is essential for fine-grained categories, especially as the token keep rate is reduced. To counteract this loss of information, we propose a novel Cross-Layer Aggregation Classification Head and a Cross-Layer Cache mechanism to recover and access information from previous layers in later locations. Extensive experiments covering more than 2000 runs across diverse settings, including 5 datasets, 9 backbones, 7 token reduction methods, 5 keep rates, and 2 image sizes, demonstrate the effectiveness of the proposed plug-and-play modules and allow us to push the boundaries of accuracy vs. cost for UFGIR by reducing the kept tokens to ratios as low as 10% while maintaining accuracy competitive with state-of-the-art models. Code is available at: https://github.com/arkel23/CLCA

Basic Information

  • Paper ID: 2501.00243
  • Title: Cross-Layer Cache Aggregation for Token Reduction in Ultra-Fine-Grained Image Recognition
  • Authors: Edwin Arkel Rios, Jansen Christopher Yuanda, Vincent Leon Ghanz, Cheng-Wei Yu, Bo-Cheng Lai, Min-Chun Hu
  • Category: cs.CV
  • Publication Date: December 31, 2024
  • Paper Link: https://arxiv.org/abs/2501.00243
  • Code Link: https://github.com/arkel23/CLCA

Abstract

This paper addresses the computational efficiency challenges in ultra-fine-grained image recognition (UFGIR) tasks by proposing a novel Cross-Layer Cache Aggregation (CLCA) method. UFGIR is an extremely challenging task requiring classification within macro-categories, such as plant variety identification. While Vision Transformer-based methods achieve excellent performance on this task, they incur significantly increased computational costs. To address information loss during token reduction, this paper proposes a Cross-Layer Aggregation (CLA) classification head and Cross-Layer Cache (CLC) mechanism. Validated through over 2000 experiments, the method maintains accuracy comparable to existing state-of-the-art approaches even in extreme cases where token retention drops to 10%.

Research Background and Motivation

Problem Definition

  1. Core Problem: Computational efficiency of Vision Transformers in ultra-fine-grained image recognition (UFGIR)
  2. Task Characteristics: UFGIR is more challenging than traditional fine-grained recognition, requiring distinction of sub-categories within the same species (e.g., plant varieties)
  3. Existing Challenges:
    • ViT demonstrates superior performance on FGIR tasks but has computational complexity of O(N²) or even O(N³) in the sequence length N
    • High-resolution images are crucial for fine-grained recognition but further increase computational burden
    • Token reduction techniques reduce computational costs but inevitably lead to loss of critical discriminative information

Research Motivation

Existing token reduction methods inevitably lose information critical for fine-grained classification while reducing computational costs. This information loss becomes more severe as token retention rates decrease, affecting model classification performance.

Core Contributions

  1. Proposes Cross-Layer Aggregation (CLA) Classification Head: Directly integrates features from intermediate Transformer layers into the classification module, providing richer discriminative information
  2. Designs Cross-Layer Cache (CLC) Mechanism: Stores and restores critical information from previous layers, compensating for information loss during token reduction
  3. Constructs Plug-and-Play CLCA Framework: A complete method combining CLA and CLC that is compatible with multiple token reduction techniques
  4. Comprehensive Experimental Validation: Over 2000 experiments across 5 datasets, 9 backbone networks, and 7 token reduction methods, demonstrating method effectiveness and generalizability

Method Details

Task Definition

  • Input: High-resolution image I ∈ R^(H×W×3)
  • Output: Ultra-fine-grained category prediction y ∈ {1,2,...,C}
  • Constraint: Significantly reduce computational cost (FLOPs) while maintaining high accuracy

Model Architecture

1. Vision Transformer Encoder Groups

  • Divides images into patches of size P×P, flattened into a sequence of length N=(H/P)×(W/P)
  • Adds learnable CLS token and positional encoding
  • Partitions L transformer encoder layers into g groups, each containing multi-head self-attention (MHSA) and position-wise feed-forward networks (PWFFN)
  • Applies token reduction operations at the final layer of each group
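
To make the sequence lengths concrete, the following is a minimal sketch of how the patch-token count could evolve through a 12-layer encoder. It assumes a patch size of 16 (as in ViT-B/16), reduction at layers 4, 7, and 10, and that each reduction keeps ceil(keep_rate × current) tokens for the layers that follow; the function names and the rounding rule are illustrative assumptions, not taken from the paper's code.

```python
import math

def num_patches(height, width, patch):
    """Sequence length N = (H / P) * (W / P), excluding the CLS token."""
    return (height // patch) * (width // patch)

def token_schedule(n_tokens, keep_rate, reduce_layers=(4, 7, 10), num_layers=12):
    """Patch tokens entering each layer 1..num_layers.

    Assumption: each reduction layer keeps ceil(keep_rate * current) tokens,
    and the reduction takes effect for the layers that follow it.
    """
    schedule = []
    current = n_tokens
    for layer in range(1, num_layers + 1):
        schedule.append(current)
        if layer in reduce_layers:
            current = max(1, math.ceil(keep_rate * current))
    return schedule

n224 = num_patches(224, 224, 16)  # 14 * 14 patches
n448 = num_patches(448, 448, 16)  # 28 * 28 patches
sched = token_schedule(n224, keep_rate=0.5)
```

Doubling the image side from 224 to 448 quadruples N, which is why reducing tokens matters most at the higher resolution.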

2. Cross-Layer Aggregation (CLA) Classification Head

The CLA head core design includes:

Input: CLS token outputs from each encoder group
1. Feature concatenation and reshaping: CLS ∈ R^(D×g)
2. Batch normalization processing
3. Depthwise convolution aggregation: Agg = DWConv(BN([CLS_G1; CLS_G2; ...; CLS_Gg]))
4. Non-linear activation: Models complex relationships through BatchNorm and GELU
5. Pointwise convolution classification: preds = PWConv(GELU(BN(Agg)))
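
The steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the authors' implementation: both BatchNorm layers are treated as identity, the depthwise convolution is read as one kernel of length g per feature channel, and the names `cla_head`, `dw_kernels`, and `pw_weights` are hypothetical.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def cla_head(cls_tokens, dw_kernels, pw_weights):
    """cls_tokens: g vectors of size D (one CLS output per encoder group).
    dw_kernels: D kernels of length g (depthwise: one kernel per channel).
    pw_weights: C vectors of size D (pointwise conv acting as a classifier).
    BatchNorm is omitted here (assumed identity) to keep the sketch short.
    """
    g, dim = len(cls_tokens), len(cls_tokens[0])
    # Depthwise aggregation: each channel mixes its own g group values.
    agg = [sum(dw_kernels[d][k] * cls_tokens[k][d] for k in range(g))
           for d in range(dim)]
    hidden = [gelu(a) for a in agg]
    # Pointwise classification: one logit per class.
    return [sum(w[d] * hidden[d] for d in range(dim)) for w in pw_weights]

cls_tokens = [[1.0, 2.0], [3.0, 4.0]]              # g = 2 groups, D = 2
dw_kernels = [[0.5, 0.5], [1.0, 0.0]]              # hypothetical weights
pw_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # C = 3 classes
logits = cla_head(cls_tokens, dw_kernels, pw_weights)
prediction = max(range(len(logits)), key=logits.__getitem__)
```

Because each channel has its own kernel over the g groups, the aggregation adds only D×g weights on top of the final classifier.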

3. Cross-Layer Cache (CLC) Mechanism

The CLC workflow:

Caching Phase:

  • After each transformer encoder block, stores global average pooling (GAP) of local features
  • Introduces learnable cross-layer register (CLR) tokens that aggregate cross-layer discriminative information
  • Stores GAP features and CLR tokens in the cache

Recovery Phase:

  • After each token reduction position, or before the final layer, recovers the stored information from the CLC
  • Appends recovered tokens to the original sequence
  • Clears cache to prevent reuse
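
A minimal sketch of the caching and recovery phases, using plain Python lists as token vectors. The class name `CrossLayerCache`, its method names, and the choice to pass the CLR token in explicitly are assumptions made for illustration; in the actual method the CLR tokens are learnable parameters.

```python
class CrossLayerCache:
    """Toy cache: stores one GAP vector and one CLR token per encoder block."""

    def __init__(self):
        self._entries = []

    def store(self, patch_tokens, clr_token):
        """Caching phase: keep the global average pool of the local tokens
        together with the (here externally supplied) CLR token."""
        dim = len(patch_tokens[0])
        gap = [sum(tok[d] for tok in patch_tokens) / len(patch_tokens)
               for d in range(dim)]
        self._entries.append(gap)
        self._entries.append(list(clr_token))

    def recover(self, sequence):
        """Recovery phase: append cached tokens, then clear the cache."""
        restored = list(sequence) + self._entries
        self._entries = []
        return restored

    def __len__(self):
        return len(self._entries)

cache = CrossLayerCache()
cache.store([[1.0, 3.0], [3.0, 5.0]], clr_token=[0.1, 0.2])  # after block 1
cache.store([[2.0, 2.0]], clr_token=[0.3, 0.4])              # after block 2
reduced = [[9.0, 9.0]]               # sequence left after token reduction
restored = cache.recover(reduced)
```

Each block contributes only two extra tokens, so the restored sequence stays far shorter than the unreduced one while still carrying a summary of the dropped tokens.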

Technical Innovations

  1. Information Preservation Strategy: Preserves critical information lost during token reduction through caching mechanism
  2. Cross-Layer Feature Fusion: Directly integrates features from different depths into classification decisions
  3. Plug-and-Play Design: Seamlessly combines with existing multiple token reduction methods
  4. Gradient Optimization: Skip connection-like design improves training stability

Experimental Setup

Datasets

Uses 5 ultra-fine-grained leaf datasets:

  • SoyGene: Soybean genotype classification
  • SoyLocal: Local soybean varieties
  • SoyAgeing: Soybean aging stages
  • SoyGlobal: Global soybean varieties
  • Cotton: Cotton varieties

Each category corresponds to a confirmed variety name obtained from genetic resource repositories.

Evaluation Metrics

  • Primary Metric: Top-1 Accuracy (%)
  • Efficiency Metric: FLOPs (floating-point operations)
  • Statistical Method: Average results from 3 random seeds

Comparison Methods

SOTA Methods: ViT, DeiT, TransFG, SIM-Tr, CSDNet

Token Reduction Methods:

  • Static pruning: DynamicViT
  • Dynamic pruning: ATS
  • Soft merging: SiT, PatchMerger
  • Hard merging: DPCKNN, ToMe
  • Attention-driven: EViT

Implementation Details

  • Optimizer: AdamW
  • Training Epochs: 50
  • Weight Decay: 0.05
  • Batch Size: 32
  • Image Size: 224×224, 448×448
  • Backbone Networks: 9 pretrained models (ViT, DeiT3, MIIL, MoCov3, DINO, MAE, CLIP, etc.)
  • Retention Rates: 100%, 70%, 50%, 25%, 10%
  • Token Reduction Positions: Layers 4, 7, 10 (12-layer ViT-B/16)

Experimental Results

Main Results

Method        Cotton   SoyAgeing   SoyGlobal   FLOPs (10⁹)
ViT           52.5     67.0        40.6        78.5
DeiT          54.2     69.5        45.3        78.5
TransFG       54.6     72.2        21.2        447.9
CSDNet        57.9     75.4        56.3        78.5
CLCA (10%)    55.6     87.4        61.1        25.2
CLCA (70%)    67.8     88.3        58.2        50.9

Key Findings:

  • CLCA achieves performance comparable to complete models even at a 10% retention rate
  • On the SoyAgeing dataset, CLCA (10%) improves on the best baseline (CSDNet) by 12.0 percentage points (87.4 vs 75.4)
  • Computational cost is reduced to about 32% of the ViT baseline (25.2 vs 78.5 GFLOPs)
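
The two headline figures follow directly from the table values; a short check, with variable names chosen here for illustration:

```python
baseline_gflops = 78.5  # ViT / DeiT / CSDNet rows
clca_gflops = 25.2      # CLCA at 10% keep rate

cost_ratio = clca_gflops / baseline_gflops  # fraction of the baseline cost
speedup = baseline_gflops / clca_gflops     # FLOPs reduction factor

# Gap on SoyAgeing between CLCA (10%) and the best baseline (CSDNet),
# measured in percentage points of top-1 accuracy.
soyageing_gain = 87.4 - 75.4

print(f"cost ratio: {cost_ratio:.1%}, reduction factor: {speedup:.2f}x")
print(f"SoyAgeing gain: {soyageing_gain:.1f} points")
```

Note that the SoyAgeing gain is an absolute difference in accuracy, not a relative improvement.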

Ablation Studies

Gradient analysis validates CLCA effectiveness:

  • Training Stability: CLCA significantly improves gradient stability, reducing oscillations during training
  • Feature Reuse: Cross-layer connections promote feature reuse, similar to skip connections in ResNet
  • Implicit Deep Supervision: Direct utilization of intermediate layer features provides implicit deep supervision

Generalization Verification

Experiments across different token reduction methods demonstrate:

  • CLCA compatibility with 7 different token reduction paradigms
  • Performance improvements across 9 different pretrained backbone networks
  • Consistent performance gains across different retention rates (25%, 50%, 70%)

Related Work

Fine-Grained Image Recognition

  • Traditional FGIR: Primarily handles species-level classification
  • Ultra-Fine-Grained Recognition: Extends to sub-category classification within species, such as plant varieties
  • ViT in FGIR: Global receptive field advantages but high computational costs

Token Reduction Techniques

  • Token Pruning: Discards unimportant tokens based on importance scores
  • Token Merging: Merges multiple tokens into one, reducing sequence length
  • Existing Limitations: Inevitably loses discriminative information, especially at low retention rates
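
The difference between pruning and merging can be illustrated with a generic sketch. It does not reproduce any of the seven methods evaluated in the paper: pruning here is a simple top-k selection by a given importance score, and merging simply averages adjacent pairs, whereas real merging methods choose which tokens to combine by similarity. All function names are hypothetical.

```python
import math

def prune_tokens(tokens, scores, keep_rate):
    """Keep the ceil(keep_rate * N) highest-scoring tokens, in original order."""
    keep = max(1, math.ceil(keep_rate * len(tokens)))
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:keep]
    return [tokens[i] for i in sorted(top)]

def merge_pairs(tokens):
    """Average each adjacent pair of tokens; an unpaired last token is kept."""
    merged = []
    for i in range(0, len(tokens) - 1, 2):
        a, b = tokens[i], tokens[i + 1]
        merged.append([(x + y) / 2 for x, y in zip(a, b)])
    if len(tokens) % 2:
        merged.append(tokens[-1])
    return merged

tokens = [[1.0], [2.0], [3.0], [4.0]]
scores = [0.1, 0.9, 0.4, 0.8]  # hypothetical importance scores

pruned = prune_tokens(tokens, scores, keep_rate=0.5)
merged = merge_pairs(tokens)
```

Pruning discards the low-scoring tokens outright, while merging keeps a blurred trace of every token; in both cases fine detail is lost as the sequence shrinks, which is the gap the cache mechanism targets.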

Conclusions and Discussion

Main Conclusions

  1. Efficiency Improvement: CLCA maintains competitive performance at extremely low token retention rates (10%)
  2. Generalizability: Method is compatible with multiple token reduction techniques and backbone networks
  3. Practical Value: Provides effective solutions for fine-grained recognition in resource-constrained environments

Limitations

  1. Additional Storage Overhead: CLC mechanism requires extra memory for storing intermediate features
  2. Hyperparameter Sensitivity: Caching strategies and aggregation methods may require task-specific tuning
  3. Dataset Limitations: Primarily validated on leaf datasets; generalization to other fine-grained domains requires further verification

Future Directions

  1. Adaptive Caching Strategies: Dynamically adjust cache content and timing based on task characteristics
  2. More Efficient Aggregation Mechanisms: Explore lightweight cross-layer feature fusion methods
  3. Multi-modal Extension: Extend methods to multi-modal fine-grained recognition tasks

In-Depth Evaluation

Strengths

  1. Strong Innovation: First systematic solution to information loss in token reduction
  2. Comprehensive Experiments: Over 2000 experiments covering multiple dimensions with credible results
  3. High Practical Value: Plug-and-play design facilitates practical application
  4. Solid Theoretical Foundation: Effectiveness explained from gradient optimization and feature reuse perspectives

Weaknesses

  1. Storage Overhead: CLC mechanism increases memory usage, potentially offsetting efficiency gains
  2. Complexity: Introduces additional hyperparameters and design choices
  3. Domain Specificity: Primarily validated on agriculture-related leaf recognition; limited generalization

Impact

  1. Academic Value: Provides new insights and solutions for the token reduction field
  2. Practical Significance: Important for resource-constrained edge computing and mobile applications
  3. Reproducibility: Complete code implementation provided for subsequent research

Applicable Scenarios

  1. Edge Computing: Mobile devices and embedded systems with limited computational resources
  2. Real-time Applications: Fine-grained recognition tasks requiring rapid response
  3. Large-scale Deployment: Agricultural monitoring systems requiring deployment across numerous devices
  4. Research Platform: Enhancement module for other token reduction methods

References

This paper cites 32 important references covering classical works in fine-grained recognition, Vision Transformers, and token reduction, providing a solid theoretical foundation for the research.