
Cross-Layer Cache Aggregation for Token Reduction in Ultra-Fine-Grained Image Recognition

Rios, Yuanda, Ghanz et al.

Ultra-fine-grained image recognition (UFGIR) is a challenging task that involves classifying images within a macro-category. While traditional FGIR deals with classifying different species, UFGIR goes beyond by classifying sub-categories within a species, such as cultivars of a plant. In recent times, the usage of Vision Transformer-based backbones has allowed methods to obtain outstanding recognition performance in this task, but this comes at a significant computational cost, especially since this task significantly benefits from incorporating higher-resolution images. Therefore, techniques such as token reduction have emerged to reduce the computational cost. However, dropping tokens leads to loss of essential information for fine-grained categories, especially as the token keep rate is reduced. To counteract the loss of information brought by the usage of token reduction, we propose a novel Cross-Layer Aggregation Classification Head and a Cross-Layer Cache mechanism to recover and access information from previous layers in later locations. Extensive experiments covering more than 2000 runs across diverse settings, including 5 datasets, 9 backbones, 7 token reduction methods, 5 keep rates, and 2 image sizes, demonstrate the effectiveness of the proposed plug-and-play modules and allow us to push the boundaries of accuracy vs. cost for UFGIR by reducing the kept tokens to extremely low ratios of up to 10% while maintaining accuracy competitive with state-of-the-art models. Code is available at: https://github.com/arkel23/CLCA


Cross-Layer Cache Aggregation for Token Reduction in Ultra-Fine-Grained Image Recognition

Basic Information

Paper ID: 2501.00243
Title: Cross-Layer Cache Aggregation for Token Reduction in Ultra-Fine-Grained Image Recognition
Authors: Edwin Arkel Rios, Jansen Christopher Yuanda, Vincent Leon Ghanz, Cheng-Wei Yu, Bo-Cheng Lai, Min-Chun Hu
Category: cs.CV
Publication Date: December 31, 2024
Paper Link: https://arxiv.org/abs/2501.00243
Code Link: https://github.com/arkel23/CLCA

Abstract

This paper addresses the computational efficiency challenges in ultra-fine-grained image recognition (UFGIR) tasks by proposing a novel Cross-Layer Cache Aggregation (CLCA) method. UFGIR is an extremely challenging task requiring classification within macro-categories, such as plant variety identification. While Vision Transformer-based methods achieve excellent performance on this task, they incur significantly increased computational costs. To address information loss during token reduction, this paper proposes a Cross-Layer Aggregation (CLA) classification head and Cross-Layer Cache (CLC) mechanism. Validated through over 2000 experiments, the method maintains accuracy comparable to existing state-of-the-art approaches even in extreme cases where token retention drops to 10%.

Research Background and Motivation

Problem Definition

Core Problem: Computational efficiency of Vision Transformers in ultra-fine-grained image recognition (UFGIR)
Task Characteristics: UFGIR is more challenging than traditional fine-grained recognition, requiring distinction of sub-categories within the same species (e.g., plant varieties)
Existing Challenges:
- ViT demonstrates superior performance on FGIR tasks but has computational complexity of O(N²), or even O(N³), in the number of tokens N
- High-resolution images are crucial for fine-grained recognition but further increase computational burden
- Token reduction techniques reduce computational costs but inevitably lead to loss of critical discriminative information

Research Motivation

Existing token reduction methods inevitably lose information critical for fine-grained classification while reducing computational costs. This information loss becomes more severe as token retention rates decrease, affecting model classification performance.

Core Contributions

Proposes Cross-Layer Aggregation (CLA) Classification Head: Directly integrates features from intermediate Transformer layers into the classification module, providing richer discriminative information
Designs Cross-Layer Cache (CLC) Mechanism: Stores and restores critical information from previous layers, compensating for information loss during token reduction
Constructs Plug-and-Play CLCA Framework: A complete method combining CLA and CLC that is compatible with multiple token reduction techniques
Comprehensive Experimental Validation: Over 2000 experiments across 5 datasets, 9 backbone networks, and 7 token reduction methods, demonstrating method effectiveness and generalizability

Method Details

Model Architecture

1. Vision Transformer Encoder Groups

Divides an image of size S₁×S₂ into patches of size P×P, flattened into a sequence of length N=(S₁/P)×(S₂/P)
Adds learnable CLS token and positional encoding
Partitions L transformer encoder layers into g groups, each containing multi-head self-attention (MHSA) and position-wise feed-forward networks (PWFFN)
Applies token reduction operations at the final layer of each group
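
As a worked illustration of the patch arithmetic above, the sketch below computes the sequence length N for the two image sizes used in the experiments. The patch size P=16 is assumed from the ViT B-16 backbone named under Implementation Details, and the function name is illustrative only.

```python
def sequence_length(height: int, width: int, patch: int) -> int:
    """Number of patch tokens N = (S1 / P) * (S2 / P), excluding the CLS token."""
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    return (height // patch) * (width // patch)


n_224 = sequence_length(224, 224, 16)   # 14 * 14 patches
n_448 = sequence_length(448, 448, 16)   # 28 * 28 patches

# The self-attention term grows with N^2, so doubling the image side
# quadruples N and multiplies that term by 16.
attention_ratio = (n_448 ** 2) / (n_224 ** 2)

print(n_224, n_448, attention_ratio)  # 196 784 16.0
```

This is why the higher-resolution setting makes token reduction attractive: the attention term alone becomes 16 times more expensive.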

2. Cross-Layer Aggregation (CLA) Classification Head

The CLA head core design includes:

Input: CLS token outputs from each encoder group
1. Feature concatenation and reshaping: CLS ∈ R^(D×g)
2. Batch normalization processing
3. Depthwise convolution aggregation: Agg = DWConv(BN([CLS_G1; CLS_G2; ...; CLS_Gg]))
4. Non-linear activation: Models complex relationships through BatchNorm and GELU
5. Pointwise convolution classification: preds = PWConv(GELU(BN(Agg)))
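
The five steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the depthwise kernel is assumed to span all g groups (kernel size g), batch normalization is treated as identity, and all names (`cla_head`, `dw_kernels`, `pw_weights`) are hypothetical.

```python
import math
import random


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def cla_head(cls_tokens, dw_kernels, pw_weights, pw_bias):
    """Aggregate g CLS tokens (each of dim D) into C class logits.

    cls_tokens: g lists of length D, one CLS token per encoder group.
    dw_kernels: D lists of length g (one kernel per channel; size g is an assumption).
    pw_weights: C lists of length D; pw_bias: list of length C.
    Batch normalization is omitted (treated as identity) for brevity.
    """
    g, d = len(cls_tokens), len(cls_tokens[0])
    # Steps 1-3: view the tokens as a D x g map and convolve each channel on its own.
    agg = [sum(dw_kernels[c][j] * cls_tokens[j][c] for j in range(g)) for c in range(d)]
    # Step 4: non-linearity.
    act = [gelu(a) for a in agg]
    # Step 5: pointwise (1x1) convolution mixes channels into class scores.
    return [sum(w * a for w, a in zip(row, act)) + b for row, b in zip(pw_weights, pw_bias)]


random.seed(0)
G, D, C = 4, 8, 3  # groups, embedding dim, classes (toy sizes)
cls_tokens = [[random.gauss(0, 1) for _ in range(D)] for _ in range(G)]
dw_kernels = [[1.0 / G] * G for _ in range(D)]   # uniform averaging kernels
pw_weights = [[random.gauss(0, 0.1) for _ in range(D)] for _ in range(C)]
pw_bias = [0.0] * C

logits = cla_head(cls_tokens, dw_kernels, pw_weights, pw_bias)
print(len(logits))  # one logit per class
```

Because the depthwise step never mixes channels, each embedding dimension learns its own weighting over the encoder groups; the channel mixing happens only in the final pointwise step.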

3. Cross-Layer Cache (CLC) Mechanism

The CLC workflow:

Caching Phase:

After each transformer encoder block, stores global average pooling (GAP) of local features
Introduces learnable cross-layer register (CLR) tokens that aggregate cross-layer discriminative information
Stores GAP features and CLR tokens in the cache

Recovery Phase:

At the token reduction positions, or before the final layer, recovers the stored information from the CLC
Appends recovered tokens to the original sequence
Clears cache to prevent reuse
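
The caching and recovery phases can be sketched with a small class. This is a toy illustration under assumptions: tokens are plain lists of floats, the CLR token is passed in as a fixed vector rather than learned, and the class and method names are hypothetical.

```python
def global_average_pool(tokens):
    """Mean over the token axis: list of N vectors -> one vector."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]


class CrossLayerCache:
    """Toy cache holding one pooled summary (and optional CLR token) per block."""

    def __init__(self):
        self._entries = []

    def store(self, patch_tokens, clr_token=None):
        # Caching phase: keep a GAP summary of the local (patch) features.
        self._entries.append(global_average_pool(patch_tokens))
        if clr_token is not None:
            self._entries.append(list(clr_token))

    def recover(self, sequence):
        # Recovery phase: append cached tokens, then clear to prevent reuse.
        restored = sequence + self._entries
        self._entries = []
        return restored

    def __len__(self):
        return len(self._entries)


cache = CrossLayerCache()
patches = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # 3 patch tokens, dim 2
clr = [0.5, 0.5]                                  # hypothetical CLR token

cache.store(patches, clr)        # after an encoder block
kept = patches[:1]               # pretend token reduction kept 1 of 3 tokens
restored = cache.recover(kept)   # kept token + GAP summary + CLR token

print(restored)    # [[1.0, 2.0], [3.0, 4.0], [0.5, 0.5]]
print(len(cache))  # 0
```

The recovered sequence grows by only a few tokens per cached block, so the summary of the discarded tokens is carried forward at a small fraction of the cost of keeping them.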

Technical Innovations

Information Preservation Strategy: Preserves critical information lost during token reduction through caching mechanism
Cross-Layer Feature Fusion: Directly integrates features from different depths into classification decisions
Plug-and-Play Design: Seamlessly combines with existing multiple token reduction methods
Gradient Optimization: Skip connection-like design improves training stability

Experimental Setup

Datasets

Uses 5 ultra-fine-grained leaf datasets:

SoyGene: Soybean genotype classification
SoyLocal: Local soybean varieties
SoyAgeing: Soybean aging stages
SoyGlobal: Global soybean varieties
Cotton: Cotton varieties

Each category represents confirmed variety names obtained from genetic resource repositories.

Evaluation Metrics

Primary Metric: Top-1 Accuracy (%)
Efficiency Metric: FLOPs (floating-point operations)
Statistical Method: Average results from 3 random seeds

Comparison Methods

SOTA Methods: ViT, DeiT, TransFG, SIM-Tr, CSDNet

Token Reduction Methods:

Static pruning: DynamicViT
Dynamic pruning: ATS
Soft merging: SiT, PatchMerger
Hard merging: DPCKNN, ToMe
Attention-driven: EViT

Implementation Details

Optimizer: AdamW
Training Epochs: 50
Weight Decay: 0.05
Batch Size: 32
Image Size: 224×224, 448×448
Backbone Networks: 9 pretrained models (ViT, DeiT3, MIIL, MoCov3, DINO, MAE, CLIP, etc.)
Retention Rates: 100%, 70%, 50%, 25%, 10%
Token Reduction Positions: Layers 4, 7, and 10 (12-layer ViT-B/16)
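
To give a feel for how the retention rate and reduction positions interact, the sketch below counts the patch tokens seen by each layer. It assumes the rate is applied to the current token count at each reduction layer, rounding up, and that the reduced count takes effect from that layer onward; the summary does not spell out either detail, so treat both as assumptions.

```python
import math


def tokens_per_layer(num_patches, keep_rate, reduction_layers, num_layers=12):
    """Patch-token count seen by each layer (1-indexed), excluding CLS.

    Assumes the keep rate is applied to the current count at every
    reduction layer, rounding up (an assumption, not taken from the paper).
    """
    counts, current = [], num_patches
    for layer in range(1, num_layers + 1):
        if layer in reduction_layers:
            current = math.ceil(current * keep_rate)
        counts.append(current)
    return counts


counts = tokens_per_layer(196, 0.7, reduction_layers=(4, 7, 10))
print(counts)  # [196, 196, 196, 138, 138, 138, 97, 97, 97, 68, 68, 68]
```

Under these assumptions, a 70% rate at three positions leaves roughly a third of the original 196 patch tokens in the last layers.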

Experimental Results

Main Results

Method	Cotton	SoyAgeing	SoyGlobal	FLOPs (10⁹)
ViT	52.5	67.0	40.6	78.5
DeiT	54.2	69.5	45.3	78.5
TransFG	54.6	72.2	21.2	447.9
CSDNet	57.9	75.4	56.3	78.5
CLCA (10%)	55.6	87.4	61.1	25.2
CLCA (70%)	67.8	88.3	58.2	50.9

Key Findings:

CLCA achieves accuracy comparable to full-token models even at a 10% retention rate
On the SoyAgeing dataset, CLCA (10%) improves on the best baseline (CSDNet) by 12.0 percentage points (87.4 vs. 75.4)
Computational cost is reduced to about 32% of the ViT baseline (25.2 vs. 78.5 GFLOPs)
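
These headline figures can be checked directly against the results table. The snippet below only re-derives the ratios from the tabulated numbers; the dictionary layout is illustrative.

```python
# Figures copied from the results table above (FLOPs in units of 10^9).
flops = {"ViT": 78.5, "CLCA (10%)": 25.2, "CLCA (70%)": 50.9}
soyageing = {"CSDNet": 75.4, "CLCA (10%)": 87.4}

relative_cost_10 = flops["CLCA (10%)"] / flops["ViT"]
relative_cost_70 = flops["CLCA (70%)"] / flops["ViT"]
gain_points = soyageing["CLCA (10%)"] - soyageing["CSDNet"]

print(f"{relative_cost_10:.1%}")  # 32.1%
print(f"{relative_cost_70:.1%}")  # 64.8%
print(f"{gain_points:.1f}")       # 12.0
```

The SoyAgeing gain is a difference of 12.0 percentage points in top-1 accuracy, not a 12% relative improvement.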

Ablation Studies

Gradient analysis validates CLCA effectiveness:

Training Stability: CLCA significantly improves gradient stability, reducing oscillations during training
Feature Reuse: Cross-layer connections promote feature reuse, similar to skip connections in ResNet
Implicit Deep Supervision: Direct utilization of intermediate layer features provides implicit deep supervision

Generalization Verification

Experiments across different token reduction methods demonstrate:

CLCA compatibility with 7 different token reduction paradigms
Performance improvements across 9 different pretrained backbone networks
Consistent performance gains across different retention rates (25%, 50%, 70%)

Fine-Grained Image Recognition

Traditional FGIR: Primarily handles species-level classification
Ultra-Fine-Grained Recognition: Extends to sub-category classification within species, such as plant varieties
ViT in FGIR: Global receptive field advantages but high computational costs

Token Reduction Techniques

Token Pruning: Discards unimportant tokens based on importance scores
Token Merging: Merges multiple tokens into one, reducing sequence length
Existing Limitations: Inevitably loses discriminative information, especially at low retention rates
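
The difference between the two families can be shown with a generic sketch. It does not reproduce any of the seven evaluated methods: the importance scores are made up, and merging by adjacency is a deliberate simplification (methods such as ToMe choose merge partners by similarity instead).

```python
def prune_tokens(tokens, scores, keep):
    """Token pruning: keep the `keep` highest-scoring tokens, preserving order."""
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:keep]
    return [tokens[i] for i in sorted(top)]


def merge_adjacent(tokens):
    """Token merging (naive): average each pair of neighbouring tokens."""
    merged = []
    for i in range(0, len(tokens) - 1, 2):
        merged.append([(a + b) / 2 for a, b in zip(tokens[i], tokens[i + 1])])
    if len(tokens) % 2:            # an odd token out is carried over unchanged
        merged.append(tokens[-1])
    return merged


tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
scores = [0.1, 0.7, 0.9, 0.3]      # hypothetical importance scores

print(prune_tokens(tokens, scores, keep=2))  # [[0.0, 1.0], [2.0, 2.0]]
print(merge_adjacent(tokens))                # [[0.5, 0.5], [3.0, 1.0]]
```

Both halve the sequence here, but pruning discards two tokens outright while merging blurs all four, which is the information loss the cache mechanism is meant to offset.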

Conclusions and Discussion

Main Conclusions

Efficiency Improvement: CLCA maintains competitive performance at extremely low token retention rates (10%)
Generalizability: Method is compatible with multiple token reduction techniques and backbone networks
Practical Value: Provides effective solutions for fine-grained recognition in resource-constrained environments

Limitations

Additional Storage Overhead: CLC mechanism requires extra memory for storing intermediate features
Hyperparameter Sensitivity: Caching strategies and aggregation methods may require task-specific tuning
Dataset Limitations: Primarily validated on leaf datasets; generalization to other fine-grained domains requires further verification

Future Directions

Adaptive Caching Strategies: Dynamically adjust cache content and timing based on task characteristics
More Efficient Aggregation Mechanisms: Explore lightweight cross-layer feature fusion methods
Multi-modal Extension: Extend methods to multi-modal fine-grained recognition tasks

In-Depth Evaluation

Strengths

Strong Innovation: First systematic solution to information loss in token reduction
Comprehensive Experiments: Over 2000 experiments covering multiple dimensions with credible results
High Practical Value: Plug-and-play design facilitates practical application
Solid Theoretical Foundation: Effectiveness explained from gradient optimization and feature reuse perspectives

Weaknesses

Storage Overhead: CLC mechanism increases memory usage, potentially offsetting efficiency gains
Complexity: Introduces additional hyperparameters and design choices
Domain Specificity: Primarily validated on agriculture-related leaf recognition; limited generalization

Impact

Academic Value: Provides new insights and solutions for the token reduction field
Practical Significance: Important for resource-constrained edge computing and mobile applications
Reproducibility: Complete code implementation provided for subsequent research

Applicable Scenarios

Edge Computing: Mobile devices and embedded systems with limited computational resources
Real-time Applications: Fine-grained recognition tasks requiring rapid response
Large-scale Deployment: Agricultural monitoring systems requiring deployment across numerous devices
Research Platform: Enhancement module for other token reduction methods

References

This paper cites 32 important references covering classical works in fine-grained recognition, Vision Transformers, and token reduction, providing a solid theoretical foundation for the research.
