Token Pruning for Caching Better: 9 Times Acceleration on Stable Diffusion for Free

Zhang, Xiao, Tang et al.

Stable Diffusion has achieved remarkable success in the field of text-to-image generation, with its powerful generative capabilities and diverse generation results making a lasting impact. However, its iterative denoising introduces high computational costs and slows generation speed, limiting broader adoption. The community has made numerous efforts to reduce this computational burden, with methods like feature caching attracting attention due to their effectiveness and simplicity. Nonetheless, simply reusing features computed at previous timesteps causes the features across adjacent timesteps to become similar, reducing the dynamics of features over time and ultimately compromising the quality of generated images. In this paper, we introduce a dynamics-aware token pruning (DaTo) approach that addresses the limitations of feature caching. DaTo selectively prunes tokens with lower dynamics, allowing only high-dynamic tokens to participate in self-attention layers, thereby extending feature dynamics across timesteps. DaTo combines feature caching with token pruning in a training-free manner, achieving both temporal and token-wise information reuse. Applied to Stable Diffusion on ImageNet, our approach delivered a 9$\times$ speedup while reducing FID by 0.33, indicating enhanced image quality. On COCO-30k, we observed a 7$\times$ acceleration coupled with a notable FID reduction of 2.17.

Basic Information

Paper ID: 2501.00375
Title: Token Pruning for Caching Better: 9× Acceleration on Stable Diffusion for Free
Authors: Evelyn Zhang, Bang Xiao, Jiayi Tang, Qianli Ma, Chang Zou, Xuefei Ning, Xuming Hu, Linfeng Zhang
Categories: cs.CV (Computer Vision), cs.LG (Machine Learning)
Publication Date: December 31, 2024
Paper Link: https://arxiv.org/abs/2501.00375
Code Link: github.com/EvelynZhang-epiclab/DaTo

Abstract

Stable Diffusion has achieved remarkable success in text-to-image generation, yet its iterative denoising mechanism incurs substantial computational cost and slow generation. Feature caching methods have gained attention for their effectiveness and simplicity, but naively reusing features computed at previous timesteps makes features at adjacent timesteps similar, which reduces feature dynamics over time and ultimately degrades the quality of generated images. This paper proposes a Dynamics-aware Token pruning (DaTo) method to address this limitation of feature caching. DaTo selectively prunes tokens with lower dynamics, allowing only high-dynamics tokens to participate in self-attention layers, thereby extending feature dynamics across timesteps. Applied to Stable Diffusion on ImageNet, the method achieves 9× acceleration with an FID reduction of 0.33; on COCO-30k, it achieves 7× acceleration with an FID reduction of 2.17.

Research Background and Motivation

Problem Background

Diffusion models have made significant advances in generative modeling with widespread applications in text-to-image generation, video generation, and other tasks. However, the iterative denoising mechanism of diffusion models leads to enormous computational costs and slow generation speeds, limiting their broader applications.

Limitations of Existing Methods

Current approaches for accelerating diffusion models primarily include:

Reducing sampling steps: fast samplers such as DDIM
Reducing computation per step: knowledge distillation, structural pruning, quantization, token pruning, and feature caching

Among these, feature caching is popular for its effectiveness and simplicity: it stores features computed at previous timesteps and reuses them in subsequent timesteps. However, feature reuse forces features across different timesteps to take similar values, which reduces feature dynamics along the temporal dimension, disrupts the original diffusion process, and degrades generation quality.
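
The loss of temporal dynamics under caching can be illustrated with a toy sketch. Everything here is hypothetical: `compute_features` is a stand-in for an expensive network block and `interval` is an assumed fixed cache-refresh period, neither taken from the paper or from DeepCache's code. The point is only that reused features make most adjacent-timestep differences exactly zero.

```python
# Toy illustration only: the "features" are a hypothetical stand-in for a
# U-Net block output, not real Stable Diffusion activations.

def compute_features(t, dim=4):
    """Hypothetical expensive layer: a feature vector that drifts with t."""
    return [0.1 * t + 0.01 * i for i in range(dim)]

def run_with_cache(timesteps, interval):
    """Recompute features every `interval` steps; reuse the cache in between."""
    outputs, cache = [], None
    for step, t in enumerate(timesteps):
        if step % interval == 0:
            cache = compute_features(t)   # full computation, refresh the cache
        outputs.append(cache)             # otherwise reuse the stored features
    return outputs

def adjacent_diffs(outputs):
    """Mean absolute feature change between each pair of consecutive steps."""
    return [
        sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)
        for prev, cur in zip(outputs, outputs[1:])
    ]

timesteps = list(range(50, 0, -1))        # denoising runs from t=50 down to t=1
no_cache = adjacent_diffs(run_with_cache(timesteps, interval=1))
cached = adjacent_diffs(run_with_cache(timesteps, interval=5))
frozen_pairs = sum(1 for d in cached if d == 0.0)
print(f"{frozen_pairs} of {len(cached)} adjacent pairs have zero feature change")
# prints "40 of 49 adjacent pairs have zero feature change"
```

Without caching (`interval=1`) every adjacent pair differs; with caching, only the pairs that straddle a refresh do.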

Research Motivation

Through experimental observations, the paper discovers that compared to the original Stable Diffusion, models using feature caching exhibit significantly reduced feature differences between adjacent timesteps. This raises a critical question: Can feature caching be performed while still maintaining proper feature dynamics?

Core Contributions

Proposes Dynamics-aware Token pruning (DaTo) method: By pruning tokens whose dynamics are reduced by feature caching across different timesteps and recovering them with high-dynamics tokens, it avoids quality degradation caused by feature caching.
Designs evolutionary search strategy: Proposes searching for optimal feature caching and token pruning strategies through evolutionary methods, fully unleashing DaTo's potential.
Achieves significant performance improvements: Extensive experiments on Stable Diffusion and SDXL demonstrate that, without retraining or additional data, up to 9× acceleration can be achieved on Stable Diffusion with no loss in generation quality.

Methodology Details

Task Definition

The task is to significantly accelerate the inference process of the Stable Diffusion model while maintaining image generation quality. The input is a text prompt, the output is the corresponding high-quality image, with the constraint that no model retraining is required.

Model Architecture

1. Dynamics-aware Token Pruning (DaTo)

Base Token Selection:

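Temporal Noise Difference Score: At the t-th timestep, score each token by the mean absolute difference, over the C channels, between the outputs of the two most recently computed timesteps, t+2 and t+1 (denoising proceeds from larger to smaller t):
```
DiffScore = (1/C) * Σ_{c=1..C} |f_up_0(x_{t+2})_c - f_up_0(x_{t+1})_c|
```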
Patch-based Token Selection: Divide the image into non-overlapping s×s patches, selecting the token with the highest DiffScore in each patch as the base token.
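
The two base-token steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: feature maps are represented as nested H×W×C lists, and the function names `diff_score` and `select_base_tokens` are hypothetical.

```python
def diff_score(feat_prev2, feat_prev1):
    """Per-token mean absolute channel difference between two feature maps.

    feat_prev2 / feat_prev1: H x W x C nested lists standing in for the
    outputs at timesteps t+2 and t+1.
    """
    return [
        [
            sum(abs(a - b) for a, b in zip(tok2, tok1)) / len(tok1)
            for tok2, tok1 in zip(row2, row1)
        ]
        for row2, row1 in zip(feat_prev2, feat_prev1)
    ]

def select_base_tokens(scores, s):
    """Return the (row, col) of the highest-scoring token in each s x s patch."""
    height, width = len(scores), len(scores[0])
    base = []
    for top in range(0, height, s):
        for left in range(0, width, s):
            patch = [
                (scores[r][c], r, c)
                for r in range(top, min(top + s, height))
                for c in range(left, min(left + s, width))
            ]
            _, r, c = max(patch, key=lambda item: item[0])
            base.append((r, c))
    return base

# Hypothetical 4x4 map with C=2: four tokens change between the two steps.
H = W = 4
f2 = [[[0.0, 0.0] for _ in range(W)] for _ in range(H)]
f1 = [[[0.0, 0.0] for _ in range(W)] for _ in range(H)]
for r, c in [(0, 1), (1, 2), (3, 0), (2, 3)]:
    f1[r][c] = [1.0, 3.0]                 # DiffScore = (1 + 3) / 2 = 2.0
scores = diff_score(f2, f1)
base_tokens = select_base_tokens(scores, s=2)
print(base_tokens)  # [(0, 1), (1, 2), (3, 0), (2, 3)]
```

Selecting one base token per patch keeps the base set spatially spread out, so every region of the image retains at least one high-dynamics token.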

CFG Alignment: To handle Classifier-Free Guidance (CFG), copy the base token positions from conditional generation to unconditional generation:

X_base,i,j[k] = X_base,i,j[k - B/2], k ∈ {B/2, B/2+1, ..., B-1}
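
A minimal sketch of this index copy follows, assuming the batch of size B stacks the two CFG branches as two equal halves. The function name and the list-of-lists layout are illustrative assumptions, and which half holds the conditional branch depends on the pipeline's batch layout; the code simply follows the index formula above.

```python
def align_cfg_base_tokens(base_indices):
    """Copy base-token positions from the first half of the batch to the second.

    base_indices: list of length B (B even); entry k holds the base-token
    positions chosen for batch element k.
    """
    batch = len(base_indices)
    if batch % 2 != 0:
        raise ValueError("CFG batches are assumed to hold two equal halves")
    half = batch // 2
    aligned = [list(idx) for idx in base_indices]
    for k in range(half, batch):
        aligned[k] = list(base_indices[k - half])   # X_base[k] = X_base[k - B/2]
    return aligned

example = [[0, 5], [2, 7], [1, 4], [3, 6]]          # B = 4, hypothetical indices
aligned = align_cfg_base_tokens(example)
print(aligned)  # [[0, 5], [2, 7], [0, 5], [2, 7]]
```

After alignment, both branches prune and recover the same spatial positions, so the guided combination of the two noise predictions compares like with like.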

Pruning Token Selection: Select K tokens most similar to base tokens based on cosine similarity for pruning:

X_prune = arg topK_i max_j Cosine_Similarity(X_i, X_j), where X_j ranges over the base tokens and X_i over the remaining tokens

Pruning Token Recovery: Recover pruned tokens by directly copying their most similar base tokens.
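
Pruning selection and recovery can be sketched as follows. This is an illustrative sketch rather than the paper's code: tokens are plain Python lists, the helper names are hypothetical, and the self-attention output of the kept tokens is faked by doubling each token.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def select_pruned(tokens, base_ids, k):
    """Pick the k non-base tokens most similar to any base token.

    Returns {pruned_token_id: id_of_its_most_similar_base_token}.
    """
    base_set = set(base_ids)
    best = []
    for i, tok in enumerate(tokens):
        if i in base_set:
            continue
        sim, j = max((cosine(tok, tokens[b]), b) for b in base_ids)
        best.append((sim, i, j))
    best.sort(key=lambda item: item[0], reverse=True)
    return {i: j for _, i, j in best[:k]}

def recover(kept_output, mapping, num_tokens):
    """Rebuild the full sequence by copying each pruned token's base output."""
    full = [None] * num_tokens
    for i, out in kept_output.items():
        full[i] = out
    for i, j in mapping.items():
        full[i] = kept_output[j]
    return full

tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 1.0]]
mapping = select_pruned(tokens, base_ids=[0, 2], k=2)
kept = [i for i in range(len(tokens)) if i not in mapping]
# Stand-in for self-attention applied only to the kept tokens.
kept_output = {i: [2 * x for x in tokens[i]] for i in kept}
full = recover(kept_output, mapping, len(tokens))
print(mapping)  # {1: 0, 3: 2}
```

Only the kept tokens pass through the expensive layer; each pruned position is then filled with the output of its most similar base token.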

2. Timestep-aware Feature Caching

Search Space Pruning:

Cache depth d restricted to {0, 1, 1/2}
Pruning ratio r restricted to {0.3, 0.4, 0.5, 0.6, 0.7}

Evolutionary Search Algorithm: Uses the NSGA-II multi-objective optimization algorithm, with optimization objectives including:

Inference latency
Generation quality (FID)

The search process includes standard evolutionary operations such as selection, crossover, and mutation, ultimately obtaining the optimal timestep-aware strategy F(t).
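
The search can be pictured with a small sketch of the strategy encoding, a mutation operator, and Pareto filtering on (latency, FID). It is not a full NSGA-II (there is no crossover, rank sorting, or crowding distance) and not the authors' code. The per-timestep tuple encoding, the mutation rate, and the objective values in the example are assumptions made up for illustration; the cache-depth values are treated as opaque choices.

```python
import random

CACHE_DEPTHS = [0, 1, 0.5]                 # d in {0, 1, 1/2}, as listed above
PRUNE_RATIOS = [0.3, 0.4, 0.5, 0.6, 0.7]   # r, as listed above

def random_strategy(num_steps, rng):
    """One candidate F(t): a (cache depth, pruning ratio) pair per timestep."""
    return [(rng.choice(CACHE_DEPTHS), rng.choice(PRUNE_RATIOS))
            for _ in range(num_steps)]

def mutate(strategy, rng, rate=0.1):
    """Resample each timestep's setting with probability `rate`."""
    return [
        (rng.choice(CACHE_DEPTHS), rng.choice(PRUNE_RATIOS))
        if rng.random() < rate else gene
        for gene in strategy
    ]

def dominates(a, b):
    """a, b are (latency, fid) tuples; lower is better on both objectives."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(scored):
    """Keep candidates whose objectives are not dominated by any other."""
    return [
        (cand, obj) for cand, obj in scored
        if not any(dominates(other, obj) for _, other in scored)
    ]

rng = random.Random(0)
strategy = random_strategy(50, rng)
child = mutate(strategy, rng)

# Hypothetical (latency, FID) measurements for four candidates.
scored = [("A", (1.0, 30.0)), ("B", (2.0, 28.0)),
          ("C", (2.5, 29.0)), ("D", (3.0, 27.5))]
front_names = [name for name, _ in pareto_front(scored)]
print(front_names)  # ['A', 'B', 'D']
```

Candidate C is dropped because B is both faster and better in FID; the remaining candidates trade latency against quality, which is the set a multi-objective search keeps refining.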

Technical Innovations

Dynamics Recovery Mechanism: By selectively pruning low-dynamics tokens and recovering them from high-dynamics tokens, DaTo restores the feature dynamics distribution disrupted by feature caching.
Unified Caching-Pruning Framework: Combines feature caching and token pruning in a training-free framework, achieving information reuse at both temporal and token levels.
Adaptive Strategy Search: Because redundancy differs across timesteps, an automatic search selects the cache depth and pruning ratio for each timestep.

Experimental Setup

Datasets

ImageNet-1k: Generate 2,000 images at 512×512 (2 per class)
COCO-30k: Generate 30,000 images (1 per caption)
MS COCO Validation Set: For SDXL evaluation, generate 5k 1024×1024 images

Evaluation Metrics

FID (Fréchet Inception Distance): Measures generation quality
CLIP Score: Evaluates text-image alignment
Inception Score: Image quality assessment
Latency and Acceleration Ratio: Efficiency evaluation

Comparison Methods

DDIM/DPM: Fast samplers
ToMeSD: Token merging method
DeepCache: Feature caching method
DeepCache & ToMeSD: Naive combination method

Implementation Details

NSGA-II evolutionary algorithm with population size 20, running 100 generations
CFG scale: 7.5 (SD v1.5), 9.0 (SD v2), 7.0 (SDXL)
Sampling: 50 PLMS steps
Hardware: tested on a single RTX 4090 GPU

Experimental Results

Main Results

Stable Diffusion v1.5 (ImageNet):

Configuration e1: 9.01× acceleration, FID reduced from 27.64 to 27.31
Outperforms comparison methods across all configurations

Stable Diffusion v2 (ImageNet):

Configuration e2: 7.25× acceleration, with FID reduced from 29.8 (original model) to 28.20

COCO-30k Dataset:

SD v1.5: 7× acceleration, FID reduced from 12.15 to 9.98 (reduction of 2.17)
SD v2: 7.25× acceleration, FID rises slightly from 13.68 to 13.88

SDXL (MS COCO):

2.32× acceleration, FID reduced from 24.25 to 23.10
Significantly outperforms DeepCache (1.75×) and DeepCache&ToMeSD (1.78×)
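
As a quick arithmetic check, the snippet below recomputes the FID changes and the relative SDXL speedups from the figures reported above. The numbers are copied from this summary and nothing is measured here; the dictionary layout and variable names are illustrative only.

```python
# (baseline FID, DaTo FID) pairs as reported in the results above.
reported_fid = {
    "SD v1.5 / ImageNet": (27.64, 27.31),
    "SD v1.5 / COCO-30k": (12.15, 9.98),
    "SD v2 / COCO-30k": (13.68, 13.88),
    "SDXL / MS COCO": (24.25, 23.10),
}

# Positive values mean FID went down (better); negative means it went up.
fid_reduction = {
    name: round(base - dato, 2) for name, (base, dato) in reported_fid.items()
}

# SDXL speedups over the original model, as reported above.
sdxl_speedup = {"DaTo": 2.32, "DeepCache": 1.75, "DeepCache & ToMeSD": 1.78}
relative_to_dato = {
    name: round(sdxl_speedup["DaTo"] / value, 2)
    for name, value in sdxl_speedup.items() if name != "DaTo"
}

for name, delta in fid_reduction.items():
    print(f"{name}: FID reduction {delta:+.2f}")
print(relative_to_dato)
```

The ImageNet and COCO-30k reductions for SD v1.5 match the 0.33 and 2.17 quoted in the abstract, while the SD v2 COCO-30k entry comes out negative, meaning a small FID increase.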

Ablation Studies

Effectiveness of DiffScore: Using DiffScore consistently improves FID scores across different cache settings and pruning ratios, validating the effectiveness of the temporal noise difference score.

Impact of CFG Alignment: The benefit of CFG alignment grows with the pruning ratio; at a high pruning ratio (0.7), it improves FID by 13 to 30 points.

Case Analysis

Visual comparison results demonstrate DaTo's excellence in multiple aspects:

Content Fidelity: Highly similar to original image content
Detail Preservation: Maintains fine textures in high-detail scenes
Style Adaptation: Balances content preservation and style accuracy in image-to-image tasks
Prompt Alignment: Accurately generates all elements from complex text prompts

Experimental Findings

Feature Dynamics Recovery: DaTo successfully restores feature difference distribution to levels close to original Stable Diffusion
Sparse Coding Effect: Moderate token pruning and feature caching can improve model performance by focusing on critical features
Strategy Generalization: Strategies searched on SD v1.5 perform well on SDXL and other datasets

Related Work

Efficient Diffusion Models

Sampling Step Reduction: DDIM, consistency models, etc.
Network Compression: Quantization, pruning, distillation, etc.
Architecture Optimization: U-Net improvements, Transformer optimization, etc.

Token Reduction Strategies

Learning Methods: DynamicViT, A-ViT, etc. using auxiliary models for ranking and pruning
Heuristic Methods: Token Pooling, Token Merging, etc. training-free methods
Diffusion Model Applications: ToMeSD, AT-EDM, etc. adapted for generative tasks

Caching Mechanisms

U-Net Caching: DeepCache leverages temporal redundancy for feature caching
DiT Caching: Δ-DiT caching strategy for Diffusion Transformers
Optimization Challenges: Balancing efficiency gains with generation quality preservation

Conclusions and Discussion

Main Conclusions

DaTo successfully addresses the feature dynamics loss problem caused by feature caching
Adaptive strategies obtained through evolutionary search significantly outperform fixed configurations
The method achieves substantial acceleration across multiple models and datasets, with FID improving in most reported settings

Limitations

Search Cost: The evolutionary search takes up to 20 GPU hours, a modest but nonzero additional computational cost
Hardware Dependency: Performance improvements may vary with hardware configurations
Limitations in Extreme Settings: Excessive pruning ratios or very low cache update frequencies degrade performance

Future Directions

Adaptive Strategy Learning: Develop more intelligent adaptive caching and pruning strategies
Architecture Adaptation: Extend to more diffusion model architectures
Theoretical Analysis: Deeper understanding of sparse coding principles in diffusion models

In-Depth Evaluation

Strengths

Strong Innovation: First systematic solution to the feature dynamics loss problem in feature caching
Practical Method: Training-free, easy to deploy and integrate
Comprehensive Experiments: Full evaluation across multiple models and datasets
Theoretical Support: Provides theoretical explanation through sparse coding
Open Source Friendly: Provides complete code implementation

Weaknesses

Insufficient Theoretical Analysis: Relatively simple theoretical explanation for why the method improves FID
Search Algorithm Dependency: Requires evolutionary search to find optimal strategies, increasing usage complexity
Limited Evaluation Metrics: Primarily relies on FID for quality assessment, lacking more diverse quality indicators
Missing User Studies: No human evaluation to validate generation quality

Impact

Academic Value: Provides new perspectives and methods for diffusion model acceleration
Practical Value: Directly applicable to existing Stable Diffusion models
Reproducibility: Provides detailed implementation details and open-source code
Inspirational: Exemplifies token-level optimization applications in generative models

Applicable Scenarios

Resource-Constrained Environments: Mobile devices, edge computing scenarios
Real-time Applications: Interactive applications requiring fast generation
Batch Generation: Large-scale image generation tasks
Research Prototypes: Research projects requiring rapid iteration

References

The paper cites 46 related references covering multiple relevant domains including diffusion models, token reduction, and caching mechanisms, providing solid theoretical foundation and comparison benchmarks for this research.

Overall Assessment: This is a high-quality computer vision paper that proposes an innovative solution to the important problem of diffusion model acceleration. The method design is ingenious, experimental evaluation is comprehensive, and practical value is prominent. Although theoretical analysis depth could be improved, its practical contributions and impact are noteworthy.