Token Pruning for Caching Better: 9 Times Acceleration on Stable Diffusion for Free
Zhang, Xiao, Tang et al.
Stable Diffusion has achieved remarkable success in the field of text-to-image generation, with its powerful generative capabilities and diverse generation results making a lasting impact. However, its iterative denoising introduces high computational costs and slows generation speed, limiting broader adoption. The community has made numerous efforts to reduce this computational burden, with methods like feature caching attracting attention due to their effectiveness and simplicity. Nonetheless, simply reusing features computed at previous timesteps causes the features across adjacent timesteps to become similar, reducing the dynamics of features over time and ultimately compromising the quality of generated images. In this paper, we introduce a dynamics-aware token pruning (DaTo) approach that addresses the limitations of feature caching. DaTo selectively prunes tokens with lower dynamics, allowing only high-dynamics tokens to participate in self-attention layers, thereby extending feature dynamics across timesteps. DaTo combines feature caching with token pruning in a training-free manner, achieving both temporal and token-wise information reuse. Applied to Stable Diffusion on ImageNet, our approach delivered a 9$\times$ speedup while reducing FID by 0.33, indicating enhanced image quality. On COCO-30k, we observed a 7$\times$ acceleration coupled with a notable FID reduction of 2.17.
Stable Diffusion has achieved remarkable success in text-to-image generation, yet its iterative denoising mechanism incurs substantial computational cost and slow generation. Although feature caching methods have gained attention for their effectiveness and simplicity, naively reusing features computed at previous timesteps makes features at adjacent timesteps similar, which reduces feature dynamics over time and ultimately degrades the quality of generated images. This paper proposes a Dynamics-aware Token pruning (DaTo) method to address this limitation of feature caching. DaTo selectively prunes tokens with lower dynamics, allowing only high-dynamics tokens to participate in self-attention layers, thereby extending feature dynamics across timesteps. Applied to Stable Diffusion on ImageNet, the method achieves 9× acceleration with an FID reduction of 0.33; on COCO-30k, it achieves 7× acceleration with an FID reduction of 2.17.
Diffusion models have made significant advances in generative modeling with widespread applications in text-to-image generation, video generation, and other tasks. However, the iterative denoising mechanism of diffusion models leads to enormous computational costs and slow generation speeds, limiting their broader applications.
Current approaches for accelerating diffusion models fall into two main groups:
Reducing sampling steps: for example, fast samplers such as DDIM.
Reducing computation per step: for example, knowledge distillation, structural pruning, quantization, token pruning, and feature caching.
Among these, feature caching is widely popular due to its effectiveness and simplicity, storing features computed at previous timesteps and reusing them in subsequent timesteps. However, feature reuse forces features across different timesteps to have similar values, reducing feature dynamics along the temporal dimension, damaging the original diffusion process and degrading generation quality.
Through experimental observations, the paper discovers that compared to the original Stable Diffusion, models using feature caching exhibit significantly reduced feature differences between adjacent timesteps. This raises a critical question: Can feature caching be performed while still maintaining proper feature dynamics?
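The observation above can be illustrated with a small toy sketch. It is not the paper's measurement code: `toy_features`, `run_with_cache`, and `cache_interval` are hypothetical names, and the "features" are plain lists standing in for a network block's output. In this simplified setting every cached step reuses the stored features exactly, so the difference between adjacent timesteps drops to zero on those steps; in a real model only part of the network is typically cached, so the differences would shrink rather than vanish.

```python
def adjacent_differences(features):
    """Mean absolute difference between features at consecutive timesteps."""
    diffs = []
    for prev, curr in zip(features, features[1:]):
        diffs.append(sum(abs(c - p) for p, c in zip(prev, curr)) / len(curr))
    return diffs


def run_with_cache(compute, num_steps, cache_interval):
    """Recompute features every `cache_interval` steps, otherwise reuse the cache."""
    outputs, cached = [], None
    for t in range(num_steps):
        if t % cache_interval == 0:
            cached = compute(t)       # full computation refreshes the cache
        outputs.append(list(cached))  # other steps reuse the stored features
    return outputs


def toy_features(t):
    # Hypothetical stand-in for a network block: four token features drifting over time.
    return [0.1 * t * (i + 1) for i in range(4)]


full = run_with_cache(toy_features, num_steps=6, cache_interval=1)
cached = run_with_cache(toy_features, num_steps=6, cache_interval=3)
full_diffs = adjacent_differences(full)      # non-zero at every step
cached_diffs = adjacent_differences(cached)  # zero wherever the cache was reused
```

Comparing `full_diffs` with `cached_diffs` shows the loss of step-to-step variation that the paper refers to as reduced feature dynamics.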
Dynamics-aware Token pruning (DaTo): The paper prunes tokens whose dynamics have been reduced by feature caching and recovers them from high-dynamics tokens, which avoids the quality degradation that feature caching otherwise causes.
Evolutionary search strategy: The paper searches for the best combination of feature caching and token pruning settings with an evolutionary method, so that DaTo can reach its full potential.
Significant performance improvements: Extensive experiments on Stable Diffusion and SDXL show that, without retraining or additional data, DaTo achieves up to 9× acceleration on Stable Diffusion with no loss in generation quality.
The task is to significantly accelerate the inference process of the Stable Diffusion model while maintaining image generation quality. The input is a text prompt, the output is the corresponding high-quality image, with the constraint that no model retraining is required.
Patch-based Token Selection: Divide the image into non-overlapping s×s patches, selecting the token with the highest DiffScore in each patch as the base token.
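A minimal sketch of the patch-based selection step, under stated assumptions: the score map is passed in as a plain 2D list standing in for DiffScore (whose exact definition is not reproduced here), the grid height and width are assumed to be divisible by `s`, and `select_base_tokens` is a hypothetical name rather than the authors' API.

```python
def select_base_tokens(scores, s):
    """Return the (row, col) of the highest-scoring token in each s x s patch.

    `scores` is a 2D list of per-token scores (a stand-in for DiffScore);
    height and width are assumed to be divisible by `s`.
    """
    height, width = len(scores), len(scores[0])
    base_tokens = []
    for top in range(0, height, s):
        for left in range(0, width, s):
            # All token positions inside the current non-overlapping patch.
            patch = [(r, c) for r in range(top, top + s) for c in range(left, left + s)]
            base_tokens.append(max(patch, key=lambda rc: scores[rc[0]][rc[1]]))
    return base_tokens


scores = [
    [0.1, 0.9, 0.2, 0.3],
    [0.4, 0.2, 0.8, 0.1],
    [0.7, 0.1, 0.3, 0.2],
    [0.2, 0.3, 0.1, 0.6],
]
base_tokens = select_base_tokens(scores, s=2)
# base_tokens == [(0, 1), (1, 2), (2, 0), (3, 3)]
```

With `s=2` on a 4×4 grid, one token out of every four is kept as a base token, so the remaining three quarters of the tokens are candidates for pruning.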
CFG Alignment:
To handle Classifier-Free Guidance (CFG), the base token positions selected for conditional generation are copied to unconditional generation, so both branches keep the same tokens.
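One way to picture this alignment is the sketch below. It is an illustration under assumptions, not the paper's formulation: positions are flat indices, `align_cfg_base_tokens` and `gather` are hypothetical helpers, and the guidance scale of 7.5 is just an example value. The final line uses the standard CFG combination, unconditional plus scale times (conditional minus unconditional), which only makes sense position by position when both branches kept the same tokens.

```python
def align_cfg_base_tokens(cond_base_tokens):
    """Reuse the conditional branch's base-token positions for the unconditional branch."""
    # Copy the list so that later edits to one branch do not affect the other.
    return {"cond": list(cond_base_tokens), "uncond": list(cond_base_tokens)}


def gather(tokens, positions):
    """Pick the tokens at the given flat positions."""
    return [tokens[p] for p in positions]


cond_tokens = [1.0, 2.0, 3.0, 4.0]
uncond_tokens = [0.5, 1.5, 2.5, 3.5]
aligned = align_cfg_base_tokens([1, 3])  # positions chosen on the conditional branch
kept_cond = gather(cond_tokens, aligned["cond"])
kept_uncond = gather(uncond_tokens, aligned["uncond"])

# CFG combines the two branches element-wise, so the kept tokens must line up.
guidance_scale = 7.5  # illustrative value
guided = [u + guidance_scale * (c - u) for c, u in zip(kept_cond, kept_uncond)]
```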
The search process includes standard evolutionary operations such as selection, crossover, and mutation, ultimately obtaining the optimal timestep-aware strategy F(t).
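The selection, crossover, and mutation loop can be sketched as follows. Everything specific here is an assumption made for illustration: a strategy is encoded as one pruning ratio per timestep, `toy_fitness` is a made-up objective standing in for a real speed and quality measurement, and the population size, generation count, and mutation rate are arbitrary. The paper's actual search space (which also covers caching) and its fitness function are not reproduced.

```python
import random


def evolve(fitness, num_steps, choices, pop_size=8, generations=20,
           mutation_rate=0.2, seed=0):
    """Toy evolutionary search over per-timestep strategies (higher fitness is better)."""
    rng = random.Random(seed)
    population = [[rng.choice(choices) for _ in range(num_steps)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the better half as parents, so the best is never lost.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, num_steps)  # crossover: splice two parents
            child = a[:cut] + b[cut:]
            # Mutation: resample each gene with a small probability.
            child = [rng.choice(choices) if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = parents + children
    return max(population, key=fitness)


def toy_fitness(strategy):
    # Hypothetical objective: reward pruning (speed) and penalise pruning at
    # early steps more heavily (a stand-in for a quality metric such as FID).
    speed = sum(strategy)
    penalty = sum(r * r * (len(strategy) - t) for t, r in enumerate(strategy))
    return speed - 0.5 * penalty


choices = [0.0, 0.3, 0.5, 0.7]
best = evolve(toy_fitness, num_steps=6, choices=choices)
```

Here `best[t]` plays the role of the timestep-aware strategy F(t). In practice each fitness evaluation would require generating images and scoring them, which is far more expensive than this toy objective.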
Dynamics Recovery Mechanism: Selectively pruning low-dynamics tokens and recovering them with high-dynamics tokens restores the feature dynamics distribution disrupted by feature caching.
Unified Caching-Pruning Framework: Combines feature caching and token pruning in a training-free framework, achieving information reuse at both temporal and token levels.
Adaptive Strategy Search: Because redundancy differs across timesteps, the paper proposes an automatic method that searches for the optimal cache depth and pruning ratio at each timestep.
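The prune-then-recover idea behind the dynamics recovery mechanism can be sketched as below. The recovery rule used here, where every pruned token copies the output of the base token in its own patch, is an assumption chosen for simplicity; the summary states only that pruned tokens are recovered with high-dynamics tokens. `prune_and_recover` and `layer` are hypothetical names, and `layer` stands in for a self-attention layer.

```python
def prune_and_recover(tokens, scores, s, layer):
    """Run `layer` on one base token per s x s patch, then fill the patch with its output.

    `tokens` and `scores` are 2D lists of equal shape; `layer` maps a list of
    token values to a list of outputs of the same length.
    """
    height, width = len(tokens), len(tokens[0])
    patches = []
    for top in range(0, height, s):
        for left in range(0, width, s):
            cells = [(r, c) for r in range(top, top + s) for c in range(left, left + s)]
            base = max(cells, key=lambda rc: scores[rc[0]][rc[1]])
            patches.append((base, cells))
    # Only the base tokens are processed, so the layer sees fewer tokens.
    outputs = layer([tokens[r][c] for (r, c), _ in patches])
    recovered = [[None] * width for _ in range(height)]
    for out, (_, cells) in zip(outputs, patches):
        for r, c in cells:
            recovered[r][c] = out  # pruned tokens copy their patch's base-token output
    return recovered


tokens = [[1, 2, 3, 4], [5, 6, 7, 8]]
scores = [[0, 1, 0, 0], [0, 0, 0, 1]]
recovered = prune_and_recover(tokens, scores, s=2, layer=lambda xs: [10 * x for x in xs])
# recovered == [[20, 20, 80, 80], [20, 20, 80, 80]]
```

In this example the layer processes 2 tokens instead of 8, and the full 2×4 grid is rebuilt afterwards so that later layers see the original token count.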
Effectiveness of DiffScore:
Using DiffScore consistently improves FID scores across different cache settings and pruning ratios, validating the effectiveness of the temporal noise difference score.
Impact of CFG Alignment:
As the pruning ratio increases, the benefit of CFG alignment grows, with FID improvements ranging from 13 to 30 points at a high pruning ratio (0.7).
The paper cites 46 related references covering multiple relevant domains including diffusion models, token reduction, and caching mechanisms, providing solid theoretical foundation and comparison benchmarks for this research.
Overall Assessment: This is a high-quality computer vision paper that proposes an innovative solution to the important problem of diffusion model acceleration. The method design is ingenious, experimental evaluation is comprehensive, and practical value is prominent. Although theoretical analysis depth could be improved, its practical contributions and impact are noteworthy.