Post-training quantization of vision encoders needs prefixing registers
Kim, Kim, Yeom et al.
Transformer-based vision encoders -- such as CLIP -- are central to multimodal intelligence, powering applications from autonomous web agents to robotic control. Since these applications often demand real-time processing of massive visual data, reducing the inference cost of vision encoders is critical. Post-training quantization offers a practical path, but remains challenging even at 8-bit precision due to massive-scale activations (i.e., outliers). In this work, we propose RegCache, a training-free algorithm to mitigate outliers in vision encoders, enabling quantization with significantly smaller accuracy drops. The proposed RegCache introduces outlier-prone yet semantically meaningless prefix tokens to the target vision encoder, which prevents other tokens from having outliers. Notably, we observe that outliers in vision encoders behave differently from those in language models, motivating two technical innovations: middle-layer prefixing and token deletion. Experiments show that our method consistently improves the accuracy of quantized models across both text-supervised and self-supervised vision encoders.
Authors: Seunghyeon Kim (POSTECH), Jinho Kim (Dankook University), Taesun Yeom (POSTECH), Wonpyo Park (Google), Kyuyeun Kim (Google), Jaeho Lee (POSTECH)
This research addresses activation outliers in post-training quantization (PTQ) of Transformer-based vision encoders such as CLIP and DINOv2. These outliers increase quantization error and significantly degrade model accuracy even at 8-bit precision.
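To make the effect of an outlier concrete, the following is a minimal sketch of symmetric per-tensor 8-bit quantization on toy data. The function names (`fake_quantize`, `mean_abs_error`) and all numbers are illustrative assumptions, not the paper's setup; the point is only that one massive activation stretches the shared scale and costs the ordinary values most of their resolution.

```python
def fake_quantize(values, num_bits=8):
    """Symmetric per-tensor quantize-then-dequantize with one shared scale."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mean_abs_error(reference, approximation):
    """Average absolute difference between two equally long sequences."""
    return sum(abs(r - a) for r, a in zip(reference, approximation)) / len(reference)


# Toy activations in [-1, 1); illustrative numbers, not taken from the paper.
normal = [((i * 37) % 200 - 100) / 100 for i in range(64)]
with_outlier = normal + [300.0]  # one massive activation shares the tensor

error_clean = mean_abs_error(normal, fake_quantize(normal))
# Compare only the 64 ordinary entries, so the outlier itself is excluded.
error_outlier = mean_abs_error(normal, fake_quantize(with_outlier)[:64])

print(f"mean abs error without outlier: {error_clean:.5f}")
print(f"mean abs error with outlier:    {error_outlier:.5f}")
```

In this toy case the outlier pushes the step size above the magnitude of every ordinary value, so all of them round to zero.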
Practical Requirements: Vision encoders must process massive visual data in real time, in applications ranging from autonomous web agents to robotic control
Computational Cost: Reducing inference cost is critical for deploying large-scale vision models on resource-constrained devices
Quantization Challenges: Activation quantization is more challenging than weight quantization, particularly in computationally constrained scenarios
Inapplicability of LLM Methods: Existing outlier mitigation strategies for large language models require different precision levels or quantization ranges, resulting in complex implementations and high computational overhead
Difficulty in Static Quantization: These methods are difficult to apply to static activation quantization
Specificity of Vision Encoders: Unlike language models, vision encoders lack predefined semantically meaningless tokens (such as <BOS>, <SEP>)
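The static setting mentioned above can be illustrated with a small sketch. In static activation quantization the scale is fixed offline from calibration data and reused for every input, whereas dynamic quantization recomputes it per input. The helper names and values below are hypothetical and chosen only for illustration; calibrating from the maximum absolute value is one common choice, assumed here for simplicity.

```python
QMAX = 127  # largest magnitude representable in symmetric 8-bit quantization


def calibrate_static_scale(calibration_batches):
    """Fix one scale offline from the largest magnitude seen during calibration."""
    return max(abs(v) for batch in calibration_batches for v in batch) / QMAX


def dynamic_scale(batch):
    """Recompute the scale from the current input at inference time."""
    return max(abs(v) for v in batch) / QMAX


def quantize_with_scale(batch, scale):
    """Quantize-then-dequantize a batch with a given scale."""
    return [max(-QMAX, min(QMAX, round(v / scale))) * scale for v in batch]


# Toy calibration data: the second batch contains one outlier activation.
calibration = [[0.2, -0.7, 0.9], [0.1, 250.0, -0.4]]
static_scale = calibrate_static_scale(calibration)

test_batch = [0.3, -0.8, 0.6]  # an input without any outlier
static_out = quantize_with_scale(test_batch, static_scale)
dynamic_out = quantize_with_scale(test_batch, dynamic_scale(test_batch))

print("static :", static_out)
print("dynamic:", dynamic_out)
```

Because the static scale has to cover the outlier seen during calibration, it stays coarse even for inputs that contain no outlier, which is why removing outliers at the source matters most in this setting.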
Proposes RegCache Algorithm: A training-free outlier mitigation algorithm that reduces outliers in vision encoders through prefix register tokens
Discovers Outlier Characteristics in Vision Encoders: Demonstrates that outlier behavior in vision encoders differs significantly from that in language models, with outliers appearing in middle layers rather than early layers
Technical Innovations: Introduces two key techniques: middle-layer prefixing and token deletion
Comprehensive Validation: Verifies the method's effectiveness across multiple text-supervised and self-supervised vision encoders
Given a pretrained vision encoder, the objective is to mitigate outliers in quantization-sensitive layers by introducing external register tokens, thereby improving the accuracy of quantized models while maintaining inference efficiency.
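The control flow implied by middle-layer prefixing and token deletion can be sketched as follows. This is not the authors' implementation: the summary does not say which tokens are deleted or where, so the sketch assumes the prefixed registers are simply dropped again after a chosen layer. The function `encode_with_prefix_registers`, the parameters `prefix_layer` and `delete_layer`, and the stand-in layer are all hypothetical.

```python
from typing import Callable, List

Token = List[float]
Layer = Callable[[List[Token]], List[Token]]


def encode_with_prefix_registers(
    patch_tokens: List[Token],
    layers: List[Layer],
    registers: List[Token],
    prefix_layer: int,
    delete_layer: int,
) -> List[Token]:
    """Prepend `registers` right before layer `prefix_layer` and drop them
    right after layer `delete_layer` (assumed placement, for illustration)."""
    if not 0 <= prefix_layer <= delete_layer < len(layers):
        raise ValueError("need 0 <= prefix_layer <= delete_layer < len(layers)")
    tokens = list(patch_tokens)
    for index, layer in enumerate(layers):
        if index == prefix_layer:
            tokens = list(registers) + tokens  # middle-layer prefixing
        tokens = layer(tokens)
        if index == delete_layer:
            tokens = tokens[len(registers):]  # token deletion
    return tokens


def mean_mixing_layer(tokens: List[Token]) -> List[Token]:
    """Stand-in for attention: each token moves halfway toward the sequence mean."""
    dim = len(tokens[0])
    mean = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    return [[0.5 * t[d] + 0.5 * mean[d] for d in range(dim)] for t in tokens]


patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
registers = [[0.0, 0.0]]  # placeholder register token, not a real cached one
layers = [mean_mixing_layer] * 4

output = encode_with_prefix_registers(
    patches, layers, registers, prefix_layer=2, delete_layer=3
)
print(len(output), "tokens out for", len(patches), "patch tokens in")
```

The first two layers see only the patch tokens, the last two also see the register, and the output sequence has the same length as the input.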
The solution is based on three important observations:
Layer-wise Quantization Sensitivity: Quantization sensitivity in vision encoders is concentrated in middle layers rather than early layers
Universality of Outlier Tokens: Outlier tokens appearing in middle layers exhibit high similarity across different images (cosine similarity 0.89 vs 0.26)
Middle-Layer Emergence Mechanism: Vision encoders require several initial layers to process images before identifying which tokens are semantically meaningless
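The cross-image similarity in the second observation can be measured along the following lines. This sketch assumes that the outlier token of an image is the one with the largest L2 norm, which is a simplification introduced here and not necessarily the paper's criterion. The toy vectors are made up and do not reproduce the reported 0.89 and 0.26 figures.

```python
import math
from itertools import combinations


def l2_norm(vector):
    return math.sqrt(sum(x * x for x in vector))


def cosine_similarity(a, b):
    return sum(x * y for x, y in zip(a, b)) / (l2_norm(a) * l2_norm(b))


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


# Toy middle-layer token features for three images (illustrative values only).
images = [
    [[0.2, -0.1, 0.3], [40.0, 1.0, -2.0], [-0.3, 0.2, 0.1]],
    [[-0.1, 0.4, 0.2], [0.3, -0.2, -0.1], [38.0, -1.5, 1.0]],
    [[35.0, 2.0, 0.5], [0.1, 0.1, -0.4], [-0.2, -0.3, 0.2]],
]

# Assumption: the outlier token is the highest-norm token in each image.
outlier_tokens = [max(tokens, key=l2_norm) for tokens in images]
normal_tokens = [
    token
    for tokens, outlier in zip(images, outlier_tokens)
    for token in tokens
    if token is not outlier
]

outlier_similarity = mean_pairwise_cosine(outlier_tokens)
normal_similarity = mean_pairwise_cosine(normal_tokens)
print(f"outlier tokens across images: {outlier_similarity:.2f}")
print(f"normal tokens across images:  {normal_similarity:.2f}")
```

A high similarity among outlier tokens from different images is what makes it plausible to reuse one fixed set of register tokens for every input.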
This paper cites important works across quantization, attention mechanisms, and vision Transformers, including:
Original papers on vision encoders such as CLIP and DINOv2
ViT quantization methods such as PTQ4ViT and RepQ-ViT
Research on attention sink and register tokens
Outlier handling methods in LLM quantization
Overall Assessment: This is a high-quality paper that makes a significant contribution to vision encoder quantization. Beyond proposing an effective, training-free technique, the authors offer valuable insight into how outlier behavior differs between vision encoders and language models, giving the field both a better understanding of the problem and a practical tool for addressing it.