Post-training quantization of vision encoders needs prefixing registers

Kim, Kim, Yeom et al.

Transformer-based vision encoders -- such as CLIP -- are central to multimodal intelligence, powering applications from autonomous web agents to robotic control. Since these applications often demand real-time processing of massive visual data, reducing the inference cost of vision encoders is critical. Post-training quantization offers a practical path, but remains challenging even at 8-bit precision due to massive-scale activations (i.e., outliers). In this work, we propose $\textit{RegCache}$, a training-free algorithm to mitigate outliers in vision encoders, enabling quantization with significantly smaller accuracy drops. The proposed RegCache introduces outlier-prone yet semantically meaningless prefix tokens to the target vision encoder, which prevents other tokens from having outliers. Notably, we observe that outliers in vision encoders behave differently from those in language models, motivating two technical innovations: middle-layer prefixing and token deletion. Experiments show that our method consistently improves the accuracy of quantized models across both text-supervised and self-supervised vision encoders.

Basic Information

Paper ID: 2510.04547
Title: Post-training quantization of vision encoders needs prefixing registers
Authors: Seunghyeon Kim (POSTECH), Jinho Kim (Dankook University), Taesun Yeom (POSTECH), Wonpyo Park (Google), Kyuyeun Kim (Google), Jaeho Lee (POSTECH)
Classification: cs.LG, cs.CV
Publication Date: October 2025 (Preprint)
Paper Link: https://arxiv.org/abs/2510.04547v2

Research Background and Motivation

Problem Definition

This research addresses activation outliers in post-training quantization (PTQ) of Transformer-based vision encoders such as CLIP and DINOv2. A few massive-scale activations stretch the range the quantizer has to cover, which leaves fewer levels for ordinary activations and degrades model accuracy even at 8-bit precision.
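The effect of a single outlier can be illustrated with a minimal sketch. It assumes a generic symmetric per-tensor uniform quantizer (not necessarily the quantizers evaluated in the paper), and the activation values are made up for illustration.

```python
def per_tensor_scale(values, bits=8):
    """Step size of a symmetric quantizer whose range covers the largest magnitude."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    return max(abs(v) for v in values) / qmax


def quantize_dequantize(values, scale, bits=8):
    """Round to the integer grid, clip to the signed range, map back to real values."""
    qmax = 2 ** (bits - 1) - 1
    restored = []
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))
        restored.append(q * scale)
    return restored


def mean_abs_error(values, scale, bits=8):
    """Average absolute reconstruction error of `values` under the given scale."""
    restored = quantize_dequantize(values, scale, bits)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)


# Hypothetical activations: ordinary values in [-1, 1].
normal = [0.01 * i for i in range(-100, 101)]

# Scale chosen from the ordinary values only, versus a scale stretched by one outlier.
scale_clean = per_tensor_scale(normal)
scale_outlier = per_tensor_scale(normal + [200.0])

# Error measured on the ordinary values in both cases.
err_clean = mean_abs_error(normal, scale_clean)
err_outlier = mean_abs_error(normal, scale_outlier)
```

With the stretched scale, most ordinary values round to zero, so `err_outlier` is far larger than `err_clean` even though only one value changed.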

Significance Analysis

Practical Requirements: Edge applications such as autonomous driving and robotic control need vision encoders that process massive visual data in real time
Computational Cost: Reducing inference cost is critical for deploying large-scale vision models on resource-constrained devices
Quantization Challenges: Activation quantization is more challenging than weight quantization, particularly in computationally constrained scenarios

Limitations of Existing Methods

Inapplicability of LLM Methods: Existing outlier mitigation strategies for large language models require different precision levels or quantization ranges, resulting in complex implementations and high computational overhead
Difficulty in Static Quantization: These methods are difficult to apply to static activation quantization
Specificity of Vision Encoders: Unlike language models, vision encoders lack predefined semantically meaningless tokens (such as <BOS>, <SEP>)
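For readers unfamiliar with the distinction, the following sketch contrasts static and dynamic activation quantization. It is a generic illustration with made-up numbers and hypothetical helper names, not the setup used in the paper: a static scale is fixed offline from calibration data, while a dynamic scale is recomputed for every input.

```python
def calibrate_static_scale(calibration_batches, bits=8):
    """Static quantization: choose one scale offline from calibration data."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for batch in calibration_batches for v in batch)
    return max_abs / qmax


def dynamic_scale(batch, bits=8):
    """Dynamic quantization: recompute the scale from each incoming batch."""
    qmax = 2 ** (bits - 1) - 1
    return max(abs(v) for v in batch) / qmax


def quantize(batch, scale, bits=8):
    """Map real values to clipped signed integers using the given scale."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in batch]


# Hypothetical calibration activations without outliers.
calibration = [[0.2, -0.7, 1.0], [0.5, -0.9, 0.3]]
static_scale = calibrate_static_scale(calibration)  # fixed before deployment

# Hypothetical inference batch containing an unseen outlier.
test_batch = [0.4, -0.6, 50.0]
q_static = quantize(test_batch, static_scale)                # outlier saturates
q_dynamic = quantize(test_batch, dynamic_scale(test_batch))  # small values collapse
```

Neither option is attractive when outliers are present: the static scale clips the outlier, and the dynamic scale preserves it at the cost of resolution for every other value (plus a per-input range computation at runtime).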

Core Contributions

Proposes RegCache Algorithm: A training-free outlier mitigation algorithm that reduces outliers in vision encoders through prefix register tokens
Discovers Outlier Characteristics in Vision Encoders: Demonstrates that outlier behavior in vision encoders differs significantly from that in language models, with outliers emerging in middle layers rather than early layers
Technical Innovations: Introduces two key techniques: middle-layer prefixing and token deletion
Comprehensive Validation: Verifies the method's effectiveness across multiple text-supervised and self-supervised vision encoders

Methodology Details

Task Definition

Given a pretrained vision encoder, the objective is to mitigate outliers in quantization-sensitive layers by introducing external register tokens, thereby improving the accuracy of quantized models while maintaining inference efficiency.

Core Observations

The solution is based on three important observations:

Layer-wise Quantization Sensitivity: Quantization sensitivity in vision encoders is concentrated in middle layers rather than early layers
Universality of Outlier Tokens: Outlier tokens appearing in middle layers exhibit high similarity across different images (cosine similarity 0.89 vs 0.26)
Middle-Layer Emergence Mechanism: Vision encoders require several initial layers to process images before identifying which tokens are semantically meaningless
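The universality observation rests on comparing token similarity across images. A minimal sketch of such a measurement is given here, assuming tokens are plain lists of floats; the toy vectors are invented and do not reproduce the reported 0.89 and 0.26 values.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(tokens):
    """Average cosine similarity over all unordered pairs of tokens."""
    pairs = list(combinations(tokens, 2))
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)


# Hypothetical outlier tokens taken from three different images:
# one channel dominates, so they point in nearly the same direction.
outlier_tokens = [[50.0, 1.0, -0.5], [48.0, 0.5, 0.2], [52.0, -0.8, 0.1]]

# Hypothetical ordinary tokens from the same images.
normal_tokens = [[0.3, -1.0, 0.4], [-0.6, 0.2, 0.9], [0.8, 0.7, -0.5]]

sim_outlier = mean_pairwise_cosine(outlier_tokens)
sim_normal = mean_pairwise_cosine(normal_tokens)
```

A high `sim_outlier` relative to `sim_normal` is the kind of evidence that would justify reusing one register across all images.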

RegCache Algorithm Architecture

RegCache comprises three main steps:

1. Register Candidate Curation

$$S = \operatorname{arg\,top\text{-}k} \left\{ \|z\|_\infty \;\middle|\; z \in \Phi_{l_q}(x) \text{ for some } x \in I_{\mathrm{ref}} \right\}$$

Identifies quantization-sensitive layers $l_q$ through layer-wise quantization sensitivity analysis
Selects top-k tokens with maximum ℓ∞ norm from reference image pool as register candidates
Uses 50,000 randomly selected images from ImageNet-1k training set as reference pool
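The curation step amounts to a top-k selection by ℓ∞ norm over all tokens collected at the quantization-sensitive layer. A minimal sketch follows, assuming the per-image token lists have already been extracted; the function names and toy data are hypothetical.

```python
import heapq


def linf_norm(token):
    """l-infinity norm: the largest absolute entry of the token."""
    return max(abs(v) for v in token)


def curate_register_candidates(tokens_per_image, k):
    """Return the k tokens with the largest l-infinity norm across all images.

    tokens_per_image[i] stands in for the token set of reference image i
    at the quantization-sensitive layer.
    """
    all_tokens = (tok for tokens in tokens_per_image for tok in tokens)
    return heapq.nlargest(k, all_tokens, key=linf_norm)


# Hypothetical tokens from two reference images (2-dimensional for brevity).
tokens_per_image = [
    [[0.1, -0.2], [30.0, 0.5], [0.3, 0.1]],
    [[-45.0, 1.0], [0.2, 0.2]],
]
candidates = curate_register_candidates(tokens_per_image, k=2)
```

`heapq.nlargest` returns the selected tokens in descending order of norm, which keeps the candidate list stable for the later search.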

2. Caching

$$(z^*, \tau^*) = \operatorname{arg\,max} \left\{ \mathrm{acc}_{\mathrm{ref}}(z, \tau) \;\middle|\; z \in S,\ \tau \in \{1, \dots, 15\} \right\}$$

Computes key-value cache for each register candidate
Determines optimal register z* and repetition count τ* through grid search
Inserts selected KV cache into quantization-sensitive layers and subsequent layers
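The caching step can be sketched as a key-value projection followed by an exhaustive grid search. This is a simplified illustration under several assumptions: a single attention layer, no normalization or multi-head splitting, hypothetical projection matrices, and a toy scoring function standing in for the reference accuracy, which in practice would come from evaluating the quantized model.

```python
from itertools import product


def matvec(token, weight):
    """Project a token of length d_in with a weight matrix of d_in rows and d_out columns."""
    d_out = len(weight[0])
    return [sum(token[i] * weight[i][j] for i in range(len(token))) for j in range(d_out)]


def build_prefix_cache(register, w_k, w_v, repeats):
    """Compute the register's key and value once and repeat them `repeats` times."""
    key = matvec(register, w_k)
    value = matvec(register, w_v)
    return [key] * repeats, [value] * repeats


def grid_search(candidates, max_repeats, acc_ref):
    """Return the (candidate index, repeats) pair with the highest acc_ref score."""
    grid = product(range(len(candidates)), range(1, max_repeats + 1))
    return max(grid, key=lambda pair: acc_ref(candidates[pair[0]], pair[1]))


def toy_acc_ref(register, repeats):
    """Hypothetical stand-in for reference accuracy, for illustration only."""
    peak = max(abs(v) for v in register)
    return 1.0 - 0.01 * abs(peak - 30.0) - 0.05 * abs(repeats - 3)


candidates = [[30.0, 0.5], [-45.0, 1.0]]
best_index, best_repeats = grid_search(candidates, max_repeats=15, acc_ref=toy_acc_ref)

# Hypothetical identity and scaled-identity projections.
w_k = [[1.0, 0.0], [0.0, 1.0]]
w_v = [[2.0, 0.0], [0.0, 2.0]]
prefix_keys, prefix_values = build_prefix_cache(candidates[best_index], w_k, w_v, best_repeats)
```

Because the cache is computed once offline, the search cost is paid before deployment; at inference only the stored keys and values are reused.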

3. Deletion

$$D = \operatorname{arg\,top\text{-}\tilde{k}} \left\{ \|z\|_\infty \;\middle|\; z \in \Phi_{l_q}(x_{\mathrm{test}}) \right\}$$

Adds token deletion layer at the input of quantization-sensitive layers
Deletes top-k̃ internally occurring sink tokens with maximum ℓ∞ norm during inference
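A minimal sketch of the deletion step is shown here, assuming tokens are plain lists of floats and that the remaining tokens keep their original order. How special tokens (for example a class token) are treated is not specified in this summary, so the sketch does not handle them.

```python
def linf_norm(token):
    """l-infinity norm: the largest absolute entry of the token."""
    return max(abs(v) for v in token)


def delete_sink_tokens(tokens, num_delete):
    """Drop the num_delete tokens with the largest l-infinity norm, keeping order."""
    if num_delete <= 0:
        return list(tokens)
    ranked = sorted(range(len(tokens)), key=lambda i: linf_norm(tokens[i]), reverse=True)
    drop = set(ranked[:num_delete])
    return [tok for i, tok in enumerate(tokens) if i not in drop]


# Hypothetical tokens of one test image at the quantization-sensitive layer.
tokens = [[0.1, -0.2], [30.0, 0.5], [0.3, 0.1], [-45.0, 1.0]]
kept = delete_sink_tokens(tokens, num_delete=2)
```

Removing tokens also shortens the sequence seen by later layers, which partly offsets the extra attention cost of the prefix.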

Technical Innovations

Middle-Layer Prefixing Strategy: Inserts the prefix at the middle layers where vision-encoder outliers emerge, in contrast to the early-layer prefixing used for LLMs
Universal Register Discovery: Leverages similarity of outlier tokens across different images to construct universal registers
Add-Delete Mechanism: Replaces internally occurring sink tokens with externally precomputed caches, avoiding impact on activation quantization range
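How a precomputed prefix participates in attention can be sketched with single-head scaled dot-product attention. The prefix contributes only keys and values, so it adds no query rows and no output tokens. All vectors are made up, and the sketch omits multi-head structure and quantization.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_with_prefix(queries, keys, values, prefix_keys, prefix_values):
    """Attend over cached prefix entries followed by the image tokens."""
    all_keys = prefix_keys + keys
    all_values = prefix_values + values
    scale = 1.0 / math.sqrt(len(queries[0]))
    dim_out = len(all_values[0])
    outputs, weights = [], []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in all_keys]
        w = softmax(scores)
        out = [sum(w[t] * all_values[t][j] for t in range(len(all_values)))
               for j in range(dim_out)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Hypothetical image tokens (already projected to queries, keys, values).
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[0.5, 0.1], [0.1, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0]]

# Hypothetical cached register: a key that attracts attention, a neutral value.
prefix_keys = [[4.0, 4.0]]
prefix_values = [[0.0, 0.0]]

outputs, weights = attention_with_prefix(queries, keys, values, prefix_keys, prefix_values)
```

In this toy setup the first column of `weights` (the prefix) receives most of the attention mass, which is the sink role that internal tokens would otherwise take on.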

Experimental Setup

Datasets

ImageNet-1k: For zero-shot image classification evaluation
MS-COCO: For image-text retrieval task evaluation
Other Classification Datasets: Stanford Cars, Flowers-102, Food-101, CIFAR-100 (for generalization validation)
Reference Data: 50,000 images from ImageNet-1k training set for register search

Evaluation Metrics

Zero-shot Classification Accuracy: Top-1 accuracy on ImageNet-1k
Retrieval Performance: Recall@1 and Recall@5 on MS-COCO
Outlier Analysis: Maximum token norm and average token norm
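The two accuracy-style metrics can be computed from score matrices as sketched here. The inputs are hypothetical: `scores[i][c]` is the score of class `c` for image `i`, and `similarity[i][j]` is the similarity between query `i` and gallery item `j`.

```python
def top1_accuracy(scores, labels):
    """Fraction of samples whose highest-scoring class equals the label."""
    correct = 0
    for row, label in zip(scores, labels):
        predicted = max(range(len(row)), key=row.__getitem__)
        correct += predicted == label
    return correct / len(labels)


def recall_at_k(similarity, ground_truth, k):
    """Fraction of queries with at least one relevant item among the top k."""
    hits = 0
    for row, relevant in zip(similarity, ground_truth):
        top_k = sorted(range(len(row)), key=row.__getitem__, reverse=True)[:k]
        hits += any(j in relevant for j in top_k)
    return hits / len(similarity)


# Hypothetical 3-by-3 score matrix reused for both metrics.
matrix = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.1],
]
accuracy = top1_accuracy(matrix, labels=[0, 1, 1])
recall_1 = recall_at_k(matrix, ground_truth=[{0}, {1}, {1}], k=1)
recall_2 = recall_at_k(matrix, ground_truth=[{0}, {1}, {1}], k=2)
```

Ground truth is passed as sets so that a query may have several relevant items, as is the case for image-to-text retrieval with multiple captions per image.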

Baseline Methods

Baseline Quantization Algorithms:
- PTQ4ViT: Dual uniform quantizer for ViT
- RepQ-ViT: Scale reparameterization method
- NoisyQuant: Noise-enhanced activation quantization
Precision Settings: W8A8 (8-bit weights, 8-bit activations) and W6A6 (6-bit weights, 6-bit activations)
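As a rough guide to what the two precision settings mean for resolution, assume a symmetric signed uniform quantizer with $b$ bits (an assumption made here for illustration; the baseline methods use more elaborate quantizers). Its step size is

$$\Delta_b = \frac{\max_i |x_i|}{2^{b-1} - 1}, \qquad \Delta_8 = \frac{\max_i |x_i|}{127}, \qquad \Delta_6 = \frac{\max_i |x_i|}{31}, \qquad \frac{\Delta_6}{\Delta_8} = \frac{127}{31} \approx 4.1$$

Under this assumption, moving from 8 to 6 bits makes the grid about four times coarser for the same activation range, so any outlier that inflates $\max_i |x_i|$ is correspondingly more damaging at W6A6.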

Implementation Details

Uses 1,024 calibration samples for NoisyQuant and 32 for RepQ-ViT
Register candidate count k=20, repetition count range τ∈{1,...,15}
Token deletion count k̃ tuned through reference task

Experimental Results

Main Results

Zero-shot Image Classification (ImageNet-1k)

Model	Precision	Baseline Best	RegCache Best	Improvement
CLIP-B/16	W8A8	67.69%	67.78%	+0.09%
CLIP-B/16	W6A6	58.19%	66.65%	+8.46%
SigLIP2-B/16	W8A8	76.92%	77.26%	+0.34%
SigLIP2-B/16	W6A6	64.91%	70.88%	+5.97%

Image-Text Retrieval (MS-COCO)

CLIP-B/16: Average improvement of 3.76%-7.97% across all retrieval metrics
SigLIP-B/16: Recall@1 improvement of 0.20%, overall stable performance gains

Outlier Mitigation Effects

Model	Max Token Norm (Original)	Max Token Norm (RegCache)	Reduction Ratio
CLIP	61.17	15.30	-75.0%
OpenCLIP	122.99	12.38	-89.9%
SigLIP2	244.78	30.45	-87.6%
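The reduction ratios in the table follow directly from the two norm columns, as this short check shows. The helper name is hypothetical; the numbers are the ones listed above.

```python
def reduction_percent(original, reduced):
    """Relative change from `original` to `reduced`, in percent (negative means smaller)."""
    return 100.0 * (reduced - original) / original


# (original max token norm, max token norm with RegCache), copied from the table.
max_norms = {
    "CLIP": (61.17, 15.30),
    "OpenCLIP": (122.99, 12.38),
    "SigLIP2": (244.78, 30.45),
}

reductions = {
    model: round(reduction_percent(before, after), 1)
    for model, (before, after) in max_norms.items()
}
```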

Ablation Study

Ablation studies on SigLIP show:

Prefix Cache Only: Accuracy improves from 69.71% to 74.21%
Token Deletion Only: Accuracy drops to 38.51% (demonstrating need for prefix support)
Complete RegCache: Accuracy reaches 74.42%

Generalization Validation

Prefixes searched on ImageNet-1k remain effective on other datasets:

Stanford Cars: +1.78% to +47.47%
Food-101: +9.85% to +51.28%
CIFAR-100: +12.81% to +33.00%

Related Work

Transformer Outlier Research

Systematic study of activation outliers in large-scale Transformers
Outlier behavior of specific tokens in LLMs (such as <BOS>, <SEP>)
Outliers in ViT typically correspond to uninformative background patches

Attention Sink Control

Attention sink: Tokens that attract excessive attention but contain little semantic information
Adding register tokens during training to absorb attention and mitigate attention sink
This work leverages sink tokens from PTQ perspective to improve quantization performance

ViT Post-training Quantization

Early methods: Allocate dynamic bit-widths for attention-sensitive layers
Existing methods: Isolate and minimize outlier impact through specialized quantization schemes
This work: Address outliers through token prefixing rather than quantizer granularity control

Conclusions and Discussion

Main Conclusions

RegCache Effectiveness: Consistently improves performance across multiple vision encoders and quantization methods
Outlier Mitigation Mechanism: Successfully transfers outliers from internal tokens to external precomputed caches
Generalizability: Method applies to both text-supervised and self-supervised vision encoders

Limitations

Hyperparameter Tuning: Requires evaluating multiple prefix candidates to determine optimal configuration
Additional Hyperparameters: Introduces hyperparameters such as maximum token deletion count and prefix token count
Computational Overhead: The prefix adds some inference cost, although the FLOPs increase stays within 0.2%

Future Directions

Multi-modal Difference Research: Deeper understanding of quantization behavior differences between text-supervised and self-supervised models
Outlier Mechanism Understanding: Further investigation of the fundamental causes of the outlier behavior differences between ViTs and LLMs
Automated Optimization: Develop methods to automatically determine optimal prefix configurations

In-depth Evaluation

Strengths

Problem Importance: Addresses critical technical challenges in vision encoder quantization
Method Innovation: First to introduce register concept to vision encoder quantization with novel technical approach
Theoretical Insights: Provides deep analysis of essential differences in outlier behavior between vision encoders and LLMs
Comprehensive Experiments: Covers 5 mainstream vision encoders and multiple quantization algorithms with convincing results
Practical Value: Training-free and easily integrated into existing quantization pipelines

Weaknesses

Limited Theoretical Analysis: Lacks deep theoretical explanation for why middle-layer prefixing is effective
Hyperparameter Sensitivity: Method involves multiple hyperparameters that may affect practical deployment convenience
Computational Overhead Analysis: While FLOPs increase is minimal, lacks detailed analysis of memory usage and latency
Limited Scope: Primarily validates ViT architecture; applicability to other vision Transformer architectures insufficiently verified

Impact

Academic Contribution: Provides new technical pathways and theoretical insights for vision encoder quantization
Practical Value: Directly applicable to deployment optimization of existing vision encoders
Reproducibility: Clear method description and detailed experimental setup ensure good reproducibility
Inspirational Value: Provides important reference for cross-modal model optimization technique transfer

Applicable Scenarios

Edge Deployment: Particularly suitable for deploying large-scale vision encoders on resource-constrained devices
Real-time Applications: Applications requiring low-latency visual processing such as autonomous driving and robotic control
Multimodal Systems: Quantization deployment of CLIP-like models in various downstream tasks
Research Tool: Provides effective baseline method for vision Transformer quantization research

References

This paper cites important works across quantization, attention mechanisms, and vision Transformers, including:

Original papers on vision encoders such as CLIP and DINOv2
ViT quantization methods such as PTQ4ViT and RepQ-ViT
Research on attention sink and register tokens
Outlier handling methods in LLM quantization

Overall Assessment: This is a high-quality paper with significant contributions to vision encoder quantization. The authors propose an effective, training-free technique and show how outlier behavior in vision encoders differs from that in language models, giving the field both a clearer understanding of the problem and a practical tool for deployment.