
Post-training quantization of vision encoders needs prefixing registers

Kim, Kim, Yeom et al.
Transformer-based vision encoders -- such as CLIP -- are central to multimodal intelligence, powering applications from autonomous web agents to robotic control. Since these applications often demand real-time processing of massive visual data, reducing the inference cost of vision encoders is critical. Post-training quantization offers a practical path, but remains challenging even at 8-bit precision due to massive-scale activations (i.e., outliers). In this work, we propose $\textit{RegCache}$, a training-free algorithm to mitigate outliers in vision encoders, enabling quantization with significantly smaller accuracy drops. The proposed RegCache introduces outlier-prone yet semantically meaningless prefix tokens to the target vision encoder, which prevents other tokens from having outliers. Notably, we observe that outliers in vision encoders behave differently from those in language models, motivating two technical innovations: middle-layer prefixing and token deletion. Experiments show that our method consistently improves the accuracy of quantized models across both text-supervised and self-supervised vision encoders.

Basic Information

  • Paper ID: 2510.04547
  • Title: Post-training quantization of vision encoders needs prefixing registers
  • Authors: Seunghyeon Kim (POSTECH), Jinho Kim (Dankook University), Taesun Yeom (POSTECH), Wonpyo Park (Google), Kyuyeun Kim (Google), Jaeho Lee (POSTECH)
  • Classification: cs.LG, cs.CV
  • Publication Date: October 2025 (Preprint)
  • Paper Link: https://arxiv.org/abs/2510.04547v2

Abstract

Transformer-based vision encoders -- such as CLIP -- are central to multimodal intelligence, powering applications from autonomous web agents to robotic control. Since these applications often demand real-time processing of massive visual data, reducing the inference cost of vision encoders is critical. Post-training quantization offers a practical path, but remains challenging even at 8-bit precision due to massive-scale activations (i.e., outliers). In this work, we propose RegCache, a training-free algorithm to mitigate outliers in vision encoders, enabling quantization with significantly smaller accuracy drops. The proposed RegCache introduces outlier-prone yet semantically meaningless prefix tokens to the target vision encoder, which prevents other tokens from having outliers. Notably, we observe that outliers in vision encoders behave differently from those in language models, motivating two technical innovations: middle-layer prefixing and token deletion. Experiments show that our method consistently improves the accuracy of quantized models across both text-supervised and self-supervised vision encoders.

Research Background and Motivation

Problem Definition

This research addresses activation outliers in post-training quantization (PTQ) of Transformer-based vision encoders such as CLIP and DINOv2. These outliers stretch the activation quantization range and noticeably degrade the accuracy of quantized models, even at 8-bit precision.
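
To illustrate why a single massive activation hurts, the following minimal sketch applies symmetric per-tensor 8-bit quantization to toy activations with and without one planted outlier. The tensor shape, the outlier value of 200, and the helper names are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def fake_quantize(x, bits=8):
    """Symmetric per-tensor uniform quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1             # 127 for 8 bits
    scale = np.abs(x).max() / qmax         # a single scale shared by the whole tensor
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale

rng = np.random.default_rng(0)
tokens = rng.normal(0.0, 1.0, size=(197, 768))   # toy activations (assumed shape)
with_outlier = tokens.copy()
with_outlier[5, 10] = 200.0                      # one hypothetical massive activation

def mse_on_normal_entries(x):
    """Quantization error measured on every entry except the planted outlier."""
    err = (fake_quantize(x) - x) ** 2
    mask = np.ones_like(x, dtype=bool)
    mask[5, 10] = False
    return err[mask].mean()

clean_mse = mse_on_normal_entries(tokens)
outlier_mse = mse_on_normal_entries(with_outlier)
print(f"MSE without outlier: {clean_mse:.6f}")
print(f"MSE with outlier:    {outlier_mse:.6f}")
```

Because the scale is set by the largest magnitude, the outlier coarsens the grid for every other entry, which is the effect that outlier mitigation aims to avoid.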

Significance Analysis

  1. Practical Requirements: Edge applications such as autonomous driving and robotic control require vision encoders to process massive visual data in real time
  2. Computational Cost: Reducing inference cost is critical for deploying large-scale vision models on resource-constrained devices
  3. Quantization Challenges: Activation quantization is more challenging than weight quantization, particularly in computationally constrained scenarios

Limitations of Existing Methods

  1. Inapplicability of LLM Methods: Existing outlier mitigation strategies for large language models require different precision levels or quantization ranges, resulting in complex implementations and high computational overhead
  2. Difficulty in Static Quantization: These methods are difficult to apply to static activation quantization
  3. Specificity of Vision Encoders: Unlike language models, vision encoders lack predefined semantically meaningless tokens (such as <BOS>, <SEP>)

Core Contributions

  1. Proposes RegCache Algorithm: A training-free outlier mitigation algorithm that reduces outliers in vision encoders through prefix register tokens
  2. Discovers Outlier Characteristics in Vision Encoders: Demonstrates that outlier behavior in vision encoders differs significantly from that in language models, with outliers appearing in middle layers rather than early layers
  3. Technical Innovations: Introduces two key techniques: middle-layer prefixing and token deletion
  4. Comprehensive Validation: Verifies the method's effectiveness across multiple text-supervised and self-supervised vision encoders

Methodology Details

Task Definition

Given a pretrained vision encoder, the objective is to mitigate outliers in quantization-sensitive layers by introducing external register tokens, thereby improving the accuracy of quantized models while maintaining inference efficiency.

Core Observations

The solution is based on three important observations:

  1. Layer-wise Quantization Sensitivity: Quantization sensitivity in vision encoders is concentrated in middle layers rather than early layers
  2. Universality of Outlier Tokens: Outlier tokens appearing in middle layers exhibit high similarity across different images (cosine similarity 0.89 vs 0.26)
  3. Middle-Layer Emergence Mechanism: Vision encoders require several initial layers to process images before identifying which tokens are semantically meaningless
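
The second observation can be checked with a measurement along the following lines: take the highest-norm token from each image and compare the mean pairwise cosine similarity against that of ordinary tokens. This is a sketch on synthetic data, where the shared outlier direction is planted by hand, so the printed values are unrelated to the 0.89 and 0.26 reported above. All function names are hypothetical.

```python
import numpy as np

def token_by_norm(tokens, largest=True):
    """Return the row with the largest (or smallest) l-infinity norm."""
    norms = np.abs(tokens).max(axis=1)
    return tokens[norms.argmax() if largest else norms.argmin()]

def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all distinct pairs of rows."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sim = unit @ unit.T
    n = len(vectors)
    return (sim.sum() - np.trace(sim)) / (n * (n - 1))

rng = np.random.default_rng(0)
shared = np.zeros(64)
shared[[3, 17]] = [50.0, -40.0]          # synthetic direction shared by all outliers

images = []
for _ in range(8):
    toks = rng.normal(size=(50, 64))     # toy token activations for one image
    toks[rng.integers(50)] += shared     # plant one outlier token at a random position
    images.append(toks)

outlier_sim = mean_pairwise_cosine(np.stack([token_by_norm(t, True) for t in images]))
normal_sim = mean_pairwise_cosine(np.stack([token_by_norm(t, False) for t in images]))
print(f"outlier tokens: {outlier_sim:.2f}, ordinary tokens: {normal_sim:.2f}")
```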

RegCache Algorithm Architecture

RegCache comprises three main steps:

1. Register Candidate Curation

$S = \operatorname{arg\,top\text{-}k}\,\{\, \|z\|_\infty \mid z \in \Phi_{l_q}(x) \text{ for some } x \in \mathcal{I}_{\text{ref}} \,\}$
  • Identifies the quantization-sensitive layers $l_q$ through layer-wise quantization sensitivity analysis
  • Selects top-k tokens with maximum ℓ∞ norm from reference image pool as register candidates
  • Uses 50,000 randomly selected images from ImageNet-1k training set as reference pool
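
The curation step above can be sketched as a top-k selection over all tokens in the reference pool. This is a minimal NumPy illustration, assuming the layer activations have already been collected as one array per image; the function name and the toy pool are hypothetical.

```python
import numpy as np

def curate_register_candidates(layer_outputs, k=20):
    """Pick the k tokens with the largest l-infinity norm across a reference pool.

    layer_outputs: list of (num_tokens, dim) arrays, one per reference image,
    holding activations at the quantization-sensitive layer.
    Returns (candidates, origins) with origins[i] = (image_index, token_index).
    """
    all_tokens = np.concatenate(layer_outputs, axis=0)
    norms = np.abs(all_tokens).max(axis=1)
    top = np.argsort(norms)[::-1][:k]                     # largest norms first
    offsets = np.cumsum([0] + [len(t) for t in layer_outputs])
    image_idx = np.searchsorted(offsets, top, side="right") - 1
    origins = [(int(i), int(t - offsets[i])) for i, t in zip(image_idx, top)]
    return all_tokens[top], origins

# Toy pool: 4 images, 10 tokens each, with one planted high-norm token.
rng = np.random.default_rng(0)
pool = [rng.normal(size=(10, 8)) for _ in range(4)]
pool[2][3, 5] = 100.0
candidates, origins = curate_register_candidates(pool, k=3)
print(origins[0])   # (2, 3)
```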

2. Caching

$(z^*, \tau^*) = \arg\max\,\{\, \mathrm{acc}_{\text{ref}}(z, \tau) \mid z \in S,\ \tau \in \{1, \dots, 15\} \,\}$
  • Computes key-value cache for each register candidate
  • Determines optimal register z* and repetition count τ* through grid search
  • Inserts selected KV cache into quantization-sensitive layers and subsequent layers
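
The grid search over register candidates and repetition counts can be sketched as an exhaustive loop. Here `acc_ref` stands in for whatever routine evaluates the quantized model on the reference task; it is a hypothetical callback, and the toy scoring function below exists only to make the example runnable.

```python
from itertools import product

def search_register(candidates, acc_ref, max_repeats=15):
    """Return (best_score, candidate_index, tau) maximizing acc_ref(candidate, tau)."""
    best = None
    for idx, tau in product(range(len(candidates)), range(1, max_repeats + 1)):
        score = acc_ref(candidates[idx], tau)
        if best is None or score > best[0]:
            best = (score, idx, tau)
    return best

# Toy stand-in for reference accuracy: peaks at tau = 4 for every candidate.
toy_candidates = [0.1, 0.5, 0.3]
toy_acc = lambda z, tau: z - 0.01 * abs(tau - 4)

best_score, best_idx, best_tau = search_register(toy_candidates, toy_acc)
print(best_idx, best_tau)   # 1 4
```

With k candidates and 15 repetition counts, the search costs 15k reference evaluations, which matches the hyperparameter-tuning cost noted under Limitations.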

3. Deletion

$D = \operatorname{arg\,top\text{-}\tilde{k}}\,\{\, \|z\|_\infty \mid z \in \Phi_{l_q}(x_{\text{test}}) \,\}$
  • Adds token deletion layer at the input of quantization-sensitive layers
  • Deletes top-k̃ internally occurring sink tokens with maximum ℓ∞ norm during inference
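
The deletion step can be sketched as dropping the highest-norm rows from the token matrix at inference time. This is a simplified illustration; whether particular tokens (for example a class token) must be exempt from deletion is not stated here, so the sketch treats all tokens alike.

```python
import numpy as np

def delete_sink_tokens(tokens, num_delete):
    """Remove the num_delete tokens with the largest l-infinity norm.

    tokens: (num_tokens, dim) activations entering the quantization-sensitive layer.
    Returns the surviving tokens and their original indices, in original order.
    """
    norms = np.abs(tokens).max(axis=1)
    drop = np.argsort(norms)[::-1][:num_delete]
    keep = np.setdiff1d(np.arange(len(tokens)), drop)   # sorted surviving indices
    return tokens[keep], keep

rng = np.random.default_rng(0)
toy_tokens = rng.normal(size=(6, 4))
toy_tokens[2, 1] = 80.0      # planted sink token
toy_tokens[4, 0] = -120.0    # planted sink token
kept, keep_idx = delete_sink_tokens(toy_tokens, num_delete=2)
print(keep_idx.tolist())     # [0, 1, 3, 5]
```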

Technical Innovations

  1. Middle-Layer Prefixing Strategy: Designed for middle-layer characteristics of vision encoders, differing from early-layer prefixing in LLMs
  2. Universal Register Discovery: Leverages similarity of outlier tokens across different images to construct universal registers
  3. Add-Delete Mechanism: Replaces internally occurring sink tokens with externally precomputed caches, avoiding impact on activation quantization range
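
One way to picture prefixing with a precomputed cache is single-head attention in which cached register keys and values are prepended to those of the image tokens. The sketch below is an assumption-laden illustration (single head, no quantization, one register repeated τ times), not the authors' implementation. It shows that the prefix adds no query rows and that its share of attention mass grows with the repetition count.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_prefix(q, k, v, prefix_k, prefix_v, repeats=1):
    """Single-head attention with cached prefix keys/values prepended.

    q, k, v: (num_tokens, dim) for the image tokens.
    prefix_k, prefix_v: (dim,) cached key and value of one register token.
    Returns the attention output and the attention mass each query puts on the prefix.
    """
    pk = np.tile(prefix_k, (repeats, 1))
    pv = np.tile(prefix_v, (repeats, 1))
    keys = np.concatenate([pk, k], axis=0)
    values = np.concatenate([pv, v], axis=0)
    weights = softmax(q @ keys.T / np.sqrt(q.shape[-1]))
    return weights @ values, weights[:, :repeats].sum(axis=1)

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(5, 8)) for _ in range(3))
prefix_k, prefix_v = rng.normal(size=8), np.zeros(8)

out1, mass1 = attention_with_prefix(q, k, v, prefix_k, prefix_v, repeats=1)
out3, mass3 = attention_with_prefix(q, k, v, prefix_k, prefix_v, repeats=3)
print(out3.shape)   # (5, 8)
```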

Experimental Setup

Datasets

  • ImageNet-1k: For zero-shot image classification evaluation
  • MS-COCO: For image-text retrieval task evaluation
  • Other Classification Datasets: Stanford Cars, Flowers-102, Food-101, CIFAR-100 (for generalization validation)
  • Reference Data: 50,000 images from ImageNet-1k training set for register search

Evaluation Metrics

  • Zero-shot Classification Accuracy: Top-1 accuracy on ImageNet-1k
  • Retrieval Performance: Recall@1 and Recall@5 on MS-COCO
  • Outlier Analysis: Maximum token norm and average token norm
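
For reference, Recall@K on a retrieval task can be computed from a query-by-gallery similarity matrix as sketched below. The sketch assumes exactly one correct gallery item per query, located at the same index; MS-COCO pairs each image with several captions, so a full evaluation needs a ground-truth mapping instead of this diagonal assumption.

```python
import numpy as np

def recall_at_k(similarity, k):
    """Fraction of queries whose correct item (assumed at index i) ranks in the top k."""
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    targets = np.arange(len(similarity))[:, None]
    return (top_k == targets).any(axis=1).mean()

toy_similarity = np.array([
    [0.9, 0.1, 0.0],   # query 0: correct item ranked first
    [0.8, 0.7, 0.1],   # query 1: correct item ranked second
    [0.3, 0.2, 0.1],   # query 2: correct item ranked third
])
r1 = recall_at_k(toy_similarity, 1)
r2 = recall_at_k(toy_similarity, 2)
r3 = recall_at_k(toy_similarity, 3)
print(r1, r2, r3)
```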

Baseline Methods

  • Baseline Quantization Algorithms:
    • PTQ4ViT: Dual uniform quantizer for ViT
    • RepQ-ViT: Scale reparameterization method
    • NoisyQuant: Noise-enhanced activation quantization
  • Precision Settings: W8A8 (8-bit weights, 8-bit activations) and W6A6 (6-bit weights, 6-bit activations)

Implementation Details

  • Uses 1,024 calibration samples for NoisyQuant and 32 for RepQ-ViT
  • Register candidate count k=20, repetition count range τ∈{1,...,15}
  • Token deletion count k̃ tuned through reference task

Experimental Results

Main Results

Zero-shot Image Classification (ImageNet-1k)

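| Model | Precision | Baseline Best | RegCache Best | Improvement |
|---|---|---|---|---|
| CLIP-B/16 | W8A8 | 67.69% | 67.78% | +0.09% |
| CLIP-B/16 | W6A6 | 58.19% | 66.65% | +13.40% |
| SigLIP2-B/16 | W8A8 | 76.92% | 77.26% | +0.34% |
| SigLIP2-B/16 | W6A6 | 64.91% | 70.88% | +5.97% |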

Image-Text Retrieval (MS-COCO)

  • CLIP-B/16: Average improvement of 3.76%-7.97% across all retrieval metrics
  • SigLIP-B/16: Recall@1 improvement of 0.20%, overall stable performance gains

Outlier Mitigation Effects

| Model | Max Token Norm (Original) | Max Token Norm (RegCache) | Reduction Ratio |
|---|---|---|---|
| CLIP | 61.17 | 15.30 | -75.0% |
| OpenCLIP | 122.99 | 12.38 | -89.9% |
| SigLIP2 | 244.78 | 30.45 | -87.6% |
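
The reduction ratios in the table follow directly from the two norm columns as (RegCache - Original) / Original. A short check, using only the values listed above:

```python
rows = {
    "CLIP": (61.17, 15.30),
    "OpenCLIP": (122.99, 12.38),
    "SigLIP2": (244.78, 30.45),
}

def reduction_percent(original, mitigated):
    """Relative change of the maximum token norm, in percent."""
    return 100.0 * (mitigated - original) / original

for name, (original, mitigated) in rows.items():
    print(f"{name}: {reduction_percent(original, mitigated):.1f}%")
# CLIP: -75.0%
# OpenCLIP: -89.9%
# SigLIP2: -87.6%
```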

Ablation Study

Ablation studies on SigLIP show:

  • Prefix Cache Only: Accuracy improves from 69.71% to 74.21%
  • Token Deletion Only: Accuracy drops to 38.51% (demonstrating need for prefix support)
  • Complete RegCache: Accuracy reaches 74.42%

Generalization Validation

Prefixes searched on ImageNet-1k remain effective on other datasets:

  • Stanford Cars: +1.78% to +47.47%
  • Food-101: +9.85% to +51.28%
  • CIFAR-100: +12.81% to +33.00%

Related Work

Transformer Outlier Research

  • Systematic study of activation outliers in large-scale Transformers
  • Outlier behavior of specific tokens in LLMs (such as <BOS>, <SEP>)
  • Outliers in ViT typically correspond to uninformative background patches

Attention Sink Control

  • Attention sink: Tokens that attract excessive attention but contain little semantic information
  • Adding register tokens during training to absorb attention and mitigate attention sink
  • This work leverages sink tokens from a PTQ perspective to improve quantization performance

ViT Post-training Quantization

  • Early methods: Allocate dynamic bit-widths for attention-sensitive layers
  • Existing methods: Isolate and minimize outlier impact through specialized quantization schemes
  • This work: Address outliers through token prefixing rather than quantizer granularity control

Conclusions and Discussion

Main Conclusions

  1. RegCache Effectiveness: Consistently improves performance across multiple vision encoders and quantization methods
  2. Outlier Mitigation Mechanism: Successfully transfers outliers from internal tokens to external precomputed caches
  3. Generalizability: Method applies to both text-supervised and self-supervised vision encoders

Limitations

  1. Hyperparameter Tuning: Requires evaluating multiple prefix candidates to determine optimal configuration
  2. Additional Hyperparameters: Introduces hyperparameters such as maximum token deletion count and prefix token count
  3. Computational Overhead: While FLOPs increase by no more than 0.2%, additional computational cost remains

Future Directions

  1. Multi-modal Difference Research: Deeper understanding of quantization behavior differences between text-supervised and self-supervised models
  2. Outlier Mechanism Understanding: Further investigation of fundamental causes for outlier behavior differences between ViT and LLM
  3. Automated Optimization: Develop methods to automatically determine optimal prefix configurations

In-depth Evaluation

Strengths

  1. Problem Importance: Addresses critical technical challenges in vision encoder quantization
  2. Method Innovation: First to introduce register concept to vision encoder quantization with novel technical approach
  3. Theoretical Insights: Provides deep analysis of essential differences in outlier behavior between vision encoders and LLMs
  4. Comprehensive Experiments: Covers 5 mainstream vision encoders and multiple quantization algorithms with convincing results
  5. Practical Value: Training-free and easily integrated into existing quantization pipelines

Weaknesses

  1. Limited Theoretical Analysis: Lacks deep theoretical explanation for why middle-layer prefixing is effective
  2. Hyperparameter Sensitivity: Method involves multiple hyperparameters that may affect practical deployment convenience
  3. Computational Overhead Analysis: While FLOPs increase is minimal, lacks detailed analysis of memory usage and latency
  4. Limited Scope: Primarily validates ViT architecture; applicability to other vision Transformer architectures insufficiently verified

Impact

  1. Academic Contribution: Provides new technical pathways and theoretical insights for vision encoder quantization
  2. Practical Value: Directly applicable to deployment optimization of existing vision encoders
  3. Reproducibility: Clear method description and detailed experimental setup ensure good reproducibility
  4. Inspirational Value: Provides important reference for cross-modal model optimization technique transfer

Applicable Scenarios

  1. Edge Deployment: Particularly suitable for deploying large-scale vision encoders on resource-constrained devices
  2. Real-time Applications: Applications requiring low-latency visual processing such as autonomous driving and robotic control
  3. Multimodal Systems: Quantization deployment of CLIP-like models in various downstream tasks
  4. Research Tool: Provides effective baseline method for vision Transformer quantization research

References

This paper cites important works across quantization, attention mechanisms, and vision Transformers, including:

  • Original papers on vision encoders such as CLIP and DINOv2
  • ViT quantization methods such as PTQ4ViT and RepQ-ViT
  • Research on attention sink and register tokens
  • Outlier handling methods in LLM quantization

Overall Assessment: This is a high-quality paper that makes a significant contribution to vision encoder quantization. The authors propose an effective, training-free technique and offer valuable insight into how outlier behavior differs between vision encoders and language models.