Group-Wise Optimization for Self-Extensible Codebooks in Vector Quantized Models

Zheng, Li
Vector Quantized Variational Autoencoders (VQ-VAEs) leverage self-supervised learning through reconstruction tasks to represent continuous vectors using the closest vectors in a codebook. However, issues such as codebook collapse persist in VQ models. To address these issues, existing approaches employ implicit static codebooks or jointly optimize the entire codebook, but these methods constrain the codebook's learning capability, leading to reduced reconstruction quality. In this paper, we propose Group-VQ, which performs group-wise optimization on the codebook. Each group is optimized independently, with joint optimization performed within groups. This approach improves the trade-off between codebook utilization and reconstruction performance. Additionally, we introduce a training-free codebook resampling method, allowing post-training adjustment of the codebook size. In image reconstruction experiments under various settings, Group-VQ demonstrates improved performance on reconstruction metrics, and the post-training codebook sampling method achieves the desired flexibility in adjusting the codebook size.

Basic Information

  • Paper ID: 2510.13331
  • Title: Group-Wise Optimization for Self-Extensible Codebooks in Vector Quantized Models
  • Authors: Hong-Kai Zheng, Piji Li (Nanjing University of Aeronautics and Astronautics)
  • Classification: cs.CV
  • Publication Time/Conference: ICLR 2026
  • Paper Link: https://arxiv.org/abs/2510.13331

Abstract

Vector Quantized Variational Autoencoders (VQ-VAEs) perform self-supervised learning through reconstruction tasks, representing continuous vectors using the nearest vectors from a codebook. However, VQ models still suffer from issues such as codebook collapse. To address these problems, existing methods employ either implicit static codebooks or joint optimization of the entire codebook, but these approaches limit the codebook's learning capacity, resulting in degraded reconstruction quality. This paper proposes Group-VQ, which performs group-wise optimization of the codebook. Each group is optimized independently, while joint optimization occurs within groups. This approach improves the trade-off between codebook utilization and reconstruction performance. Furthermore, we introduce a training-free codebook resampling method that allows codebook size adjustment after training. Experiments on image reconstruction across various settings demonstrate that Group-VQ achieves improved performance on reconstruction metrics.

Research Background and Motivation

Problem Description

Vector Quantization (VQ) is a technique that maps continuous features to discrete tokens, widely applied in VQ-VAE. However, traditional VQ training faces the problem of low codebook utilization, where only a fraction of code vectors are used and updated, leading to "codebook collapse," which limits the model's encoding capacity.
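Codebook utilization, as used here, can be read as the fraction of code vectors that are selected at least once over some set of inputs. The snippet below is a minimal sketch of that reading; the function name `codebook_utilization` and the toy index list are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def codebook_utilization(indices, codebook_size):
    """Fraction of codebook entries that were selected at least once."""
    used = np.unique(np.asarray(indices))
    return used.size / codebook_size

# Toy example: 6 quantized tokens drawn from a codebook of 8 entries.
indices = [0, 3, 3, 0, 7, 3]
util = codebook_utilization(indices, codebook_size=8)
print(f"utilization = {util:.1%}")  # only codes 0, 3 and 7 are ever used
```

A low value on a large evaluation set is the symptom described above as codebook collapse.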

Limitations of Existing Methods

  1. Vanilla VQ: Each code vector is updated independently, easily leading to codebook collapse
  2. Joint VQ methods (e.g., SimVQ, VQGAN-LC): Achieve 100% utilization by jointly optimizing the entire codebook through shared parameters, but this limits the codebook's learning capacity

Research Motivation

The authors experimentally found that while Joint VQ rapidly achieves 100% codebook utilization, its reconstruction quality is actually inferior to Vanilla VQ at the same utilization rate. This indicates that there exists a trade-off between codebook utilization and reconstruction performance, requiring a better balancing strategy.

Core Contributions

  1. Proposes Group-VQ method: A group-based codebook optimization method that balances utilization and reconstruction performance in VQ models
  2. Generalizes Joint VQ method: Reinterprets Joint VQ from the perspective of shared parameters and introduces post-training codebook sampling
  3. Training-free codebook adjustment: Enables flexible codebook size adjustment after training without model retraining
  4. Comprehensive experimental validation: Verifies the effectiveness of Group-VQ and codebook resampling on image reconstruction tasks

Methodology Details

Task Definition

Given an image $I \in \mathbb{R}^{H \times W \times 3}$, VQ-VAE first uses an encoder to obtain a feature map $Z \in \mathbb{R}^{h \times w \times d}$. A quantizer then replaces each feature vector $z \in \mathbb{R}^d$ with the nearest code vector from the codebook $C = \{q_i \mid q_i \in \mathbb{R}^d,\ i = 0, 1, \ldots, n-1\}$:

$$q = \arg\min_{q_i \in C} \|z - q_i\|, \quad i = 0, 1, \ldots, n-1$$
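The nearest-code lookup defined above can be sketched in a few lines of NumPy. This is an illustration under simple assumptions (Euclidean distance, features already flattened to rows); the helper name `quantize` and the toy values are hypothetical.

```python
import numpy as np

def quantize(z, codebook):
    """Map each row of z (m, d) to its nearest row of codebook (n, d).

    Returns the selected indices (m,) and the quantized vectors (m, d).
    """
    # Squared Euclidean distances via ||z||^2 - 2 z.q + ||q||^2, shape (m, n).
    d2 = (
        (z ** 2).sum(axis=1, keepdims=True)
        - 2.0 * z @ codebook.T
        + (codebook ** 2).sum(axis=1)
    )
    idx = d2.argmin(axis=1)
    return idx, codebook[idx]

# Toy example: 3 code vectors in 2-D and 2 feature vectors.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
z = np.array([[0.9, 1.2], [-0.1, 0.2]])
idx, q = quantize(z, codebook)
```

In a full VQ-VAE the feature map of shape (h, w, d) would be reshaped to (h*w, d) before this lookup and reshaped back afterwards.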

Model Architecture

Group-VQ Design

Group-VQ partitions the codebook $C$ into $k$ disjoint groups (sub-codebooks):

$$C = \bigcup_{j=0}^{k-1} G_j, \quad G_j \cap G_{j'} = \emptyset \ \text{ if } j \neq j'$$

Each group $G_j$ is updated independently, with joint optimization within the group. For a code vector $q_{jt} \in G_j$, the gradient is:

$$\nabla_{q_{jt}} L_{cmt} = \nabla_{q_{jt}} L_j$$

where $L_j$ collects the loss terms arising from the code vectors in $G_j$. Each group is therefore affected only by gradients generated from its own code vectors.

Codebook Parameterization

Each group $G_j$ is parameterized through shared parameters:

$$G_j = \hat{G}_j W_j + b_j$$

where:

  • $\hat{G}_j \in \mathbb{R}^{n_j \times r_j}$: Codebook core (sampled from a fixed distribution)
  • $W_j \in \mathbb{R}^{r_j \times d}$: Projector (learnable)
  • $b_j \in \mathbb{R}^d$: Bias vector
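The parameterization above can be sketched as follows. Several choices here are assumptions made only for illustration: the cores are drawn from a standard normal distribution, the groups are (nearly) equal in size, every group uses the same rank r, and the bias starts at zero. The function names `build_group_codebook` and `materialize` are hypothetical.

```python
import numpy as np

def build_group_codebook(n, d, k, r=None, seed=0):
    """Create k groups holding n code vectors in total (assumed near-equal sizes)."""
    rng = np.random.default_rng(seed)
    r = d if r is None else r
    sizes = [n // k + (1 if j < n % k else 0) for j in range(k)]
    cores = [rng.standard_normal((n_j, r)) for n_j in sizes]               # fixed cores
    projs = [rng.standard_normal((r, d)) / np.sqrt(r) for _ in range(k)]   # projectors W_j
    biases = [np.zeros(d) for _ in range(k)]                               # biases b_j
    return cores, projs, biases

def materialize(cores, projs, biases):
    """Compute G_j = core_j @ W_j + b_j per group and stack into one codebook."""
    groups = [g @ w + b for g, w, b in zip(cores, projs, biases)]
    group_id = np.concatenate([np.full(len(g), j) for j, g in enumerate(groups)])
    return np.concatenate(groups, axis=0), group_id

cores, projs, biases = build_group_codebook(n=10, d=4, k=3)
codebook, group_id = materialize(cores, projs, biases)
```

Only `projs` (and, if treated as learnable, `biases`) would receive updates during training; the cores stay fixed, which is what makes every code in a group move together.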

Technical Innovations

1. Unified Analytical Perspective

  • Vanilla VQ: $k = n$, each code vector as one group
  • Joint VQ: $k = 1$, entire codebook as one group
  • Group-VQ: $1 \leq k \leq n$, balancing both extremes
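The three cases differ in which code vectors move when a given code is selected. The sketch below illustrates this for k = 2 with a plain squared-error codebook loss, which is an assumption since the exact loss is not spelled out here; `group_gradients` and all toy values are hypothetical. A group with no selected code receives a zero gradient, while inside a selected group even unselected codes move because they share $W_j$ and $b_j$.

```python
import numpy as np

def group_gradients(z, idx, codebook, group_id, cores):
    """Gradients of sum_i ||z_i - q_{idx_i}||^2 w.r.t. each group's (W_j, b_j)."""
    # Per-code gradient: only selected rows receive a contribution.
    grad_codes = np.zeros_like(codebook)
    np.add.at(grad_codes, idx, 2.0 * (codebook[idx] - z))
    grads = []
    for j, core in enumerate(cores):
        g = grad_codes[group_id == j]              # rows of group j, in core order
        grads.append((core.T @ g, g.sum(axis=0)))  # chain rule through core @ W_j + b_j
    return grads

# Two groups of three codes each (k = 2, n = 6); identity projectors for clarity.
core = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
cores = [core, core.copy()]
biases = [np.zeros(2), np.full(2, 100.0)]          # group 1 sits far from the features
codebook = np.concatenate([c @ np.eye(2) + b for c, b in zip(cores, biases)])
group_id = np.repeat([0, 1], 3)

z = np.array([[0.9, 0.1], [0.2, 0.8]])             # both features fall near group 0
idx = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1).argmin(axis=1)

(dW0, db0), (dW1, db1) = group_gradients(z, idx, codebook, group_id, cores)
# First-order change induced on every code of group 0, including the unselected one.
induced = cores[0] @ dW0 + db0
```

With k = 1 the single group is always selected, so the whole codebook moves on every step; with k = n only the selected code vectors move.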

2. Codebook Resampling Mechanism

Because each code vector is generated by projecting a sample from a fixed distribution, new code vectors can be drawn after training:

$$\tilde{q} = \hat{v} W_j, \quad \hat{v} \sim \mathcal{N}(0, I)$$

Two modes are supported:

  • Resampling: Complete codebook replacement
  • Self-extension: Adding new code vectors to the original codebook
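Both modes can be sketched directly from the sampling formula above. The code follows that formula as written, so no bias term is added; whether the bias $b_j$ should be applied to resampled codes is not stated here. The function names and the per-group size lists are illustrative assumptions.

```python
import numpy as np

def sample_codes(proj, m, rng):
    """Draw m new code vectors for one group: v @ W_j with v ~ N(0, I)."""
    v = rng.standard_normal((m, proj.shape[0]))
    return v @ proj

def resample_codebook(projs, sizes, rng):
    """Resampling: build a completely new codebook with sizes[j] codes per group."""
    return np.concatenate([sample_codes(w, m, rng) for w, m in zip(projs, sizes)])

def self_extend(codebook, projs, extra_sizes, rng):
    """Self-extension: keep the existing codes and append freshly sampled ones."""
    return np.concatenate([codebook, resample_codebook(projs, extra_sizes, rng)])

rng = np.random.default_rng(0)
projs = [np.eye(2, 3), 2.0 * np.eye(2, 3)]   # stand-ins for two trained projectors
original = resample_codebook(projs, [4, 4], rng)         # 8 codes
smaller = resample_codebook(projs, [2, 2], rng)          # downsampled to 4 codes
extended = self_extend(original, projs, [4, 4], rng)     # extended to 16 codes
```

Neither mode touches the trained projectors, which is why no retraining is needed to change the codebook size.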

Experimental Setup

Datasets

  • ImageNet-1k: Primary dataset
  • MS-COCO: Supplementary validation
  • Input Resolution: 128×128, downsampling factor f=8 (a 16×16 latent grid)

Evaluation Metrics

  • rFID (reconstruction FID): Distribution distance between reconstructed and original images
  • LPIPS (VGG16): Perceptual similarity
  • PSNR: Peak Signal-to-Noise Ratio
  • SSIM: Structural Similarity Index
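Of these metrics, PSNR has a simple closed form: $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$. The sketch below assumes images scaled to [0, 1]; the function name `psnr` and the toy images are illustrative, and the paper's exact evaluation pipeline may differ.

```python
import numpy as np

def psnr(original, reconstructed, max_value=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, max_value]."""
    diff = np.asarray(original, dtype=float) - np.asarray(reconstructed, dtype=float)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

# Toy example: a uniform error of 0.1 on a [0, 1] image gives MSE = 0.01.
image = np.zeros((4, 4, 3))
noisy = image + 0.1
value = psnr(image, noisy)
```

rFID and LPIPS depend on pretrained networks and are normally computed with existing evaluation libraries rather than by hand.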

Comparison Methods

  • VQGAN, ViT-VQGAN, VQGAN-FC
  • FSQ, LFQ (fixed codebook methods)
  • VQGAN-LC, SimVQ (Joint VQ methods)

Implementation Details

  • Learning rate: 1×10⁻⁴
  • Optimizer: Adam (β₁=0.5, β₂=0.9)
  • Batch size: 32/GPU
  • Hardware: NVIDIA A5000 GPU

Experimental Results

Main Results

Performance Comparison on ImageNet-1k (codebook size 65,536):

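| Method | Groups | Utilization | rFID↓ | LPIPS↓ | PSNR↑ | SSIM↑ |
|---|---|---|---|---|---|---|
| VQGAN | 65,536 | 1.4% | 3.74 | 0.17 | 22.20 | 0.706 |
| SimVQ | 1 | 100.0% | 1.99 | 0.12 | 24.34 | 0.788 |
| Group-VQ | 64 | 99.9% | 1.86 | 0.11 | 24.37 | 0.787 |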

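Group-VQ achieves the best rFID, LPIPS, and PSNR among the listed methods; its SSIM (0.787) is marginally below SimVQ's (0.788). Both substantially outperform the VQGAN baseline, which uses only 1.4% of its codebook.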

Ablation Studies

Impact of Different Group Numbers:

| Groups | 1 | 32 | 64 | 128 | 512 |
|---|---|---|---|---|---|
| Utilization | 100% | 100% | 100% | 95.6% | 78.8% |
| rFID↓ | 6.45 | 6.05 | 6.09 | 6.11 | 6.28 |

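In this ablation, rFID is lowest with 32 groups (6.05) and nearly as low with 64 (6.09), both at 100% utilization. With 128 or more groups, utilization falls below 100% and rFID rises, so 32-64 groups gives the best balance between codebook utilization and reconstruction performance.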

Codebook Resampling Experiments

Codebook Size Adjustment Results:

| Method | Codebook Size | rFID↓ | PSNR↑ |
|---|---|---|---|
| Group-VQ | 65,536 | 1.87 | 24.32 |
| + Downsampling | 32,768 | 2.16 | 24.02 |
| + Upsampling | 131,072 | 1.79 | 24.49 |
| + Self-extension | 131,072 | 1.76 | 24.51 |

The results support the codebook resampling method: halving the codebook to 32,768 raises rFID from 1.87 to 2.16, while doubling it to 131,072 lowers rFID to 1.79 (upsampling) or 1.76 (self-extension), all without retraining.

Visualization Analysis

Through random projection of code vectors to 2D space, the authors found:

  1. Different groups learn different feature distributions
  2. Code vectors within groups are relatively similar, with significant differences between groups
  3. Statistical properties (mean, variance, usage frequency) of each group differ notably

Classification of VQ Improvement Methods

  1. Straight-Through Estimator improvements: Optimizing gradient propagation
  2. Multi-index quantization: RQ-VAE, Product Quantization, etc.
  3. Codebook improvements: The focus of this paper

Joint VQ Methods

  • VQGAN-LC: Uses pre-trained feature initialization + projection layer
  • SimVQ: Random initialization + matrix reparameterization
  • LFQ/FSQ: Fixed codebook to avoid collapse

This paper unifies these methods as "Joint VQ implemented through shared parameters" and proposes a group-wise optimization strategy on this basis.

Conclusions and Discussion

Main Conclusions

  1. Trade-off between codebook utilization and reconstruction quality: 100% utilization does not necessarily lead to optimal reconstruction
  2. Group-wise optimization is an effective balancing strategy: Group-VQ achieves flexible control through group number adjustment
  3. Codebook resampling provides practical value: Post-training codebook size adjustment is feasible

Limitations

  1. Lack of validation on generative tasks: Only tested on reconstruction; downstream generative models are not evaluated
  2. Group number selection requires tuning: Optimal group numbers depend on specific tasks and datasets
  3. Computational complexity: Multi-group optimization may increase training time

Future Directions

  1. Validate Group-VQ effectiveness on generative models (e.g., autoregressive models)
  2. Explore adaptive group number selection strategies
  3. Investigate combinations of Group-VQ with other VQ improvement methods

In-Depth Evaluation

Strengths

  1. Clear theoretical contribution: Unifies understanding of existing VQ methods from a group optimization perspective, providing new analytical insights
  2. Simple and effective method: Group-VQ design is intuitive, easy to implement and understand
  3. Comprehensive experiments: Full validation across multiple datasets and architectures with detailed ablation studies
  4. High practical value: Codebook resampling method addresses flexibility requirements in real applications

Weaknesses

  1. Insufficient theoretical analysis: Lacks theoretical explanation for why group-wise optimization is more effective
  2. Limited applicability scope: Primarily focuses on image reconstruction; effectiveness on other modalities and tasks remains unknown
  3. Missing computational overhead analysis: Lacks detailed analysis of computational costs for multi-group optimization

Impact

  1. Academic value: Provides new optimization insights for VQ research, potentially inspiring subsequent work
  2. Practical value: Codebook resampling method is valuable in actual deployment
  3. Reproducibility: Authors promise to release code, facilitating method adoption

Applicable Scenarios

  1. Image/video encoding: Compression tasks requiring high-quality reconstruction
  2. Multimodal learning: As a universal vector quantization component
  3. Generative models: Serving as tokenizer providing discrete representations for generative models

References

This paper primarily builds upon the following important works:

  1. Van Den Oord et al. (2017) - Original VQ-VAE paper
  2. Zhu et al. (2024b) - SimVQ method
  3. Yu et al. (2023) - LFQ method
  4. Mentzer et al. (2023) - FSQ method

Summary: This is a paper with significant contributions to the VQ field. The Group-VQ method is simple yet effective, providing new insights for codebook optimization. The codebook resampling method has strong practical value. While there is room for improvement in theoretical analysis and applicability scope, overall this is a high-quality research work.