2025-11-10T02:44:53.419690

Group-Wise Optimization for Self-Extensible Codebooks in Vector Quantized Models

Zheng, Li

Vector Quantized Variational Autoencoders (VQ-VAEs) leverage self-supervised learning through reconstruction tasks to represent continuous vectors using the closest vectors in a codebook. However, issues such as codebook collapse persist in the VQ model. To address these issues, existing approaches employ implicit static codebooks or jointly optimize the entire codebook, but these methods constrain the codebook's learning capability, leading to reduced reconstruction quality. In this paper, we propose Group-VQ, which performs group-wise optimization on the codebook. Each group is optimized independently, with joint optimization performed within groups. This approach improves the trade-off between codebook utilization and reconstruction performance. Additionally, we introduce a training-free codebook resampling method, allowing post-training adjustment of the codebook size. In image reconstruction experiments under various settings, Group-VQ demonstrates improved performance on reconstruction metrics. And the post-training codebook sampling method achieves the desired flexibility in adjusting the codebook size.



Basic Information

Paper ID: 2510.13331
Title: Group-Wise Optimization for Self-Extensible Codebooks in Vector Quantized Models
Authors: Hong-Kai Zheng, Piji Li (Nanjing University of Aeronautics and Astronautics)
Classification: cs.CV
Publication Time/Conference: ICLR 2026
Paper Link: https://arxiv.org/abs/2510.13331

Abstract

Vector Quantized Variational Autoencoders (VQ-VAEs) perform self-supervised learning through reconstruction tasks, representing continuous vectors using the nearest vectors from a codebook. However, VQ models still suffer from issues such as codebook collapse. To address these problems, existing methods employ either implicit static codebooks or joint optimization of the entire codebook, but these approaches limit the codebook's learning capacity, resulting in degraded reconstruction quality. This paper proposes Group-VQ, which performs group-wise optimization of the codebook. Each group is optimized independently, while joint optimization occurs within groups. This approach improves the trade-off between codebook utilization and reconstruction performance. Furthermore, we introduce a training-free codebook resampling method that allows codebook size adjustment after training. Experiments on image reconstruction across various settings demonstrate that Group-VQ achieves improved performance on reconstruction metrics.

Research Background and Motivation

Problem Description

Vector Quantization (VQ) is a technique that maps continuous features to discrete tokens, widely applied in VQ-VAE. However, traditional VQ training faces the problem of low codebook utilization, where only a fraction of code vectors are used and updated, leading to "codebook collapse," which limits the model's encoding capacity.
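Codebook utilization can be made concrete with a short sketch. The function name `codebook_utilization` and the toy indices are illustrative assumptions, not taken from the paper; the sketch only counts how many distinct codes the quantizer selected.

```python
import numpy as np

def codebook_utilization(indices: np.ndarray, codebook_size: int) -> float:
    """Fraction of code vectors that were selected at least once."""
    used = np.unique(indices)
    return used.size / codebook_size

# Toy example: a codebook of 8 codes where the quantizer only ever picks 2.
indices = np.array([0, 3, 3, 0, 0, 3])
utilization = codebook_utilization(indices, codebook_size=8)
print(f"utilization = {utilization:.1%}")
```

In this toy case only codes 0 and 3 are ever selected, so the remaining six code vectors would receive no updates, which is the situation described as codebook collapse.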

Limitations of Existing Methods

Vanilla VQ: Each code vector is updated independently, easily leading to codebook collapse
Joint VQ methods (e.g., SimVQ, VQGAN-LC): Achieve 100% utilization by jointly optimizing the entire codebook through shared parameters, but this limits the codebook's learning capacity

Research Motivation

The authors experimentally found that while Joint VQ rapidly achieves 100% codebook utilization, its reconstruction quality is actually inferior to Vanilla VQ at the same utilization rate. This indicates that there exists a trade-off between codebook utilization and reconstruction performance, requiring a better balancing strategy.

Core Contributions

Proposes Group-VQ method: A group-based codebook optimization method that balances utilization and reconstruction performance in VQ models
Generalizes Joint VQ method: Reinterprets Joint VQ from the perspective of shared parameters and introduces post-training codebook sampling
Training-free codebook adjustment: Enables flexible codebook size adjustment after training without model retraining
Comprehensive experimental validation: Verifies the effectiveness of Group-VQ and codebook resampling on image reconstruction tasks

Methodology Details

Task Definition

Given an image $I \in \mathbb{R}^{H \times W \times 3}$ , VQ-VAE first uses an encoder to obtain feature maps $Z \in \mathbb{R}^{h \times w \times d}$ , then replaces each feature vector $z \in \mathbb{R}^d$ through a quantizer with the nearest code vector from codebook $C = \{q_i | q_i \in \mathbb{R}^d, i = 0,1,...,n-1\}$ :

$q = \arg\min_{q_i \in C} \|z - q_i\|, i = 0,1,...,n-1$
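The nearest-neighbor lookup above can be sketched in NumPy as follows. The function name `quantize` and the toy values are assumptions for illustration; Euclidean distance is used, matching the norm in the formula.

```python
import numpy as np

def quantize(z: np.ndarray, codebook: np.ndarray):
    """Map each row of z (m, d) to its nearest row of codebook (n, d)."""
    # Pairwise squared Euclidean distances, shape (m, n).
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)          # index of the nearest code per feature
    return codebook[idx], idx

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]])
z = np.array([[0.9, 1.2], [3.0, 0.1], [0.1, -0.2]])
q, idx = quantize(z, codebook)
print(idx)
```

Minimizing the squared distance selects the same code as minimizing the distance itself, so the square root is skipped.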

Model Architecture

Group-VQ Design

Group-VQ partitions the codebook $C$ into $k$ disjoint groups (sub-codebooks):

$C = \bigcup_{j=0}^{k-1} G_j, \quad G_j \cap G_{j'} = \emptyset \text{ if } j \neq j'$

Each group $G_j$ is updated independently, with joint optimization within groups. For code vector $q_{jt} \in G_j$ , its gradient update is:

$\nabla_{q_{jt}} L_{cmt} = \nabla_{q_{jt}} L_j$

Here $L_j$ denotes the part of the commitment loss $L_{cmt}$ that comes from features quantized to code vectors in $G_j$, so each group receives gradients only from its own code vectors.

Codebook Parameterization

Each group $G_j$ is parameterized through shared parameters:

$G_j = \hat{G}_j W_j + b_j$

where:

$\hat{G}_j \in \mathbb{R}^{n_j \times r_j}$ : Codebook core (sampled from a fixed distribution)
$W_j \in \mathbb{R}^{r_j \times d}$ : Projector (learnable)
$b_j \in \mathbb{R}^d$ : Bias vector
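A minimal sketch of this parameterization, under several assumptions that are not stated above: the cores are drawn from a standard normal and kept frozen, all groups share the same size, and the biases start at zero. The names `cores`, `projectors`, `biases`, and `build_codebook` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4             # code vector dimension
k = 3             # number of groups
n_j, r_j = 5, 4   # codes per group and core dimension (equal across groups here)

# Cores sampled once from a fixed distribution (standard normal assumed).
cores = [rng.standard_normal((n_j, r_j)) for _ in range(k)]
# Per-group shared parameters: projector W_j and bias b_j.
projectors = [rng.standard_normal((r_j, d)) for _ in range(k)]
biases = [np.zeros(d) for _ in range(k)]

def build_codebook(cores, projectors, biases):
    """Materialize every group as G_j = G_hat_j @ W_j + b_j and stack them."""
    groups = [G_hat @ W + b for G_hat, W, b in zip(cores, projectors, biases)]
    return np.concatenate(groups, axis=0), groups

codebook, groups = build_codebook(cores, projectors, biases)
print(codebook.shape)
```

Because each group has its own `W_j` and `b_j`, changing one group's parameters moves every code vector in that group together and leaves the other groups untouched.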

Technical Innovations

1. Unified Analytical Perspective

Vanilla VQ: $k = n$ , each code vector as one group
Joint VQ: $k = 1$ , entire codebook as one group
Group-VQ: $1 \leq k \leq n$ , balancing both extremes
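The three cases can be illustrated by partitioning code indices into $k$ groups. This sketch assumes contiguous, near-equal group sizes, which is a choice made here for illustration; `partition_codebook` is a hypothetical helper.

```python
import numpy as np

def partition_codebook(n: int, k: int):
    """Split code indices 0..n-1 into k disjoint, contiguous groups."""
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    return np.array_split(np.arange(n), k)

n = 8
vanilla = partition_codebook(n, k=n)   # every code vector is its own group
joint = partition_codebook(n, k=1)     # the whole codebook is one group
grouped = partition_codebook(n, k=4)   # intermediate setting
print([g.tolist() for g in grouped])
```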

2. Codebook Resampling Mechanism

Leveraging the generative nature of the codebook, post-training resampling is possible:

$\tilde{q} = \hat{v} W_j, \quad \hat{v} \sim \mathcal{N}(0, I)$

Supporting two modes:

Resampling: Complete codebook replacement
Self-extension: Adding new code vectors to the original codebook
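Both modes can be sketched with one sampling helper. Here `W` is a random stand-in for a trained projector, and `sample_codes` is a hypothetical name; whether the group bias is added to newly sampled vectors is not stated above, so it is left as an optional argument.

```python
import numpy as np

def sample_codes(W, num_codes, rng, b=None):
    """Draw new code vectors v @ W (optionally + b) with v ~ N(0, I)."""
    v = rng.standard_normal((num_codes, W.shape[0]))
    q = v @ W
    return q if b is None else q + b

rng = np.random.default_rng(0)
d, r = 4, 4
W = rng.standard_normal((r, d))        # stands in for a trained projector W_j
old_group = sample_codes(W, 8, rng)    # stands in for the trained group

# Resampling: replace the group entirely, here with a smaller one.
resampled = sample_codes(W, 4, rng)
# Self-extension: keep the original code vectors and append new ones.
extended = np.concatenate([old_group, sample_codes(W, 8, rng)], axis=0)
print(resampled.shape, extended.shape)
```

Neither mode touches `W`, which is why the adjustment requires no further training in this sketch.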

Experimental Setup

Datasets

ImageNet-1k: Primary dataset
MS-COCO: Supplementary validation
Input Resolution: 128×128, downsampling factor f=8

Evaluation Metrics

rFID (reconstruction FID): Distribution distance between reconstructed and original images
LPIPS (VGG16): Perceptual similarity
PSNR: Peak Signal-to-Noise Ratio
SSIM: Structural Similarity Index
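Of these metrics, PSNR is the simplest to state: $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$. A minimal sketch follows, assuming pixel values scaled to [0, 1]; the paper's actual value range and evaluation code are not given here.

```python
import numpy as np

def psnr(original, reconstructed, max_value=1.0):
    """Peak signal-to-noise ratio in dB; max_value is the assumed pixel range."""
    diff = original.astype(np.float64) - reconstructed.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")            # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

original = np.zeros((4, 4, 3))
reconstructed = np.full((4, 4, 3), 0.1)   # uniform error of 0.1, so MSE = 0.01
value = psnr(original, reconstructed)
print(f"{value:.2f} dB")
```

With an MSE of 0.01 and a peak of 1.0, the result is approximately 20 dB.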

Comparison Methods

VQGAN, ViT-VQGAN, VQGAN-FC
FSQ, LFQ (fixed codebook methods)
VQGAN-LC, SimVQ (Joint VQ methods)

Implementation Details

Learning rate: 1×10⁻⁴
Optimizer: Adam (β₁=0.5, β₂=0.9)
Batch size: 32/GPU
Hardware: NVIDIA A5000 GPU

Experimental Results

Main Results

Performance Comparison on ImageNet-1k (codebook size 65,536):

Method	Groups	Utilization	rFID↓	LPIPS↓	PSNR↑	SSIM↑
VQGAN	65,536	1.4%	3.74	0.17	22.20	0.706
SimVQ	1	100.0%	1.99	0.12	24.34	0.788
Group-VQ	64	99.9%	1.86	0.11	24.37	0.787

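Group-VQ achieves the best rFID, LPIPS, and PSNR at near-full utilization (99.9%), while SimVQ is marginally ahead on SSIM (0.788 vs. 0.787). Both far outperform VQGAN, whose utilization is only 1.4%.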

Ablation Studies

Impact of Different Group Numbers:

Groups	1	32	64	128	512
Utilization	100%	100%	100%	95.6%	78.8%
rFID↓	6.45	6.05	6.09	6.11	6.28

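With 1 to 64 groups, utilization stays at 100%; rFID is lowest at 32 groups (6.05), with 64 groups close behind (6.09). Beyond 64 groups, utilization falls (95.6% at 128, 78.8% at 512) and rFID worsens, so 32-64 groups give the best balance between codebook utilization and reconstruction performance.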

Codebook Resampling Experiments

Codebook Size Adjustment Results:

Method	Codebook Size	rFID↓	PSNR↑
Group-VQ	65,536	1.87	24.32
+ Downsampling	32,768	2.16	24.02
+ Upsampling	131,072	1.79	24.49
+ Self-extension	131,072	1.76	24.51

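The results support the codebook resampling method: halving the codebook to 32,768 raises rFID from 1.87 to 2.16, while doubling it to 131,072 lowers rFID to 1.79 (upsampling) or 1.76 (self-extension), with PSNR moving in the same direction, all without retraining.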

Visualization Analysis

Through random projection of code vectors to 2D space, the authors found:

Different groups learn different feature distributions
Code vectors within groups are relatively similar, with significant differences between groups
Statistical properties (mean, variance, usage frequency) of each group differ notably
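A random projection of this kind can be sketched as below. The two synthetic groups stand in for trained code vectors, and the Gaussian projection matrix is an assumption; the exact projection used by the authors is not specified here.

```python
import numpy as np

def random_project_2d(codes: np.ndarray, rng) -> np.ndarray:
    """Project (n, d) code vectors to 2D with a random Gaussian matrix."""
    P = rng.standard_normal((codes.shape[1], 2))
    return codes @ P

rng = np.random.default_rng(0)
# Two synthetic groups with different means (illustrative data only).
group_a = 0.1 * rng.standard_normal((100, 16))
group_b = 0.1 * rng.standard_normal((100, 16)) + 3.0
codes = np.concatenate([group_a, group_b], axis=0)
labels = np.repeat([0, 1], 100)

points = random_project_2d(codes, rng)
for j in (0, 1):
    sel = points[labels == j]
    # Per-group statistics of the projected points.
    print(j, sel.mean(axis=0), sel.var(axis=0))
```

The resulting `points` array can be passed to any scatter-plot routine, colored by `labels`, to compare groups visually.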

Classification of VQ Improvement Methods

Straight-Through Estimator improvements: Optimizing gradient propagation
Multi-index quantization: RQ-VAE, Product Quantization, etc.
Codebook improvements: The focus of this paper

Joint VQ Methods

VQGAN-LC: Uses pre-trained feature initialization + projection layer
SimVQ: Random initialization + matrix reparameterization
LFQ/FSQ: Fixed codebook to avoid collapse

This paper unifies these methods as "Joint VQ implemented through shared parameters" and proposes a group-wise optimization strategy on this basis.

Conclusions and Discussion

Main Conclusions

Trade-off between codebook utilization and reconstruction quality: 100% utilization does not necessarily lead to optimal reconstruction
Group-wise optimization is an effective balancing strategy: Group-VQ achieves flexible control through group number adjustment
Codebook resampling provides practical value: Post-training codebook size adjustment is feasible

Limitations

Lack of validation on generative tasks: Only tested on reconstruction tasks, missing validation on generative models
Group number selection requires tuning: Optimal group numbers depend on specific tasks and datasets
Computational complexity: Multi-group optimization may increase training time

Future Directions

Validate Group-VQ effectiveness on generative models (e.g., autoregressive models)
Explore adaptive group number selection strategies
Investigate combinations of Group-VQ with other VQ improvement methods

In-Depth Evaluation

Strengths

Clear theoretical contribution: Unifies understanding of existing VQ methods from a group optimization perspective, providing new analytical insights
Simple and effective method: Group-VQ design is intuitive, easy to implement and understand
Comprehensive experiments: Full validation across multiple datasets and architectures with detailed ablation studies
High practical value: Codebook resampling method addresses flexibility requirements in real applications

Weaknesses

Insufficient theoretical analysis: Lacks theoretical explanation for why group-wise optimization is more effective
Limited applicability scope: Primarily focuses on image reconstruction; effectiveness on other modalities and tasks remains unknown
Missing computational overhead analysis: Lacks detailed analysis of computational costs for multi-group optimization

Impact

Academic value: Provides new optimization insights for VQ research, potentially inspiring subsequent work
Practical value: Codebook resampling method is valuable in actual deployment
Reproducibility: Authors promise to release code, facilitating method adoption

Applicable Scenarios

Image/video encoding: Compression tasks requiring high-quality reconstruction
Multimodal learning: As a universal vector quantization component
Generative models: Serving as tokenizer providing discrete representations for generative models

References

This paper primarily builds upon the following important works:

Van Den Oord et al. (2017) - Original VQ-VAE paper
Zhu et al. (2024b) - SimVQ method
Yu et al. (2023) - LFQ method
Mentzer et al. (2023) - FSQ method

Summary: This is a paper with significant contributions to the VQ field. The Group-VQ method is simple yet effective, providing new insights for codebook optimization. The codebook resampling method has strong practical value. While there is room for improvement in theoretical analysis and applicability scope, overall this is a high-quality research work.