Bit Allocation Transfer for Perceptual Quality Enhancement of VVC Intra Coding
Yang, Bajić
Mainstream image and video coding standards -- including state-of-the-art codecs like H.266/VVC, AVS3, and AV1 -- adopt a block-based hybrid coding framework. While this framework facilitates straightforward optimization for Peak Signal-to-Noise Ratio (PSNR), it struggles to effectively optimize perceptually-aligned metrics such as Multi-Scale Structural Similarity (MS-SSIM). To address this challenge, this paper proposes a low-complexity method to enhance perceptual quality in VVC intra coding by transferring bit allocation knowledge from end-to-end image compression. We introduce a lightweight model trained with perceptual losses to generate a quantization step map. This map implicitly captures block-level perceptual importance, enabling efficient derivation of a QP map for VVC. Experiments on Kodak and CLIC datasets demonstrate significant advantages, both in execution time and perceptual metric performance, with more than 11% BD-rate reduction in terms of MS-SSIM. Our scheme provides an efficient, practical pathway for perceptual enhancement of traditional codecs.
Mainstream image and video coding standards, including the latest codecs such as H.266/VVC, AVS3, and AV1, adopt a block-based hybrid coding framework. While this framework makes it straightforward to optimize for Peak Signal-to-Noise Ratio (PSNR), it has difficulty optimizing perception-aligned metrics such as Multi-Scale Structural Similarity (MS-SSIM). To address this challenge, the paper proposes a low-complexity method that enhances the perceptual quality of VVC intra coding by transferring bit allocation knowledge from end-to-end image compression. It introduces a lightweight model, trained with perceptual losses, that generates a quantization step-size map. This map implicitly captures block-level perceptual importance and allows the QP map for VVC to be derived efficiently. Experiments on the Kodak and CLIC datasets show advantages in both execution time and perceptual metric performance, with an MS-SSIM BD-rate reduction exceeding 11%.
Traditional block-based video coding standards (such as VVC) primarily optimize MSE/PSNR in rate-distortion optimization (RDO), but these metrics correlate poorly with perceived visual quality. Perception-aligned metrics (such as SSIM, MS-SSIM, and LPIPS) are difficult to use in a traditional block-level RDO framework because they are not additive across blocks and cannot be evaluated independently for each block.
Discrepancy between perceptual quality and traditional metrics: MSE/PSNR exhibits significant gaps from human visual perception; optimizing these metrics does not guarantee good subjective quality
Practical application requirements: Modern video applications increasingly demand higher perceptual quality, necessitating better perceptual optimization methods
Computational complexity challenges: Direct optimization of complex perceptual metrics in traditional encoders incurs prohibitive computational costs
End-to-end compression: While offering flexible optimization of perceptual metrics, it is incompatible with traditional standards
Traditional perceptual optimization methods: Methods such as PerceptQPA demonstrate limited effectiveness
Knowledge distillation methods: Methods such as Distillation require running the encoder network twice, resulting in excessive computational complexity
Proposes a low-complexity bit allocation transfer scheme: Transfers perceptual bit allocation knowledge from end-to-end image compression to the VVC encoder through a lightweight quantization step-size generation model
Establishes a linear relationship between quantization step-size and bit rate: Finds that the bit rate is approximately linear in the reciprocal of the quantization step-size, which simplifies QP map generation
Significantly reduces computational complexity: Compared to existing distillation methods, QP map generation time is reduced to one-tenth or less
Achieves significant performance improvements on multiple datasets: MS-SSIM BD-rate reduction exceeds 11% while maintaining better PSNR performance
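The linear relationship between bit rate and the reciprocal of the quantization step-size can be illustrated with a least-squares fit. The following is a minimal sketch, not the paper's procedure: the function names, the intercept term, and the synthetic data points are assumptions made for illustration only.

```python
import numpy as np


def fit_rate_vs_inverse_step(steps, rates):
    """Least-squares fit of the assumed model: rate = a * (1 / step) + b."""
    inv_step = 1.0 / np.asarray(steps, dtype=float)
    a, b = np.polyfit(inv_step, np.asarray(rates, dtype=float), deg=1)
    return float(a), float(b)


def step_for_target_rate(target_rate, a, b):
    """Invert the fitted line to find the step size that hits a target rate."""
    return a / (target_rate - b)


# Synthetic (hypothetical) measurements that follow the assumed model exactly.
steps = np.array([4.0, 8.0, 16.0, 32.0])
rates = 2.0 / steps + 0.05  # bits per pixel

a, b = fit_rate_vs_inverse_step(steps, rates)
step = step_for_target_rate(0.3, a, b)
```

Because the model has only two parameters, a few (step, rate) samples are enough to predict the step size for a new target rate, which is one way such a relationship could simplify QP map generation.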
Given an input image, generate a QP map applicable to the VVC encoder such that, under the same bit rate constraint, the encoding results achieve better performance on perceptual metrics (SSIM, MS-SSIM, etc.).
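As a rough illustration of this task, the sketch below turns a per-pixel quantization step-size map into a block-level QP map. It assumes the standard H.264/HEVC/VVC convention that the quantization step roughly doubles every 6 QP, Qstep = 2^((QP - 4) / 6). The block size, the averaging rule, and the function names are hypothetical; the summary does not state how the paper maps the model's step sizes to VVC QPs.

```python
import numpy as np


def step_to_qp(step):
    """Assumed mapping from step size to QP: Qstep = 2 ** ((QP - 4) / 6)."""
    return 4.0 + 6.0 * np.log2(step)


def block_qp_map(step_map, block=64, qp_min=0, qp_max=63):
    """Average the step size over each block, then convert it to an integer QP."""
    h, w = step_map.shape
    rows = -(-h // block)  # ceiling division so partial edge blocks are kept
    cols = -(-w // block)
    qp = np.empty((rows, cols), dtype=int)
    for r in range(rows):
        for c in range(cols):
            tile = step_map[r * block:(r + 1) * block, c * block:(c + 1) * block]
            value = np.rint(step_to_qp(tile.mean()))
            qp[r, c] = int(np.clip(value, qp_min, qp_max))
    return qp


# Hypothetical 128x128 step map: finer quantization on the left half.
step_map = np.full((128, 128), 16.0)
step_map[:, :64] = 8.0
qp = block_qp_map(step_map)
```

Blocks with a smaller step size receive a lower QP and therefore more bits, which is how a step-size map can express block-level perceptual importance.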
Following the mainstream hyperprior design, the model optimizes a joint loss:
L = λD + R_main + R_hyper
where λ controls rate-distortion trade-off, D is distortion (MSE or perceptual metric), and R_main and R_hyper correspond to bit rates of quantized latent features and hyperprior, respectively.
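The loss above can be sketched numerically as follows. This is a minimal illustration under stated assumptions: MSE stands in for D (a perceptual term such as 1 - MS-SSIM would replace it), rates are measured in bits per pixel from entropy-model likelihoods, and all function and variable names are hypothetical.

```python
import numpy as np


def rate_bpp(likelihoods, num_pixels):
    """Estimated rate in bits per pixel from entropy-model likelihoods."""
    return float(-np.sum(np.log2(likelihoods)) / num_pixels)


def joint_loss(x, x_hat, y_likelihoods, z_likelihoods, lam):
    """L = lam * D + R_main + R_hyper, with MSE used as the distortion D."""
    num_pixels = x.shape[-2] * x.shape[-1]
    distortion = float(np.mean((x - x_hat) ** 2))
    r_main = rate_bpp(y_likelihoods, num_pixels)   # quantized latent features
    r_hyper = rate_bpp(z_likelihoods, num_pixels)  # hyperprior
    return lam * distortion + r_main + r_hyper


# Toy example: a 4x4 single-channel image with a constant reconstruction error.
x = np.zeros((1, 4, 4))
x_hat = np.full((1, 4, 4), 0.5)
y_likelihoods = np.full((2, 2), 0.5)  # 4 symbols, 1 bit each
z_likelihoods = np.full((1,), 0.25)   # 1 symbol, 2 bits
loss = joint_loss(x, x_hat, y_likelihoods, z_likelihoods, lam=4.0)
```

Here the distortion is 0.25, the main rate is 4/16 = 0.25 bpp, and the hyperprior rate is 2/16 = 0.125 bpp, so the loss is 4 * 0.25 + 0.25 + 0.125 = 1.375.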
Training data: LIU4K dataset containing 607,714 randomly cropped 256×256 patches from 1,600 original images and their 2×/4× bicubic downsampled versions
The paper cites 20 important references covering core works in video coding standards, perceptual quality assessment, end-to-end compression, and knowledge transfer, providing a solid theoretical foundation for the research.