Bit Allocation Transfer for Perceptual Quality Enhancement of VVC Intra Coding

Yang, Bajić

Mainstream image and video coding standards -- including state-of-the-art codecs like H.266/VVC, AVS3, and AV1 -- adopt a block-based hybrid coding framework. While this framework facilitates straightforward optimization for Peak Signal-to-Noise Ratio (PSNR), it struggles to effectively optimize perceptually-aligned metrics such as Multi-Scale Structural Similarity (MS-SSIM). To address this challenge, this paper proposes a low-complexity method to enhance perceptual quality in VVC intra coding by transferring bit allocation knowledge from end-to-end image compression. We introduce a lightweight model trained with perceptual losses to generate a quantization step map. This map implicitly captures block-level perceptual importance, enabling efficient derivation of a QP map for VVC. Experiments on Kodak and CLIC datasets demonstrate significant advantages, both in execution time and perceptual metric performance, with more than 11% BD-rate reduction in terms of MS-SSIM. Our scheme provides an efficient, practical pathway for perceptual enhancement of traditional codecs.

Basic Information

Paper ID: 2510.10970
Title: Bit Allocation Transfer for Perceptual Quality Enhancement of VVC Intra Coding
Authors: Runyu Yang, Ivan V. Bajić (Simon Fraser University)
Classification: eess.IV (Image and Video Processing)
Publication Time/Conference: Picture Coding Symposium 2025, Aachen, Germany
Paper Link: https://arxiv.org/abs/2510.10970

Abstract

Mainstream image and video coding standards, including the latest codecs such as H.266/VVC, AVS3, and AV1, adopt a block-based hybrid coding framework. While this framework facilitates direct optimization for Peak Signal-to-Noise Ratio (PSNR), it is difficult to optimize for perception-aligned metrics such as Multi-Scale Structural Similarity (MS-SSIM). To address this challenge, the paper proposes a low-complexity method that enhances the perceptual quality of VVC intra coding by transferring bit allocation knowledge from end-to-end image compression. It introduces a lightweight model, trained with perceptual losses, that generates a quantization step-size map. This map implicitly captures block-level perceptual importance, from which a QP map for VVC is derived efficiently. Experiments on the Kodak and CLIC datasets demonstrate significant advantages in both execution time and perceptual metric performance, with an MS-SSIM BD-rate reduction exceeding 11%.

Research Background and Motivation

Core Problem

Traditional block-based video coding standards (such as VVC) primarily optimize for MSE/PSNR in rate-distortion optimization (RDO), but these metrics correlate poorly with perceived visual quality. Perception-aligned metrics (such as SSIM, MS-SSIM, LPIPS) are difficult to apply in traditional block-level RDO frameworks because they are not additive across blocks and cannot be evaluated independently per block.

Problem Significance

Discrepancy between perceptual quality and traditional metrics: MSE/PSNR exhibits significant gaps from human visual perception; optimizing these metrics does not guarantee good subjective quality
Practical application requirements: Modern video applications increasingly demand higher perceptual quality, necessitating better perceptual optimization methods
Computational complexity challenges: Direct optimization of complex perceptual metrics in traditional encoders incurs prohibitive computational costs

Limitations of Existing Methods

End-to-end compression: While offering flexible optimization of perceptual metrics, it is incompatible with traditional standards
Traditional perceptual optimization methods: Methods such as PerceptQPA demonstrate limited effectiveness
Knowledge distillation methods: Methods such as Distillation require running the encoder network twice, resulting in excessive computational complexity

Core Contributions

Proposes a low-complexity bit allocation transfer scheme: Transfers perceptual bit allocation knowledge from end-to-end image compression to the VVC encoder through a lightweight quantization step-size generation model
Establishes a linear relationship between bit rate and the reciprocal of the quantization step-size: This empirical finding lets the QP map be derived directly from the step-size map, simplifying QP map generation
Significantly reduces computational complexity: Compared to existing distillation methods, QP map generation time is reduced to one-tenth or less
Achieves significant performance improvements on multiple datasets: MS-SSIM BD-rate reduction exceeds 11%, with a smaller PSNR BD-rate penalty than PerceptQPA and Distillation

Method Details

Task Definition

Given an input image, generate a QP map applicable to the VVC encoder such that, under the same bit rate constraint, the encoding results achieve better performance on perceptual metrics (SSIM, MS-SSIM, etc.).

Model Architecture

Overall Framework

The method comprises two main stages:

Training stage: Training the quantization step-size generation model with perceptual loss
Inference stage: Generating quantization step-size maps and converting them to VVC's QP maps

Quantization Step-Size Generation Model

Architecture design: Employs stacked residual blocks and convolutional layers with stride 2
Output resolution: Same as latent features (original image downsampled 16×)
Activation function: Uses softplus to ensure positive outputs:
```
softplus(x) = ln(1 + e^x)
```
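A minimal sketch of two of the properties listed above, in plain Python rather than a deep-learning framework: a numerically stable softplus, and the size of a map that is 16× smaller than the image. The function names and the rounding-up for image sizes that are not multiples of 16 are assumptions for illustration, not details taken from the paper.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable softplus, ln(1 + e^x)."""
    # max(x, 0) + log1p(exp(-|x|)) is algebraically equal to ln(1 + e^x)
    # but does not overflow when x is large and positive.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def qs_map_shape(height: int, width: int, factor: int = 16) -> tuple:
    """Size of a map downsampled by `factor` in each dimension.

    Rounding up for sizes that are not multiples of `factor` is an assumption.
    """
    return (math.ceil(height / factor), math.ceil(width / factor))
```

For example, `qs_map_shape(512, 768)` gives `(32, 48)`, so each element of the step-size map covers a 16×16 pixel area.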

End-to-End Image Compression Foundation

Based on mainstream hyperprior design, optimizes joint loss:

L = λD + R_main + R_hyper

where λ controls rate-distortion trade-off, D is distortion (MSE or perceptual metric), and R_main and R_hyper correspond to bit rates of quantized latent features and hyperprior, respectively.
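As a rough illustration of how the terms combine, the sketch below estimates a rate term as the negative log2-likelihood of the quantized symbols, normalized per pixel (the usual practice in hyperprior models), and then adds the weighted distortion. The function names, the per-pixel normalization, and the numbers in the usage note are assumptions for illustration.

```python
import math
from typing import Sequence


def rate_bpp(likelihoods: Sequence[float], num_pixels: int) -> float:
    """Estimated rate in bits per pixel: sum of -log2(p) over all symbols."""
    return sum(-math.log2(p) for p in likelihoods) / num_pixels


def joint_loss(lmbda: float, distortion: float,
               r_main: float, r_hyper: float) -> float:
    """L = lambda * D + R_main + R_hyper."""
    return lmbda * distortion + r_main + r_hyper
```

With symbol likelihoods `[0.5, 0.25, 0.125]` spread over 4 pixels, `rate_bpp` returns 1.5 bpp (1 + 2 + 3 bits divided by 4). A larger λ makes the distortion term dominate, pushing the model toward higher quality at higher rate.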

Technical Innovations

1. Quantization Step-Size to Bit Rate Mapping

Experiments reveal an approximately linear relationship between the bit rate and the reciprocal of the quantization step-size:

r_k ≈ 1/QS_k

where r_k is the bit rate of block k, and QS_k is the corresponding quantization step-size.
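Under this relationship, the relative bit allocation across blocks can be read off the step-size map without running an entropy model. A minimal sketch, assuming a pure proportionality with no offset term (a simplification of "linear") and a hypothetical helper name:

```python
from typing import List, Sequence


def relative_rates(qs_map: Sequence[float]) -> List[float]:
    """Share of the bit budget per block, taking r_k proportional to 1/QS_k."""
    if any(qs <= 0 for qs in qs_map):
        raise ValueError("quantization step-sizes must be positive")
    inv = [1.0 / qs for qs in qs_map]
    total = sum(inv)
    # Normalize so the shares sum to 1.
    return [v / total for v in inv]
```

For step-sizes `[1.0, 2.0, 4.0]` the shares are 4/7, 2/7 and 1/7: halving the step-size doubles a block's share.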

2. QP Adaptive Algorithm

Based on the R-λ model, the QP of block k is computed as:

QP_k = QP + 3·log_2(r_k^{β_k}) ≈ QP - 3·log_2(QS_k^{β_k})

where QP is the base QP of the image and β_k is the exponent of the R-λ model for block k.
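The factor 3 appears consistent with the QP-λ relation commonly used in HEVC/VVC reference rate control, QP = 4.2005·ln λ + 13.7122, since 4.2005·ln 2 ≈ 2.91. The sketch below applies the right-hand side of the formula to a step-size map. Several choices are assumptions and not taken from the paper: the step-sizes are normalized by their geometric mean so that an average block keeps the base QP, a single negative β is shared by all blocks (β is negative in the R-λ model), and the offset is clipped to ±4 following the maximum QP offset used in the experiments.

```python
import math
from typing import List, Sequence


def qp_map_from_qs(base_qp: int, qs_map: Sequence[float],
                   beta: float = -1.0, max_offset: int = 4) -> List[int]:
    """Per-block QP_k = QP - 3 * log2(QS_k ** beta), clipped and rounded.

    beta = -1.0 is a placeholder value, not a value from the paper.
    """
    # Geometric-mean normalization (assumption): subtract the mean log2(QS).
    mean_log = sum(math.log2(qs) for qs in qs_map) / len(qs_map)
    qps = []
    for qs in qs_map:
        # log2(QS ** beta) == beta * log2(QS)
        offset = -3.0 * beta * (math.log2(qs) - mean_log)
        # Limit the deviation from the base QP.
        offset = max(-max_offset, min(max_offset, offset))
        qps.append(base_qp + round(offset))
    return qps
```

With `base_qp=32` and step-sizes `[0.5, 1.0, 2.0]`, the result is `[29, 32, 35]`: blocks with a smaller step-size (judged more important by the model) receive a lower QP and hence more bits.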

3. Perceptual Loss Optimization

Three variants are trained, using 1-SSIM, 1-MS-SSIM, and LPIPS as the perceptual distortion D_perc, with the joint loss:

L = λ(αD_perc) + R_main + R_hyper

where α is a metric-specific weight on D_perc.
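To make the 1-SSIM case concrete, the sketch below computes SSIM from whole-patch statistics and plugs 1-SSIM into the loss. This single-window SSIM is a simplified stand-in: standard SSIM averages the same expression over local sliding windows, and MS-SSIM and LPIPS are more involved still. The function names and the λ and rate values in the usage note are assumptions; α = 0.02 is the SSIM setting reported in the implementation details.

```python
from typing import Sequence


def global_ssim(x: Sequence[float], y: Sequence[float],
                data_range: float = 255.0) -> float:
    """SSIM computed from whole-patch statistics (no sliding window)."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    # Standard stabilizing constants with K1 = 0.01 and K2 = 0.03.
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))


def perceptual_loss(lmbda: float, alpha: float, d_perc: float,
                    r_main: float, r_hyper: float) -> float:
    """L = lambda * (alpha * D_perc) + R_main + R_hyper."""
    return lmbda * (alpha * d_perc) + r_main + r_hyper
```

A call such as `perceptual_loss(100.0, 0.02, 1.0 - global_ssim(x, y), r_main, r_hyper)` then gives the training objective for one patch; identical patches yield SSIM = 1 and a zero distortion term.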

Experimental Setup

Datasets

Training data: LIU4K dataset containing 607,714 randomly cropped 256×256 patches from 1,600 original images and their 2×/4× bicubic downsampled versions
Test data:
- Kodak image set: 24 images, approximately 0.35MP
- CLIC 2022 validation/test images: over 2MP

Evaluation Metrics

Traditional metrics: RGB PSNR
Perceptual metrics: SSIM, MS-SSIM, LPIPS
Comprehensive evaluation: BD-rate (Bjøntegaard Delta Rate)
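BD-rate summarizes the average bit-rate difference between two rate-quality curves at equal quality. Below is a minimal standard-library sketch of the classic four-point variant: fit a cubic to log-rate as a function of quality for each codec, integrate both over the shared quality range, and convert the mean difference to a percentage. The function names are hypothetical, the quality axis is rescaled to [0, 1] only for numerical conditioning, and the tooling actually used in the paper is not specified here.

```python
import math
from typing import List, Sequence


def _solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def _cubic_through(xs: Sequence[float], ys: Sequence[float]) -> List[float]:
    """Coefficients c0..c3 of the cubic passing through four points."""
    return _solve([[x ** p for p in range(4)] for x in xs], list(ys))


def _integral(c: Sequence[float], lo: float, hi: float) -> float:
    """Definite integral of sum(c[p] * x**p) over [lo, hi]."""
    return sum(cp / (p + 1) * (hi ** (p + 1) - lo ** (p + 1))
               for p, cp in enumerate(c))


def bd_rate(rate_ref: Sequence[float], qual_ref: Sequence[float],
            rate_test: Sequence[float], qual_test: Sequence[float]) -> float:
    """Average bit-rate change (%) of test vs. reference at equal quality.

    Expects exactly four (rate, quality) points per curve.
    """
    q_min = min(min(qual_ref), min(qual_test))
    q_max = max(max(qual_ref), max(qual_test))
    x_ref = [(q - q_min) / (q_max - q_min) for q in qual_ref]
    x_test = [(q - q_min) / (q_max - q_min) for q in qual_test]
    c_ref = _cubic_through(x_ref, [math.log(r) for r in rate_ref])
    c_test = _cubic_through(x_test, [math.log(r) for r in rate_test])
    # Integrate only over the quality range covered by both curves.
    lo = max(min(x_ref), min(x_test))
    hi = min(max(x_ref), max(x_test))
    avg = (_integral(c_test, lo, hi) - _integral(c_ref, lo, hi)) / (hi - lo)
    return (math.exp(avg) - 1.0) * 100.0
```

A negative value means the test codec needs fewer bits for the same quality. As a sanity check with made-up numbers, a test curve that reaches the same four quality points at exactly half the reference rate yields a BD-rate of -50%.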

Comparison Methods

VTM-23.0: VVC reference software baseline
PerceptQPA: High-pass filter-based QP adaptation method
Distillation: Knowledge distillation method requiring encoder network execution twice

Implementation Details

QP settings: QP ∈ {37, 32, 27, 22} for rate alignment
Maximum QP offset: Limited to 4 to mitigate blocking artifacts
Training settings: Adam optimizer with initial learning rate 1e-4, trained for 5 epochs
Hyperparameters: α set to 0.02 (SSIM), 0.08 (MS-SSIM), 0.04 (LPIPS) respectively

Experimental Results

Main Results

Kodak Dataset Results

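BD-rate (%) relative to the VTM-23.0 baseline, per metric; negative values indicate bit-rate savings.

Method	PSNR	SSIM	MS-SSIM	LPIPS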
PerceptQPA	2.85	-4.26	-11.86	-11.96
Distillation (MS-SSIM)	2.52	-5.83	-12.74	-13.30
Proposed Method (MS-SSIM)	0.98	-6.19	-11.88	-10.96

CLIC Dataset Results

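BD-rate (%) relative to the VTM-23.0 baseline, per metric; negative values indicate bit-rate savings.

Method	PSNR	SSIM	MS-SSIM	LPIPS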
PerceptQPA	3.20	-2.42	-9.91	-11.51
Distillation (MS-SSIM)	7.55	-3.61	-10.24	-11.97
Proposed Method (MS-SSIM)	2.46	-5.91	-11.26	-10.88

Ablation Studies

Slope Parameter Impact

Adjusting slope from 1.0 to 1.2 enables more aggressive QP adaptation:

MS-SSIM optimization: BD-rate improves from -11.88% to -12.47%
PSNR performance degrades: BD-rate rises from 0.98% to 2.24%

Real Bit Rate vs. Approximation Method

Deriving the QP map from the actual bit rate instead of the reciprocal approximation results in:

Slight decrease in perceptual metric performance
Better PSNR performance maintained

Computational Complexity Analysis

GPU environment: QP map generation requires only approximately 20ms (Kodak images)
CPU environment: Approximately 700ms
Compared to Distillation: Computational complexity reduced to one-tenth or less

Visual Quality Assessment

Visual evaluation at QP 37 demonstrates:

Structural regions: Noticeably improved perceptual quality
High-texture regions: Similar perceptual quality at lower bit rates
Overall performance comparable to PerceptQPA and Distillation

Related Work

Traditional Perceptual Optimization Methods

PerceptQPA: QP adaptation based on high-pass filtering, considering human visual system characteristics
JND-based methods: Bit allocation utilizing Just Noticeable Difference

End-to-End Image Compression

Hyperprior architecture: Variational image compression framework proposed by Ballé et al.
Perceptual optimization: End-to-end models trained directly with perceptual loss
Block-level structure: End-to-end models more aligned with traditional coding frameworks

Knowledge Transfer Methods

Distillation methods: Extracting bit allocation knowledge from end-to-end models
Feature transfer: Utilizing intermediate representations from deep learning models

Conclusions and Discussion

Main Conclusions

Effectiveness: Successfully transfers perceptual bit allocation knowledge from end-to-end image compression to VVC encoder
Efficiency: Significantly reduces computational complexity, making the method practical
Generality: Method is effective for different perceptual metrics (SSIM, MS-SSIM)

Limitations

Limited LPIPS optimization effectiveness: Optimization of deep perceptual metrics remains challenging
Restricted to intra coding: Not yet extended to temporal optimization in video coding
Architecture discrepancy: Architectural differences between end-to-end models and traditional encoders limit knowledge transfer effectiveness

Future Directions

Video coding extension: Incorporating temporal information for perceptual optimization
Machine vision tasks: Bit allocation targeting downstream tasks (e.g., object detection)
Architecture alignment: Adopting end-to-end models more aligned with traditional coding frameworks

In-Depth Evaluation

Strengths

Strong innovation: Proposes linear relationship between quantization step-size and bit rate, simplifying the transfer process
High practical value: Substantially reduces computational complexity, providing practical application potential
Comprehensive experiments: Thorough validation across multiple datasets and metrics
Excellent performance: Significantly improves perceptual metrics while keeping the PSNR BD-rate penalty smaller than that of the compared methods

Weaknesses

Insufficient theoretical analysis: Lacks theoretical explanation for the linear relationship between quantization step-size and bit rate
Limited applicability: Primarily applicable to SSIM and MS-SSIM, with limited effectiveness on LPIPS
Parameter sensitivity: Hyperparameters such as slope require manual tuning
Generalization capability: Generalization capability across different image types requires further verification

Impact

Academic contribution: Provides new insights for perceptual optimization of traditional encoders
Practical value: Low-complexity characteristics enable industrial application potential
Reproducibility: Clear method description and detailed experimental setup

Applicable Scenarios

Video streaming: Applications requiring perceptual quality enhancement under limited bandwidth
Image compression: Image storage and transmission with high perceptual quality requirements
Real-time applications: Scenarios with limited computational resources but requiring perceptual optimization

References

The paper cites 20 important references covering core works in video coding standards, perceptual quality assessment, end-to-end compression, and knowledge transfer, providing a solid theoretical foundation for the research.