Bit Allocation Transfer for Perceptual Quality Enhancement of VVC Intra Coding

Yang, Bajić
Mainstream image and video coding standards -- including state-of-the-art codecs like H.266/VVC, AVS3, and AV1 -- adopt a block-based hybrid coding framework. While this framework facilitates straightforward optimization for Peak Signal-to-Noise Ratio (PSNR), it struggles to effectively optimize perceptually-aligned metrics such as Multi-Scale Structural Similarity (MS-SSIM). To address this challenge, this paper proposes a low-complexity method to enhance perceptual quality in VVC intra coding by transferring bit allocation knowledge from end-to-end image compression. We introduce a lightweight model trained with perceptual losses to generate a quantization step map. This map implicitly captures block-level perceptual importance, enabling efficient derivation of a QP map for VVC. Experiments on Kodak and CLIC datasets demonstrate significant advantages, both in execution time and perceptual metric performance, with more than 11% BD-rate reduction in terms of MS-SSIM. Our scheme provides an efficient, practical pathway for perceptual enhancement of traditional codecs.

Basic Information

  • Paper ID: 2510.10970
  • Title: Bit Allocation Transfer for Perceptual Quality Enhancement of VVC Intra Coding
  • Authors: Runyu Yang, Ivan V. Bajić (Simon Fraser University)
  • Classification: eess.IV (Image and Video Processing)
  • Publication Time/Conference: Picture Coding Symposium 2025, Aachen, Germany
  • Paper Link: https://arxiv.org/abs/2510.10970

Abstract

Mainstream image and video coding standards, including the latest codecs such as H.266/VVC, AVS3, and AV1, adopt a block-based hybrid coding framework. While this framework facilitates direct optimization for Peak Signal-to-Noise Ratio (PSNR), it has difficulty optimizing perception-aligned metrics such as Multi-Scale Structural Similarity (MS-SSIM). To address this challenge, the paper proposes a low-complexity method that enhances the perceptual quality of VVC intra coding by transferring bit allocation knowledge from end-to-end image compression. A lightweight model trained with perceptual losses generates a quantization step-size map that implicitly captures block-level perceptual importance, from which a QP map for VVC is derived efficiently. Experiments on the Kodak and CLIC datasets demonstrate significant advantages in both execution time and perceptual metric performance, with an MS-SSIM BD-rate reduction exceeding 11%.

Research Background and Motivation

Core Problem

Traditional block-based video coding standards (such as VVC) primarily optimize MSE/PSNR in rate-distortion optimization (RDO), but these metrics correlate poorly with perceived visual quality. Perception-aligned metrics (such as SSIM, MS-SSIM, LPIPS) are difficult to apply in traditional block-level RDO frameworks because they are neither additive across blocks nor independent from block to block.

Problem Significance

  1. Discrepancy between perceptual quality and traditional metrics: MSE/PSNR exhibits significant gaps from human visual perception; optimizing these metrics does not guarantee good subjective quality
  2. Practical application requirements: Modern video applications increasingly demand higher perceptual quality, necessitating better perceptual optimization methods
  3. Computational complexity challenges: Direct optimization of complex perceptual metrics in traditional encoders incurs prohibitive computational costs

Limitations of Existing Methods

  1. End-to-end compression: While offering flexible optimization of perceptual metrics, it is incompatible with traditional standards
  2. Traditional perceptual optimization methods: Methods such as PerceptQPA demonstrate limited effectiveness
  3. Knowledge distillation methods: Methods such as Distillation require running the encoder network twice, resulting in excessive computational complexity

Core Contributions

  1. Proposes a low-complexity bit allocation transfer scheme: Transfers perceptual bit allocation knowledge from end-to-end image compression to the VVC encoder through a lightweight quantization step-size generation model
  2. Establishes a linear relationship between quantization step-size and bit rate: Finds that the bit rate is approximately linear in the reciprocal of the quantization step-size, which simplifies QP map generation
  3. Significantly reduces computational complexity: Compared to the existing distillation method, QP map generation time is reduced to one-tenth or less
  4. Achieves significant performance improvements on multiple datasets: MS-SSIM BD-rate reduction exceeds 11%, with a smaller PSNR BD-rate increase than PerceptQPA and Distillation

Method Details

Task Definition

Given an input image, generate a QP map applicable to the VVC encoder such that, under the same bit rate constraint, the encoding results achieve better performance on perceptual metrics (SSIM, MS-SSIM, etc.).

Model Architecture

Overall Framework

The method comprises two main stages:

  1. Training stage: Training the quantization step-size generation model with perceptual loss
  2. Inference stage: Generating quantization step-size maps and converting them to VVC's QP maps

Quantization Step-Size Generation Model

  • Architecture design: Employs stacked residual blocks and convolutional layers with stride 2
  • Output resolution: Same as latent features (original image downsampled 16×)
  • Activation function: Uses softplus to ensure positive outputs:
    softplus(x) = ln(1 + e^x)
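
For reference, softplus can be evaluated without overflow through the identity ln(1 + e^x) = max(x, 0) + ln(1 + e^(-|x|)). The sketch below shows this, together with a helper for the size of the 16×-downsampled map; both function names are hypothetical, and rounding the map size up (that is, padding the image) is an assumption, not something the summary states.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable softplus, ln(1 + e^x)."""
    # exp is only ever called on a non-positive argument, so it cannot overflow.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def latent_map_size(height: int, width: int, factor: int = 16):
    """Rows and columns of the step-size map, rounding up (assumed padding)."""
    return (-(-height // factor), -(-width // factor))
```

For a 512×768 image, `latent_map_size(512, 768)` gives a 32×48 map, one step-size per 16×16 pixel area.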
    

End-to-End Image Compression Foundation

The end-to-end codec follows the mainstream hyperprior design and is trained with the joint loss:

L = λD + R_main + R_hyper

where λ controls rate-distortion trade-off, D is distortion (MSE or perceptual metric), and R_main and R_hyper correspond to bit rates of quantized latent features and hyperprior, respectively.

Technical Innovations

1. Quantization Step-Size to Bit Rate Mapping

Experiments reveal an approximately linear relationship between the bit rate and the reciprocal of the quantization step-size:

r_k ≈ 1/QS_k

where r_k is the bit rate of block k, and QS_k is the corresponding quantization step-size.
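
One way to check such a relationship on measured data is an ordinary least-squares fit of per-block rate against 1/QS. The sketch below is illustrative only: the function name `fit_line` is hypothetical and the step-size and rate values are synthetic, constructed to be exactly linear, not measurements from the paper.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y ~ a*x + b; returns (a, b, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return a, b, 1.0 - ss_res / ss_tot


# Synthetic values for illustration only (not data from the paper).
step_sizes = [0.5, 1.0, 2.0, 4.0]
rates = [2.0 / qs + 0.1 for qs in step_sizes]  # built to be linear in 1/QS

# Regress the rate on the reciprocal step-size, not on the step-size itself.
slope, intercept, r2 = fit_line([1.0 / qs for qs in step_sizes], rates)
```

An R² close to 1 on real per-block measurements would support using 1/QS_k as a stand-in for r_k.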

2. QP Adaptive Algorithm

Based on the R-λ model, the block-level QP is computed from the base QP as:

QP_k = QP + 3·log_2(r_k^(β_k)) ≈ QP - 3·log_2(QS_k^(β_k))
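
A minimal sketch of this mapping follows, under stated assumptions: a single β shared by all blocks (the formula allows a per-block β_k), rounding of offsets to integers, and symmetric clipping to the maximum offset of 4 listed under Implementation Details. The function name and sample values are hypothetical. In the usual R-λ parameterisation (λ = α·R^β) the exponent β is negative, so a larger step-size yields a higher QP; the example uses β = -1.0 purely for illustration.

```python
import math


def qp_map_from_step_sizes(base_qp, step_sizes, beta, max_offset=4):
    """Map a 2-D grid of step-sizes QS_k to QPs via QP_k = QP - 3*beta*log2(QS_k)."""
    qp_map = []
    for row in step_sizes:
        qp_row = []
        for qs in row:
            offset = -3.0 * beta * math.log2(qs)
            # Assumed symmetric clipping of the QP offset.
            offset = max(-max_offset, min(max_offset, offset))
            qp_row.append(base_qp + round(offset))
        qp_map.append(qp_row)
    return qp_map


# Illustrative step-sizes only; beta = -1.0 is an assumed value.
example = qp_map_from_step_sizes(32, [[0.5, 1.0], [2.0, 8.0]], beta=-1.0)
```

Here the step-size 8.0 would give an offset of 9, which the clip reduces to 4.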

3. Perceptual Loss Optimization

Three perceptual variants are trained, using 1-SSIM, 1-MS-SSIM, and LPIPS respectively as the distortion D_perc, with the joint loss:

L = λ(αD_perc) + R_main + R_hyper
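
As a worked instance of this loss, the sketch below plugs in the α values listed under Implementation Details. The function name, the λ value, and the rate and quality numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Per-metric weights alpha as listed under Implementation Details.
ALPHA = {"ssim": 0.02, "ms_ssim": 0.08, "lpips": 0.04}


def perceptual_rd_loss(lam, metric, d_perc, r_main, r_hyper):
    """L = lam * (alpha * D_perc) + R_main + R_hyper."""
    return lam * (ALPHA[metric] * d_perc) + r_main + r_hyper


# Hypothetical numbers: MS-SSIM of 0.95 gives D_perc = 1 - 0.95 = 0.05.
loss = perceptual_rd_loss(lam=100.0, metric="ms_ssim", d_perc=1.0 - 0.95,
                          r_main=0.40, r_hyper=0.02)
```

With these numbers the distortion term is 100 × 0.08 × 0.05 = 0.4 and the rate terms add 0.42, giving a loss of about 0.82.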

Experimental Setup

Datasets

  1. Training data: LIU4K dataset containing 607,714 randomly cropped 256×256 patches from 1,600 original images and their 2×/4× bicubic downsampled versions
  2. Test data:
    • Kodak image set: 24 images of 768×512 pixels, approximately 0.39MP
    • CLIC 2022 validation/test images: over 2MP

Evaluation Metrics

  • Traditional metrics: RGB PSNR
  • Perceptual metrics: SSIM, MS-SSIM, LPIPS
  • Comprehensive evaluation: BD-rate (Bjøntegaard Delta Rate)
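
BD-rate summarizes the average bit-rate difference between two rate-quality curves at equal quality. The sketch below follows the classic Bjøntegaard cubic approach for exactly four rate points per curve (matching the four QPs used here); whether the paper used this exact variant is not stated. Function names and sample numbers are hypothetical.

```python
import math


def _lagrange(xs, ys, x):
    """Evaluate the polynomial through the points (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def _mean_log_rate(quality, rates, lo, hi):
    """Average of the interpolated log-rate over the quality range [lo, hi]."""
    log_rates = [math.log(r) for r in rates]
    mid = (lo + hi) / 2.0
    # Simpson's rule is exact for the cubic defined by four points.
    integral = (hi - lo) / 6.0 * (
        _lagrange(quality, log_rates, lo)
        + 4.0 * _lagrange(quality, log_rates, mid)
        + _lagrange(quality, log_rates, hi)
    )
    return integral / (hi - lo)


def bd_rate(anchor_rates, anchor_quality, test_rates, test_quality):
    """BD-rate in percent; negative means the test codec needs fewer bits."""
    if not (len(anchor_rates) == len(anchor_quality) == 4
            and len(test_rates) == len(test_quality) == 4):
        raise ValueError("this sketch expects exactly four points per curve")
    # Compare the curves only over their overlapping quality range.
    lo = max(min(anchor_quality), min(test_quality))
    hi = min(max(anchor_quality), max(test_quality))
    diff = (_mean_log_rate(test_quality, test_rates, lo, hi)
            - _mean_log_rate(anchor_quality, anchor_rates, lo, hi))
    return (math.exp(diff) - 1.0) * 100.0


# Hypothetical curves: the test codec uses 10% fewer bits at every quality.
quality = [30.0, 33.0, 36.0, 39.0]
anchor = [0.2, 0.4, 0.8, 1.6]
saving = bd_rate(anchor, quality, [0.9 * r for r in anchor], quality)
```

The example curves differ by a constant 10% in rate, so the result is a BD-rate of about -10%.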

Comparison Methods

  1. VTM-23.0: VVC reference software baseline
  2. PerceptQPA: High-pass filter-based QP adaptation method
  3. Distillation: Knowledge distillation method requiring encoder network execution twice

Implementation Details

  • QP settings: QP ∈ {37, 32, 27, 22} for rate alignment
  • Maximum QP offset: Limited to 4 to mitigate blocking artifacts
  • Training settings: Adam optimizer with initial learning rate 1e-4, trained for 5 epochs
  • Hyperparameters: α set to 0.02 (SSIM), 0.08 (MS-SSIM), 0.04 (LPIPS) respectively

Experimental Results

Main Results

Kodak Dataset Results

BD-rate (%) relative to VTM-23.0; negative values indicate bit-rate savings.

Method                       PSNR    SSIM     MS-SSIM   LPIPS
PerceptQPA                   2.85    -4.26    -11.86    -11.96
Distillation (MS-SSIM)       2.52    -5.83    -12.74    -13.30
Proposed Method (MS-SSIM)    0.98    -6.19    -11.88    -10.96

CLIC Dataset Results

BD-rate (%) relative to VTM-23.0; negative values indicate bit-rate savings.

Method                       PSNR    SSIM     MS-SSIM   LPIPS
PerceptQPA                   3.20    -2.42    -9.91     -11.51
Distillation (MS-SSIM)       7.55    -3.61    -10.24    -11.97
Proposed Method (MS-SSIM)    2.46    -5.91    -11.26    -10.88

Ablation Studies

Slope Parameter Impact

Adjusting slope from 1.0 to 1.2 enables more aggressive QP adaptation:

  • MS-SSIM optimization: BD-rate improves from -11.88% to -12.47%
  • PSNR performance slightly decreases: PSNR BD-rate rises from 0.98% to 2.24%

Real Bit Rate vs. Approximation Method

Replacing the reciprocal approximation with the real bit rate yields:

  • Slight decrease in perceptual metric performance
  • Better PSNR performance maintained

Computational Complexity Analysis

  • GPU environment: QP map generation requires only approximately 20ms (Kodak images)
  • CPU environment: Approximately 700ms
  • Compared to Distillation: Computational complexity reduced to one-tenth or less

Visual Quality Assessment

Visual evaluation at QP 37 demonstrates:

  • Structural regions: Noticeably improved perceptual quality
  • High-texture regions: Similar perceptual quality at lower bit rates
  • Overall performance comparable to PerceptQPA and Distillation

Related Work

Traditional Perceptual Optimization Methods

  1. PerceptQPA: QP adaptation based on high-pass filtering, considering human visual system characteristics
  2. JND-based methods: Bit allocation utilizing Just Noticeable Difference

End-to-End Image Compression

  1. Hyperprior architecture: Variational image compression framework proposed by Ballé et al.
  2. Perceptual optimization: End-to-end models trained directly with perceptual loss
  3. Block-level structure: End-to-end models more aligned with traditional coding frameworks

Knowledge Transfer Methods

  1. Distillation methods: Extracting bit allocation knowledge from end-to-end models
  2. Feature transfer: Utilizing intermediate representations from deep learning models

Conclusions and Discussion

Main Conclusions

  1. Effectiveness: Successfully transfers perceptual bit allocation knowledge from end-to-end image compression to VVC encoder
  2. Efficiency: Significantly reduces computational complexity, making the method practical
  3. Generality: Method is effective for different perceptual metrics (SSIM, MS-SSIM)

Limitations

  1. Limited LPIPS optimization effectiveness: Optimization of deep perceptual metrics remains challenging
  2. Restricted to intra coding: Not yet extended to temporal optimization in video coding
  3. Architecture discrepancy: Architectural differences between end-to-end models and traditional encoders limit knowledge transfer effectiveness

Future Directions

  1. Video coding extension: Incorporating temporal information for perceptual optimization
  2. Machine vision tasks: Bit allocation targeting downstream tasks (e.g., object detection)
  3. Architecture alignment: Adopting end-to-end models more aligned with traditional coding frameworks

In-Depth Evaluation

Strengths

  1. Strong innovation: Proposes linear relationship between quantization step-size and bit rate, simplifying the transfer process
  2. High practical value: Substantially reduces computational complexity, providing practical application potential
  3. Comprehensive experiments: Thorough validation across multiple datasets and metrics
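  4. Excellent performance: Significantly improves perceptual metrics with a smaller PSNR BD-rate penalty than the compared methods (0.98% on Kodak, versus 2.85% for PerceptQPA and 2.52% for Distillation)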

Weaknesses

  1. Insufficient theoretical analysis: Lacks theoretical explanation for the linear relationship between quantization step-size and bit rate
  2. Limited applicability: Primarily applicable to SSIM and MS-SSIM, with limited effectiveness on LPIPS
  3. Parameter sensitivity: Hyperparameters such as slope require manual tuning
  4. Generalization capability: Generalization capability across different image types requires further verification

Impact

  1. Academic contribution: Provides new insights for perceptual optimization of traditional encoders
  2. Practical value: Low-complexity characteristics enable industrial application potential
  3. Reproducibility: Clear method description and detailed experimental setup

Applicable Scenarios

  1. Video streaming: Applications requiring perceptual quality enhancement under limited bandwidth
  2. Image compression: Image storage and transmission with high perceptual quality requirements
  3. Real-time applications: Scenarios with limited computational resources but requiring perceptual optimization

References

The paper cites 20 important references covering core works in video coding standards, perceptual quality assessment, end-to-end compression, and knowledge transfer, providing a solid theoretical foundation for the research.