DeePAQ: A Perceptual Audio Quality Metric Based On Foundational Models and Weakly Supervised Learning

Jiang, Brendel, Delgado et al.
This paper presents the Deep learning-based Perceptual Audio Quality metric (DeePAQ) for evaluating general audio quality. Our approach leverages metric learning together with the music foundation model MERT, guided by surrogate labels, to construct an embedding space that captures distortion intensity in general audio. To the best of our knowledge, DeePAQ is the first in the general audio quality domain to leverage weakly supervised labels and metric learning for fine-tuning a music foundation model with Low-Rank Adaptation (LoRA), a direction not yet explored by other state-of-the-art methods. We benchmark the proposed model against state-of-the-art objective audio quality metrics across listening tests spanning audio coding and source separation. Results show that our method surpasses existing metrics in detecting coding artifacts and generalizes well to unseen distortions such as source separation, highlighting its robustness and versatility.

Basic Information

  • Paper ID: 2510.12326
  • Title: DeePAQ: A Perceptual Audio Quality Metric Based On Foundational Models and Weakly Supervised Learning
  • Authors: Guanxin Jiang, Andreas Brendel, Pablo M. Delgado, Jürgen Herre
  • Institution: International Audio Laboratories Erlangen, Fraunhofer Institute for Integrated Circuits IIS
  • Category: eess.AS (Audio and Speech Processing)
  • Publication Date: October 14, 2025
  • Paper Link: https://arxiv.org/abs/2510.12326

Abstract

This paper proposes DeePAQ, a deep learning-based perceptual audio quality metric for assessing general-purpose audio quality. The method combines metric learning with the MERT music foundation model, constructing an embedding space guided by proxy labels that captures the intensity of generic audio distortions. To the authors' knowledge, DeePAQ is the first method in the general audio quality domain to leverage weakly supervised labels and metric learning to fine-tune a music foundation model via Low-Rank Adaptation (LoRA). In listening tests covering audio coding and source separation, the method surpasses existing objective audio quality metrics, demonstrates excellent performance in detecting coding artifacts, and exhibits strong generalization to unseen distortions such as source separation.

Research Background and Motivation

Problem Definition

Audio quality assessment is a core problem in audio processing. Subjective listening tests are the most accurate way to measure quality, but they are time-consuming and expensive, which makes them impractical for routine use. Objective computational methods are therefore needed to estimate perceptual audio quality.

Research Challenges

  1. Data Scarcity: Compared to speech quality assessment, subjective ratings for music content under different distortion types are scarcer and rarely publicly available
  2. Signal Complexity: Music signals exhibit greater variability than speech, including richer harmonic structures, sharp transients from instruments, and intentional distortions introduced by artistic expression
  3. Distortion Matching: Distortions such as perceptual coding artifacts that match or adapt to signal content are particularly difficult to isolate

Limitations of Existing Methods

  • Existing music foundation models (e.g., MERT, CLAP) are primarily optimized for downstream tasks such as music information retrieval and genre classification
  • It remains unclear which embeddings best reflect the perceptual aspects of music quality
  • Existing methods such as Fréchet Audio Distance (FAD) are highly sensitive to test sample size and reference signal selection, with limited reliability

Core Contributions

  1. Novel Approach: First application of weakly supervised labels and metric learning to fine-tune a music foundation model via LoRA in the general audio quality domain
  2. Innovative Training Strategy: Proposes a weakly supervised training objective based on Rank-n-Contrast (RnC) loss, combining ViSQOL pseudo-labels and encoding bitrate labels
  3. Superior Performance: Achieves the highest overall correlation with subjective scores among the compared metrics across multiple listening tests (PCC: 0.918, SRCC: 0.889)
  4. Strong Generalization: Demonstrates excellent performance on both in-domain coding artifact detection and out-of-domain source separation distortions
  5. Dual Reference Mode: Supports both full-reference and non-matching reference evaluation modes

Methodology Details

Task Definition

Construct an embedding function $f: X \rightarrow Z$ that maps audio samples $x_i \in \mathbb{R}^D$ to a quality embedding space $Z$, such that audio samples with similar perceptual quality lie close together in the embedding space, while samples with large quality differences lie farther apart.

Model Architecture

Foundation Model

  • MERT v1: 95M-parameter music foundation model using EnCodec as the tokenization method during pretraining
  • Architecture: 12 transformer layers, producing a 13×768-dimensional feature matrix per time frame
  • Feature Processing: Temporal dimension averaging followed by flattening to a 9,984-dimensional vector for subsequent projection heads

Projection Head Design

  • ReLU activation followed by a linear layer with a 256-dimensional output
  • Maps MERT features to quality-aware embedding space
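
The feature processing and projection head described above can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' code: the function names, the random untrained weights, and the reading of the 13 rows as 12 transformer layer outputs plus the input embedding are assumptions, and a practical implementation would use a tensor library.

```python
import random

NUM_STATES = 13   # assumed: 12 transformer layer outputs + input embedding
HIDDEN_DIM = 768  # per-state feature size, from the text
EMBED_DIM = 256   # projection head output size, from the text


def pool_and_flatten(hidden_states):
    """Average hidden_states[state][frame][dim] over frames, then flatten."""
    flat = []
    for state in hidden_states:
        n_frames = len(state)
        for d in range(len(state[0])):
            flat.append(sum(frame[d] for frame in state) / n_frames)
    return flat


def projection_head(features, weight, bias):
    """ReLU followed by a linear layer; weight holds one row per output dim."""
    activated = [max(0.0, v) for v in features]
    return [sum(w * a for w, a in zip(row, activated)) + b
            for row, b in zip(weight, bias)]


rng = random.Random(0)
n_frames = 4  # hypothetical clip length in frames
hidden = [[[rng.uniform(-1.0, 1.0) for _ in range(HIDDEN_DIM)]
           for _ in range(n_frames)] for _ in range(NUM_STATES)]
features = pool_and_flatten(hidden)

# Hypothetical, untrained weights, only to show the shapes involved.
weight = [[rng.gauss(0.0, 0.01) for _ in range(len(features))]
          for _ in range(EMBED_DIM)]
bias = [0.0] * EMBED_DIM
embedding = projection_head(features, weight, bias)

print(len(features), len(embedding))  # 9984 256
```

Averaging over time makes the feature size independent of clip length, which is why the flattened vector always has 13 × 768 = 9,984 entries.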

Weakly Supervised Training Objective

Proxy Label Construction

  1. ViSQOL Labels: Uses ViSQOL v3 to compute MOS scores (1-5) for each degraded signal relative to clean reference
  2. Bitrate Labels: Encoding bitrate as a coarse indicator of audio quality, with clean signals assigned $b = \infty$

Rank-n-Contrast Loss

Single-sample RnC loss is defined as:

$$L^p_{RNC}(x_i) = -\frac{1}{N-1} \sum_{j=1, j \neq i}^{N} \log \frac{\exp(\|f(x_i) - f(x_j)\|_2)}{\sum_{x_k \in S^p_{i,j}} \exp(\|f(x_i) - f(x_k)\|_2)}$$

where $S^p_{i,j} := \{x_k \in X \mid k \neq i, |y^p_i - y^p_k| \geq |y^p_i - y^p_j|\}$ is the set of samples ranked higher than $x_j$ relative to anchor $x_i$, i.e. the samples whose label $y^p$ differs from the anchor's label at least as much as the label of $x_j$ does.

Overall Loss Function

$$L_{RNC} = \frac{1}{N}\left[\sum_{i=1}^{N} L^{ViSQOL}_{RNC}(x_i) + \sum_{x_i \in X_{coded}} L^p_{RNC}(x_i)\right]$$
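
A minimal Python sketch of the single-sample and overall losses may help make the set $S^p_{i,j}$ concrete. It is not the authors' implementation. Two points are assumptions: the similarity is taken to be the negative Euclidean distance, as in the original Rank-N-Contrast formulation, so that minimizing the loss pulls similarly labeled samples together (the formula as transcribed in this summary shows the distance without a sign); and the second term of the overall loss is read as using the bitrate labels. The function names and the toy batch are hypothetical.

```python
import math


def rnc_loss_single(i, embeddings, labels):
    """Rank-n-Contrast loss for anchor i under one proxy label type."""
    n = len(embeddings)

    def sim(k):
        # Assumed sign convention: similarity = negative Euclidean distance.
        return -math.dist(embeddings[i], embeddings[k])

    total = 0.0
    for j in range(n):
        if j == i:
            continue
        label_gap = abs(labels[i] - labels[j])
        # S_{i,j}: samples at least as far from the anchor in label space as x_j.
        denom = sum(math.exp(sim(k)) for k in range(n)
                    if k != i and abs(labels[i] - labels[k]) >= label_gap)
        total -= sim(j) - math.log(denom)
    return total / (n - 1)


def overall_rnc_loss(embeddings, visqol, bitrate):
    """ViSQOL term over all anchors plus bitrate term over coded anchors."""
    n = len(embeddings)
    total = sum(rnc_loss_single(i, embeddings, visqol) for i in range(n))
    total += sum(rnc_loss_single(i, embeddings, bitrate)
                 for i in range(n) if math.isfinite(bitrate[i]))
    return total / n


# Made-up toy batch: clean, 128, 64 and 16 kbps versions of one item.
visqol = [4.8, 4.5, 3.6, 2.0]            # illustrative MOS values, not real data
bitrate = [math.inf, 128.0, 64.0, 16.0]  # clean signal gets b = infinity
ordered = [[0.0], [1.0], [2.0], [3.0]]   # embeddings follow the quality order
shuffled = [[0.0], [3.0], [1.0], [2.0]]  # embeddings ignore the quality order

loss_ordered = overall_rnc_loss(ordered, visqol, bitrate)
loss_shuffled = overall_rnc_loss(shuffled, visqol, bitrate)
print(loss_ordered < loss_shuffled)  # True
```

Because the bitrate term only uses coded anchors, the infinite label of a clean signal never appears as the anchor label, so every label gap is either finite or infinite and the comparison that builds $S^p_{i,j}$ stays well defined.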

Training Strategy

LoRA Fine-tuning

  • Inserts LoRA matrices into query and value projection layers of attention modules
  • Rank of 8 with scaling factor of 16
  • Only 2.93% of model parameters are trainable, effectively mitigating overfitting on small datasets
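
The LoRA update and the trainable-parameter share can be illustrated with a short sketch. It assumes the standard LoRA form $W x + (\alpha / r) B A x$; the parameter accounting is one plausible reading, not a figure from the paper: it assumes the projection head is counted as trainable and that the frozen backbone has the nominal 95M parameters. All names are hypothetical.

```python
def lora_forward(x, W, A, B, alpha, rank):
    """Compute W x + (alpha / rank) * B (A x) for plain-list matrices."""
    def matvec(M, v):
        return [sum(m * vi for m, vi in zip(row, v)) for row in M]

    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


RANK, ALPHA = 8, 16          # from the text
HIDDEN_DIM, NUM_LAYERS = 768, 12
ADAPTED_PER_LAYER = 2        # query and value projections

# Each adapted 768x768 matrix gains A (rank x 768) and B (768 x rank).
lora_params = NUM_LAYERS * ADAPTED_PER_LAYER * 2 * HIDDEN_DIM * RANK

# Assumption: the 9,984 -> 256 projection head (weights + biases) is trainable.
head_params = 9984 * 256 + 256

# Assumption: the frozen backbone has the nominal 95M parameters.
backbone_params = 95_000_000

trainable = lora_params + head_params
fraction = trainable / (backbone_params + trainable)
print(lora_params, head_params, f"{100 * fraction:.2f}%")
```

Under these assumptions the trainable share comes out at roughly 2.9%, close to the 2.93% quoted above; the small gap would be explained by the exact backbone size.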

Training Configuration

  • Learning rate: 1×10⁻⁴, exponential decay by factor 0.99 after 10 epochs without improvement
  • Weight decay: 0.01, dropout rate: 0.05
  • Batch size: 32

Experimental Setup

Datasets

Training Data

  • Scale: Approximately 460 hours of CD-quality music (44.1 kHz)
  • Encoding Formats: Opus, mp3, AAC
  • Bitrates: 16, 32, 48, 64, 80, 96, 128 kbps
  • Data Split: 122 hours of encoded audio per codec, 45 hours of clean signals
  • Validation Set: 50 hours of music (8 hours clean + 14 hours encoded per codec)
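
As a quick consistency check on the figures above, the per-codec and clean hours can be added up. The reading that the approximately 460 hours cover training and validation material together is an assumption; this summary does not state it.

```latex
\begin{align*}
\text{training:}   \quad & 3 \times 122\,\text{h} + 45\,\text{h} = 411\,\text{h} \\
\text{validation:} \quad & 3 \times 14\,\text{h} + 8\,\text{h} = 50\,\text{h} \\
\text{total:}      \quad & 411\,\text{h} + 50\,\text{h} = 461\,\text{h} \approx 460\,\text{h}
\end{align*}
```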

Test Sets

Includes 9 listening tests divided into two categories:

  1. Audio Coding: IgorC96Multiformat, ODAQ, MPEG USAC validation tests (t1-t3)
  2. Source Separation: 4 subsets of SEBASS dataset (PEASS BAQ, SAOC DB, SASSEC, SiSEC08)

Evaluation Metrics

  • PCC: Pearson Linear Correlation Coefficient
  • SRCC: Spearman Rank Correlation Coefficient
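
Both coefficients can be computed with a few lines of standard-library Python. The sketch below uses the textbook definitions, with tied values sharing their average rank; the example numbers are hypothetical and are not results from the paper.

```python
import math


def pcc(x, y):
    """Pearson linear correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2 + 1
        for k in order[start:end + 1]:
            ranks[k] = shared
        start = end + 1
    return ranks


def srcc(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pcc(average_ranks(x), average_ranks(y))


# Hypothetical example: metric outputs vs. mean listening-test scores.
predicted = [0.9, 0.7, 0.4, 0.35, 0.1]
subjective = [95.0, 80.0, 62.0, 40.0, 15.0]
print(round(pcc(predicted, subjective), 3), round(srcc(predicted, subjective), 3))
```

PCC rewards a linear relationship between metric output and subjective score, while SRCC only checks that the two orderings agree.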

Comparison Methods

  • Traditional Methods: ViSQOL v3, PEAQ ODG, 2f-model, HAAQI
  • Foundation Model Methods: Fine-tuned wav2vec 2.0, FAD (MERT-v1-95M)

Experimental Results

Main Results

Overall Performance

  • Highest Correlation: PCC = 0.918, SRCC = 0.889
  • Consistent Performance: Demonstrates high correlation and consistent performance across most test sets
  • Quality Range: Excellent performance in the high-quality range, somewhat weaker in the low-quality range, which is attributed to scarce training data in that range

Specific Test Performance

  1. IgorC96Multiformat: PCC = 0.954, SRCC = 0.848
  2. ODAQ Overall: PCC = 0.916, SRCC = 0.868
  3. USAC Tests: Achieves PCC above 0.9 in all t1-t3 tests
  4. Source Separation: Overall PCC = 0.919, SRCC = 0.787

Ablation Studies

Training Strategy Comparison

  • LoRA vs Full Fine-tuning: LoRA performs better on small datasets, with gap narrowing as data increases
  • LoRA vs Frozen Backbone: LoRA significantly outperforms training only the projection head on top of a frozen foundation model

Foundation Model Comparison

  • MERT vs wav2vec 2.0: MERT shows more balanced performance on music and speech, while wav2vec 2.0 is biased toward speech

Loss Function Analysis

  • Adding bitrate ordering RnC loss term provides 1-3% performance improvement

Mapping Function

  • Cubic polynomial and MLP mappings significantly improve PCC, with SRCC remaining essentially unchanged
  • Indicates non-linear relationship between embedding distance and subjective scores
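
The observation above follows from how the two coefficients behave under a mapping. As a sketch, write $d$ for the embedding distance produced by the metric, $s$ for the subjective score, and $\hat{s}$ for the mapped prediction; the coefficient names $a_0, \dots, a_3$ are notation introduced here, not taken from the paper.

```latex
\begin{align*}
\hat{s} &= g(d) = a_0 + a_1 d + a_2 d^2 + a_3 d^3 \\
g \text{ strictly monotonic on the observed range of } d
  &\;\Rightarrow\; |\mathrm{SRCC}(g(d), s)| = |\mathrm{SRCC}(d, s)|
\end{align*}
```

PCC measures linear agreement only, so a fitted $g$ can raise it while the rank order, and with it SRCC, stays the same. A fitted cubic is not guaranteed to be monotonic, so the equality is an idealization, which is consistent with SRCC being reported as essentially rather than exactly unchanged.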

Generalization Capability Analysis

  • In-domain Generalization: Excellent performance on coding artifact detection
  • Out-of-domain Generalization: Maintains good performance on unseen distortion types such as source separation
  • Cross-content Generalization: Consistent performance on music, speech, and mixed content

Related Work

Speech Quality Assessment

  • Representative methods use triplet loss for contrastive learning
  • Leverage speech foundation models such as wav2vec 2.0 to encode signals
  • Reflect subjective degradation intensity through Euclidean distance between embeddings

Traditional Audio Quality Metrics

  • PEAQ: Extracts intermediate perceptual features (MOVs), combines through neural networks to produce ODG
  • 2f-model: Utilizes two MOVs from PEAQ Basic, achieving impressive correlation with subjective scores
  • HAAQI: Originally designed for hearing aid applications; the hearing loss simulation is bypassed when it is used for normal-hearing listeners

Music Foundation Model Applications

  • FAD: Used to evaluate generative music models by comparing distributions of embeddings, but sensitive to sample size and reference signal selection
  • MERT/CLAP: Primarily optimized for music information retrieval tasks

Conclusions and Discussion

Main Conclusions

  1. DeePAQ successfully extends the metric learning paradigm from speech quality assessment to the general audio domain
  2. LoRA fine-tuning strategy effectively prevents overfitting on small datasets
  3. Multi-source proxy labels (ViSQOL + bitrate) enhance model robustness
  4. Strong generalization capability makes it applicable to various distortion types

Limitations

  1. Low-Quality Range: Performance inferior to 2f-model in low-quality range due to scarce training data
  2. Source Separation Challenge: PEASS test set presents challenges for all objective metrics
  3. Training Data Constraints: Primarily focused on coding artifacts with limited coverage of other distortion types

Future Directions

  1. Expand Training Data: Include broader distortion types to enhance generalization capability
  2. Improve Non-Matching Reference Model: Enhance performance through more diverse training
  3. End-to-End Optimization: Explore methods directly optimizing subjective score prediction

In-Depth Evaluation

Strengths

  1. Strong Innovation: According to the authors, the first use of weakly supervised labels and metric learning to fine-tune a music foundation model with LoRA in the general audio quality domain
  2. Sound Methodology: Ingenious RnC loss design effectively leverages multi-source proxy labels
  3. Comprehensive Experiments: Thorough evaluation across 9 different listening tests
  4. Strong Generalization: Excellent out-of-domain performance demonstrates method robustness

Weaknesses

  1. Insufficient Theoretical Analysis: Lacks in-depth theoretical explanation for why MERT is suitable for audio quality assessment
  2. Computational Complexity: Does not discuss computational overhead compared to traditional methods
  3. Limited Distortion Coverage: Primarily focuses on coding artifacts with insufficient coverage of other distortion types

Impact

  1. Academic Value: Provides new technical pathways for audio quality assessment research
  2. Practical Value: Applicable to audio codec development and quality monitoring
  3. Reproducibility: Detailed method description and clear experimental setup

Applicable Scenarios

  1. Audio Codec Evaluation: Particularly suitable for coding artifact detection
  2. Audio Processing System Quality Monitoring: Could support automated quality monitoring; suitability for real-time use is not established, since computational cost is not discussed
  3. Multimedia Content Quality Control: Applicable to music and speech content quality assessment

References

The paper cites 26 important references covering core works in speech quality assessment, music foundation models, metric learning, and related fields, providing a solid theoretical foundation for the research.


Overall Assessment: This is a high-quality paper in the audio processing field, demonstrating excellence in methodological innovation, experimental design, and result analysis. DeePAQ brings significant technical breakthroughs to audio quality assessment, with important academic and practical significance.