Automatic Text Pronunciation Correlation Generation and Application for Contextual Biasing
Cheng, Lu, Yang et al.
Effectively distinguishing the pronunciation correlations between different written texts is a significant issue in linguistic acoustics. Traditionally, such pronunciation correlations are obtained through manually designed pronunciation lexicons. In this paper, we propose a data-driven method to automatically acquire these pronunciation correlations, called automatic text pronunciation correlation (ATPC). The supervision required for this method is consistent with the supervision needed for training end-to-end automatic speech recognition (E2E-ASR) systems, i.e., speech and corresponding text annotations. First, the iteratively-trained timestamp estimator (ITSE) algorithm is employed to align the speech with its corresponding annotated text symbols. Then, a speech encoder is used to convert the speech into speech embeddings. Finally, we compare the speech embedding distances of different text symbols to obtain ATPC. Experimental results on Mandarin show that ATPC enhances E2E-ASR performance in contextual biasing and holds promise for dialects or languages lacking manually built pronunciation lexicons.
Effectively distinguishing pronunciation correlations between different written texts is an important problem in linguistic acoustics. Traditionally, such pronunciation correlations are obtained through manually designed pronunciation dictionaries. This paper proposes a data-driven approach to automatically acquire these pronunciation correlations, termed Automatic Text Pronunciation Correlation (ATPC). The supervision required by this method is the same as that needed to train end-to-end automatic speech recognition (E2E-ASR) systems, namely speech and corresponding text annotations. First, the iteratively-trained timestamp estimator (ITSE) algorithm aligns speech with its annotated text symbols. Next, a speech encoder converts the speech into speech embeddings. Finally, ATPC is obtained by comparing the speech embedding distances of different text symbols. Experimental results on Mandarin Chinese show that ATPC enhances E2E-ASR performance in contextual biasing and holds promise for dialects or languages lacking manual pronunciation dictionaries.
The core problem addressed by this research is how to automatically acquire pronunciation correlations between text symbols, an important challenge in linguistic acoustics. Traditional methods rely on manually designed pronunciation dictionaries to establish such correlations, but these dictionaries require expert knowledge to build and are unavailable for many dialects and languages.
While E2E-ASR models have achieved significant progress in speech-to-text modeling, they still fall short in effectively modeling text-to-text pronunciation correlations, particularly in contextual biasing scenarios requiring fine-grained pronunciation understanding.
ATPC Method: Proposes a data-driven approach that generates text pronunciation correlations automatically, without requiring manual pronunciation dictionaries
Unified Supervision Framework: Uses the same supervision signals as E2E-ASR (speech-text pairs), reducing additional annotation costs
Three-Stage Generation Pipeline: Designs a complete ATPC generation pipeline including alignment, embedding extraction, and correlation calculation
Experimental Validation: Validates ATPC effectiveness on Mandarin Chinese datasets for contextual biasing tasks
Open-Source Resources: Provides Chinese ATPC matrix as a public resource
Input: Speech signal and corresponding text annotation
Output: Pronunciation correlation matrix between text symbols
Constraint: No requirement for additional pronunciation dictionaries or expert knowledge
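The input/output contract above can be illustrated with a minimal sketch of the final correlation step. It assumes the alignment and encoder stages have already produced (symbol, embedding) pairs; that pair format, the per-symbol averaging, the function names, and the toy vectors are all assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import defaultdict


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def build_correlation_matrix(aligned_segments):
    """Group embeddings by symbol, average them, and compare all pairs.

    aligned_segments: iterable of (symbol, embedding) pairs standing in for
    the output of the alignment and encoder stages (hypothetical format).
    Averaging per symbol is an assumption, not the paper's stated method.
    """
    by_symbol = defaultdict(list)
    for symbol, embedding in aligned_segments:
        by_symbol[symbol].append(embedding)

    symbols = sorted(by_symbol)
    centroids = [mean_vector(by_symbol[s]) for s in symbols]
    matrix = [[cosine_distance(a, b) for b in centroids] for a in centroids]
    return symbols, matrix


# Invented toy embeddings: 是 and 市 are homophones (shì), 好 (hǎo) is not.
toy_segments = [
    ("是", [1.0, 0.1, 0.0]),
    ("是", [0.9, 0.2, 0.0]),
    ("市", [1.0, 0.2, 0.1]),
    ("好", [0.0, 0.1, 1.0]),
]
symbols, matrix = build_correlation_matrix(toy_segments)
index = {s: i for i, s in enumerate(symbols)}
print(matrix[index["是"]][index["市"]] < matrix[index["是"]][index["好"]])
```

In this toy setup a smaller distance stands for a stronger pronunciation correlation, so the homophone pair ends up closer than the unrelated pair. Real embeddings would come from a speech encoder rather than hand-written vectors.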
Experiments show that layer-15 embeddings perform best on pronunciation discrimination tasks, likely because this layer best balances acoustic features, speech characteristics, lexical identity, and lexical semantic information.
Cosine distance outperforms Euclidean distance across all configurations, with significant improvements in relative disparity (e.g., IPA-layer15 improves from 21.1% to 28.8%).
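The two metrics compared above differ in how they treat embedding magnitude. The sketch below shows only that general property: cosine distance ignores vector scale, while Euclidean distance does not. It does not reproduce the paper's relative-disparity measure, whose definition is not given here, and the vectors are invented for illustration.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Two invented embeddings pointing the same way, one with twice the magnitude.
u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]

print("cosine:", cosine_distance(u, v))        # direction only, close to 0
print("euclidean:", euclidean_distance(u, v))  # grows with the scale gap
```

If embedding norms vary for reasons unrelated to pronunciation, a direction-only metric would be less affected by them. Whether that explains the reported gap is not stated in this summary.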
The paper cites 26 important references, covering:
Classical work in speech recognition and TTS
Latest advances in end-to-end ASR
Related research on contextual biasing
Cutting-edge achievements in speech representation learning
Important contributions to multilingual speech processing
Overall Assessment: This is a research work with significant practical value, proposing an innovative data-driven method to address the practical problem of pronunciation correlation modeling. While there is room for improvement in theoretical depth and multilingual validation, the simplicity and practicality of the method make it promising for real-world applications.