Automatic Text Pronunciation Correlation Generation and Application for Contextual Biasing

Cheng, Lu, Yang et al.

Effectively distinguishing the pronunciation correlations between different written texts is a significant issue in linguistic acoustics. Traditionally, such pronunciation correlations are obtained through manually designed pronunciation lexicons. In this paper, we propose a data-driven method to automatically acquire these pronunciation correlations, called automatic text pronunciation correlation (ATPC). The supervision required for this method is consistent with the supervision needed for training end-to-end automatic speech recognition (E2E-ASR) systems, i.e., speech and corresponding text annotations. First, the iteratively-trained timestamp estimator (ITSE) algorithm is employed to align the speech with their corresponding annotated text symbols. Then, a speech encoder is used to convert the speech into speech embeddings. Finally, we compare the speech embeddings distances of different text symbols to obtain ATPC. Experimental results on Mandarin show that ATPC enhances E2E-ASR performance in contextual biasing and holds promise for dialects or languages lacking artificial pronunciation lexicons.

Basic Information

Paper ID: 2501.00804
Title: Automatic Text Pronunciation Correlation Generation and Application for Contextual Biasing
Authors: Gaofeng Cheng, Haitian Lu, Chengxu Yang, Xuyang Wang, Ta Li, Yonghong Yan
Categories: eess.AS (Audio and Speech Processing), cs.CL (Computational Linguistics)
Publication Date: January 1, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2501.00804

Abstract

Effectively distinguishing pronunciation correlations between different written texts is an important problem in linguistic acoustics. Traditionally, such pronunciation correlations are obtained through manually designed pronunciation dictionaries. This paper proposes a data-driven approach to automatically acquire these pronunciation correlations, termed Automatic Text Pronunciation Correlation (ATPC). The supervision required by this method is the same as that needed to train end-to-end automatic speech recognition (E2E-ASR) systems, namely speech and corresponding text annotations. First, the iteratively-trained timestamp estimator (ITSE) algorithm aligns speech with the corresponding annotated text symbols. Next, a speech encoder converts the speech into speech embeddings. Finally, ATPC is obtained by comparing the speech embedding distances of different text symbols. Experimental results on Mandarin Chinese show that ATPC improves E2E-ASR performance in contextual biasing and holds promise for dialects or languages lacking manual pronunciation dictionaries.

Research Background and Motivation

Problem Definition

The core problem addressed by this research is how to automatically acquire pronunciation correlations between text symbols, an important challenge in linguistic acoustics. Traditional methods rely on manually designed pronunciation dictionaries to establish such correlations, but these dictionaries are costly to build, language-specific, and incomplete in coverage.

Problem Significance

Pronunciation correlations play critical roles in multiple language processing tasks:

Automatic Speech Recognition (ASR): Accurate pronunciation modeling is crucial for recognition accuracy
Text-to-Speech (TTS): Requires accurate pronunciation information to generate natural speech
Contextual Biasing Recognition: Requires fine-grained understanding of pronunciation correlations to handle specific vocabulary

Limitations of Existing Methods

Dependence on Manual Dictionaries: Traditional methods require extensive manually constructed pronunciation dictionaries
Language Specificity: Each language requires specialized dictionary design
Labor Intensive: The manual construction process is time-consuming and resource-intensive
Insufficient Coverage: Difficult to encompass dialectal variants and specialized terminology

Research Motivation

While E2E-ASR models have achieved significant progress in speech-to-text modeling, they still fall short in effectively modeling text-to-text pronunciation correlations, particularly in contextual biasing scenarios requiring fine-grained pronunciation understanding.

Core Contributions

Proposes ATPC Method: First data-driven approach for automatic text pronunciation correlation generation without requiring manual pronunciation dictionaries
Unified Supervision Framework: Uses the same supervision signals as E2E-ASR (speech-text pairs), reducing additional annotation costs
Three-Stage Generation Pipeline: Designs a complete ATPC generation pipeline including alignment, embedding extraction, and correlation calculation
Experimental Validation: Validates ATPC effectiveness on Mandarin Chinese datasets for contextual biasing tasks
Open-Source Resources: Provides Chinese ATPC matrix as a public resource

Methodology Details

Task Definition

Input: Speech signal and corresponding text annotation
Output: Pronunciation correlation matrix between text symbols
Constraint: No requirement for additional pronunciation dictionaries or expert knowledge

Model Architecture

ATPC generation comprises three main stages:

1. ITSE-based Text-Speech Alignment

Purpose: Obtain precise start and end timestamps for each character
Method: Uses the iteratively-trained timestamp estimator (ITSE) algorithm
Advantages:
- Provides more precise start and end timestamps than CTC-based alignment
- Requires no pronunciation dictionary compared to GMM-HMM
- Performs token-level alignment based on E2E-ASR

2. Speech Embedding Extraction and Segmentation

Embedding Extraction: Runs a multilingual speech representation model over each whole sentence to obtain sentence-level sequences of frame embeddings
Model Selection: Experiments with different layers of XLSR-53 and IPA-finetuned versions
Segmentation Strategy: Segments embeddings according to alignment results rather than audio segmentation
Frequency Setting: 50Hz extraction frequency (one frame per 20ms)
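
The segmentation step can be illustrated with a minimal sketch. It assumes the alignment is available as (character, start second, end second) tuples and that the encoder output is a plain list of frame vectors; both formats, and the name slice_by_alignment, are hypothetical and not taken from the paper. Only the 50Hz frame rate comes from the text above.

```python
FRAME_RATE_HZ = 50  # one embedding frame per 20 ms, as stated above


def slice_by_alignment(frames, alignment):
    """Cut a sentence-level frame sequence into per-character segments.

    frames: list of embedding vectors covering the whole utterance.
    alignment: list of (character, start_sec, end_sec) tuples
               (hypothetical format for the aligner output).
    """
    segments = []
    for char, start_sec, end_sec in alignment:
        first = int(round(start_sec * FRAME_RATE_HZ))
        # Keep at least one frame, and never run past the utterance end.
        last = max(first + 1, int(round(end_sec * FRAME_RATE_HZ)))
        last = min(last, len(frames))
        segments.append((char, frames[first:last]))
    return segments


# Toy utterance: 10 frames (0.2 s) of 2-dimensional embeddings.
frames = [[float(i), 0.0] for i in range(10)]
alignment = [("瓜", 0.00, 0.08), ("途", 0.08, 0.20)]
segments = slice_by_alignment(frames, alignment)
```

Because the slices are taken from embeddings computed over the full sentence, each segment still carries the context the encoder saw around that character, which is the motivation given for segmenting embeddings rather than audio.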

3. Pronunciation Correlation Calculation

Distance Metric: Employs Dynamic Time Warping (DTW) algorithm
Embedding Set Construction: Randomly selects E=100 embeddings for each character
Filtering Strategy: Removes characters appearing fewer than 3 times
Distance Calculation:

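\[ \mathrm{Dist}(c_j, c_k) = \frac{1}{M N} \sum_{m=1}^{M} \sum_{n=1}^{N} \mathrm{DTW}(V_j^m, W_k^n) \]

where c_j and c_k are the j-th and k-th characters, V_j^m is the m-th embedding of c_j, W_k^n is the n-th embedding of c_k, and M and N are the numbers of embeddings in the sets for c_j and c_k, respectively.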
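
The correlation calculation above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: it uses the textbook DTW recurrence with an unnormalized accumulated cost, cosine distance between non-zero frames, and hypothetical function names (build_embedding_sets, dist). The values E=100 and the minimum count of 3 come from the text.

```python
import math
import random


def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def dtw(seq_a, seq_b, frame_dist=cosine_distance):
    """Accumulated cost of the best monotonic alignment of two sequences."""
    inf = float("inf")
    rows, cols = len(seq_a), len(seq_b)
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            d = frame_dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[rows][cols]


def build_embedding_sets(segments, max_per_char=100, min_count=3, seed=0):
    """Group segments by character, drop rare characters, sample up to E each."""
    rng = random.Random(seed)
    by_char = {}
    for char, seg in segments:
        by_char.setdefault(char, []).append(seg)
    sets = {}
    for char, segs in by_char.items():
        if len(segs) < min_count:  # filter characters seen fewer than 3 times
            continue
        sets[char] = rng.sample(segs, min(max_per_char, len(segs)))
    return sets


def dist(set_j, set_k):
    """Mean DTW cost over all M x N pairs, as in the formula above."""
    total = sum(dtw(v, w) for v in set_j for w in set_k)
    return total / (len(set_j) * len(set_k))


# Toy data: 2-d "embeddings"; 瓜 and 刮 point the same way, 途 is orthogonal.
toy_segments = [
    ("瓜", [[1.0, 0.0], [2.0, 0.0]]), ("瓜", [[3.0, 0.0]]), ("瓜", [[1.0, 0.0]]),
    ("刮", [[2.0, 0.0]]), ("刮", [[1.0, 0.0], [1.0, 0.0]]), ("刮", [[4.0, 0.0]]),
    ("途", [[0.0, 1.0]]), ("途", [[0.0, 2.0]]), ("途", [[0.0, 3.0]]),
    ("爱", [[1.0, 1.0]]),  # single occurrence, removed by the filter
]
sets = build_embedding_sets(toy_segments)
same = dist(sets["瓜"], sets["刮"])
diff = dist(sets["瓜"], sets["途"])
```

In this toy setup the homophone-like pair gets a distance of zero and the dissimilar pair a clearly larger one. Whether the paper normalizes the DTW cost by path length is not stated here, so the absolute scale of these numbers should not be compared with the reported values.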

Technical Innovations

Dictionary-Free Alignment: ITSE algorithm enables precise alignment without pronunciation dictionaries
Embedding Segmentation Strategy: Performs segmentation in embedding space rather than audio space, preserving contextual information
DTW Distance Metric: Effectively handles distance calculation between embeddings of different lengths
Multilingual Pretraining: Leverages cross-lingual representation capabilities of multilingual models

Experimental Setup

Datasets

BABEL Subset: Used for training speech representation models
- Contains multilingual conversational telephone speech from 23 languages
- Languages include: Cantonese, Assamese, Bengali, Pashto, etc.
Aishell-2 Training Set: Used for training ITSE and generating ATPC
- Mandarin Chinese speech corpus
- Validates cross-lingual performance
Aishell-1 Contextual Biasing Dataset: Used for evaluating ATPC effectiveness
- Development set: 1,334 sentences, 600 hot words
- Test set: 235 sentences, 161 hot words

Evaluation Metrics

Pronunciation Discrimination Ability:
- DTW distance between homophones and non-homophones
- Relative Disparity
Contextual Biasing Performance:
- Character Error Rate (CER)
- Biased Character Error Rate (B-CER)
- Unbiased Character Error Rate (U-CER)
- Hot word Recall/Precision/F1 score (R/P/F)
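
As a quick illustration of how the hot word metrics relate, the sketch below assumes F1 is the usual harmonic mean of recall and precision and checks that assumption against the recall/precision pairs reported in the contextual biasing results of this summary. The function name f1_score is illustrative.

```python
def f1_score(recall, precision):
    """Harmonic mean of recall and precision (both in percent)."""
    return 2 * recall * precision / (recall + precision)


# (recall, precision) pairs as reported in the contextual biasing results.
reported = {
    "Baseline": (28, 99),
    "ATPC": (53, 96),
    "C-g + ATPC": (70, 94),
    "C-g + Manual Dictionary": (77, 98),
}
f1_by_method = {name: round(f1_score(r, p)) for name, (r, p) in reported.items()}
```

The rounded values agree with the reported F1 scores (44, 68, 80, 86), which supports reading the F1 column as a standard harmonic mean.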

Comparison Methods

Shallow Fusion: WFST-based contextual decoding graph method
Deep Biasing: Context Phrase Prediction Network (CPPN) based on AED-CTC structure
Manual Dictionary: Method using hand-crafted pronunciation dictionaries

Implementation Details

Backbone Model: XLSR-53, finetuned on BABEL IPA recognition task
Embedding Layer Selection: Layer 15 embeddings show best performance
Distance Function: Cosine distance outperforms Euclidean distance
Threshold Setting: Contextual biasing threshold of 1.07
Matrix Scale: 3711×3711 ATPC matrix

Experimental Results

Main Results

Pronunciation Discrimination Ability Assessment

Model	Distance	Homophones	Non-homophones	Relative Disparity
XLSR-layer15	Euclidean	105.67	131.66	19.7%
XLSR-layer15	Cosine	0.183	0.258	29.1%
IPA-layer15	Euclidean	394.47	499.87	21.1%
IPA-layer15	Cosine	0.136	0.191	28.8%

Key Findings:

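The IPA-finetuned model is reported to outperform XLSR-53 in pronunciation discrimination; in the layer-15 results above the gain shows under Euclidean distance (21.1% vs. 19.7%), while the cosine results are nearly tied (28.8% vs. 29.1%)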
Layer 15 embeddings show best performance in most cases
Cosine distance consistently outperforms Euclidean distance
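
The summary does not define Relative Disparity. One reading that reproduces all four reported percentages is the gap between the non-homophone and homophone distances, divided by the non-homophone distance. The sketch below checks that assumed definition against the layer-15 figures; the formula is an inference from the numbers, not a definition quoted from the paper.

```python
def relative_disparity(homophone_dist, non_homophone_dist):
    """Assumed definition: (non-homophone - homophone) / non-homophone, in percent."""
    return 100.0 * (non_homophone_dist - homophone_dist) / non_homophone_dist


# (homophone distance, non-homophone distance) from the layer-15 results.
layer15 = {
    ("XLSR", "euclidean"): (105.67, 131.66),
    ("XLSR", "cosine"): (0.183, 0.258),
    ("IPA", "euclidean"): (394.47, 499.87),
    ("IPA", "cosine"): (0.136, 0.191),
}
disparity = {key: round(relative_disparity(h, n), 1) for key, (h, n) in layer15.items()}
```

Under this reading a larger value means homophones sit proportionally closer together than non-homophones, so the metric is comparable across distance functions with very different scales.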

Contextual Biasing Performance

Method	CER % (U-CER/B-CER)	F1 % (Recall/Precision)
Baseline	13.8 (7.3/41.8)	44 (28/99)
ATPC	12.0 (7.3/32.4)	68 (53/96)
C-g + ATPC	10.3 (7.7/21.5)	80 (70/94)
C-g + Manual Dictionary	8.9 (7.4/15.3)	86 (77/98)

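Performance Improvements (ATPC alone vs. baseline):

13.0% relative CER reduction (from 13.8 to 12.0)
22.5% relative B-CER reduction (from 41.8 to 32.4)
25-point absolute improvement in hot word recall (from 28 to 53)
24-point absolute improvement in F1 score (from 44 to 68)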
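
These improvement figures can be re-derived from the results table. The check below assumes the CER and B-CER figures are relative reductions and the recall and F1 figures are absolute differences in percentage points, both comparing the ATPC row with the baseline row; the helper name is illustrative.

```python
def relative_reduction(baseline, improved):
    """Relative reduction in percent, rounded to one decimal place."""
    return round(100.0 * (baseline - improved) / baseline, 1)


baseline = {"cer": 13.8, "b_cer": 41.8, "recall": 28, "f1": 44}
atpc = {"cer": 12.0, "b_cer": 32.4, "recall": 53, "f1": 68}

cer_gain = relative_reduction(baseline["cer"], atpc["cer"])
b_cer_gain = relative_reduction(baseline["b_cer"], atpc["b_cer"])
recall_gain = atpc["recall"] - baseline["recall"]  # percentage points
f1_gain = atpc["f1"] - baseline["f1"]              # percentage points
```

The mix of relative and absolute figures matters when comparing with other work: a 25-point recall gain from a baseline of 28 is close to a doubling in relative terms.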

Ablation Studies

Comparison of Different Layer Embeddings

Experiments demonstrate that layer 15 embeddings achieve optimal performance in pronunciation discrimination tasks, likely because this layer achieves the best balance between acoustic features, speech characteristics, lexical identity, and lexical semantic information.

Distance Function Comparison

Cosine distance outperforms Euclidean distance across all configurations, with significant improvements in relative disparity (e.g., IPA-layer15 improves from 21.1% to 28.8%).

Case Analysis

ATPC Matrix Visualization

Through visualization analysis, the following observations are made:

Lower DTW distance between homophones "刮" (gua1) and "瓜" (gua1)
Higher DTW distance between non-homophones "爱" (ai4) and "途" (tu2)
The overall matrix reflects pronunciation correlations among Mandarin Chinese characters

Experimental Findings

Cross-lingual Transfer Capability: Models pretrained on multilingual data effectively transfer to Mandarin Chinese
Layer Representation Differences: Different layers encode different types of information, with middle layers more suitable for pronunciation modeling
Distance Metric Importance: Cosine distance is more effective for capturing pronunciation similarity
Practical Validation: ATPC as a plug-and-play module effectively enhances ASR performance

Related Work

Pronunciation Modeling Research

Traditional pronunciation modeling primarily relies on:

HMM-GMM Systems: Require detailed pronunciation dictionaries and phoneme alignment
Deep Learning Methods: Still depend on manually constructed pronunciation resources
End-to-End Systems: While reducing dependence on intermediate representations, still fall short in pronunciation correlation modeling

Contextual Biasing Methods

Shallow Fusion: Integrates contextual information during decoding
Deep Biasing: Integrates context-aware mechanisms within the model
This Work's Contribution: Provides a new approach to pronunciation correlation modeling

Speech Representation Learning

Self-Supervised Learning: Models like wav2vec and XLSR provide powerful speech representations
Multilingual Models: Provide foundation for cross-lingual pronunciation modeling
Layer Analysis: Different layers capture information at different abstraction levels

Conclusions and Discussion

Main Conclusions

Method Effectiveness: ATPC successfully achieves automatic pronunciation correlation generation without manual dictionaries
Performance Improvements: Achieves significant improvements in contextual biasing tasks
Practical Value: Provides solutions for languages/dialects lacking pronunciation resources
Plug-and-Play: Easy to integrate into existing ASR systems as a module

Limitations

Performance Gap: Still trails manual dictionaries (C-g + ATPC reaches 10.3% CER and an F1 of 80, versus 8.9% and 86 for C-g + Manual Dictionary)
Data Dependence: Requires sufficient training data to ensure correlation quality
Computational Complexity: Overhead from DTW computation and large-scale matrix storage
Language Specificity: Primarily validated on Mandarin Chinese; generalization to other languages remains to be verified
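
To give a rough sense of the computational overhead, the following back-of-the-envelope estimate uses the 3711×3711 matrix size and E=100 embeddings per character from the summary. It assumes the distance is symmetric (so only unordered character pairs are computed), that every character has the full 100 embeddings (an upper bound), and that entries are stored as 4-byte floats; none of these storage or scheduling details are stated in the source.

```python
num_chars = 3711        # ATPC matrix is 3711 x 3711
max_embeddings = 100    # E = 100 embeddings sampled per character
bytes_per_entry = 4     # assumption: float32 storage

# Unordered character pairs, assuming Dist(c_j, c_k) == Dist(c_k, c_j).
unordered_pairs = num_chars * (num_chars - 1) // 2

# Upper bound on DTW evaluations: every pair compares 100 x 100 embeddings.
dtw_calls = unordered_pairs * max_embeddings ** 2

# Dense storage for the full matrix.
matrix_bytes = num_chars ** 2 * bytes_per_entry
matrix_megabytes = matrix_bytes / 1e6
```

Under these assumptions the one-time generation cost is on the order of tens of billions of DTW evaluations, while the stored matrix is only about 55 MB, so the overhead is dominated by generation rather than storage.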

Future Directions

Multilingual Extension: Generate and apply ATPC across more languages and dialects
OOV Handling: Address challenges in handling out-of-vocabulary characters or words
Data Scale: Leverage larger datasets to enhance ATPC robustness
Resource Standardization: Advance ATPC standardization as a public speech resource with continuous updates

In-Depth Evaluation

Strengths

Strong Innovation: First completely data-driven approach for pronunciation correlation generation
High Practical Value: Addresses real problems in resource-scarce languages
Complete Methodology: Provides end-to-end solution
Comprehensive Experiments: Validates method effectiveness from multiple perspectives
Open-Source Contribution: Releases the Chinese ATPC matrix as a public resource

Weaknesses

Insufficient Theoretical Analysis: Lacks in-depth theoretical explanation for why the method works
Evaluation Limitations: Primarily evaluated on Mandarin Chinese; multilingual generalization not fully verified
Computational Efficiency: High time complexity of DTW computation
Missing Error Analysis: Lacks in-depth analysis of failure cases and error patterns

Impact

Academic Contribution: Provides new research direction for pronunciation modeling
Practical Application: Significant value for ASR systems in resource-scarce languages
Technology Promotion: Simple and easy-to-implement method facilitates widespread adoption
Resource Sharing: Open-source ATPC matrix provides valuable resource to the community

Applicable Scenarios

Resource-Scarce Languages: Languages or dialects lacking pronunciation dictionaries
Rapid Deployment: Scenarios requiring quick ASR system construction
Contextual Biasing: Applications requiring handling of specialized vocabulary or hot words
Multilingual Systems: Building unified multilingual speech processing systems

References

The paper cites 26 important references, covering:

Classical work in speech recognition and TTS
Latest advances in end-to-end ASR
Related research on contextual biasing
Cutting-edge achievements in speech representation learning
Important contributions to multilingual speech processing

Overall Assessment: This is a research work with significant practical value, proposing an innovative data-driven method to address the practical problem of pronunciation correlation modeling. While there is room for improvement in theoretical depth and multilingual validation, the simplicity and practicality of the method make it promising for real-world applications.