Automatic Text Pronunciation Correlation Generation and Application for Contextual Biasing

Cheng, Lu, Yang et al.
Effectively distinguishing the pronunciation correlations between different written texts is a significant issue in linguistic acoustics. Traditionally, such pronunciation correlations are obtained through manually designed pronunciation lexicons. In this paper, we propose a data-driven method to automatically acquire these pronunciation correlations, called automatic text pronunciation correlation (ATPC). The supervision required for this method is consistent with the supervision needed for training end-to-end automatic speech recognition (E2E-ASR) systems, i.e., speech and corresponding text annotations. First, the iteratively-trained timestamp estimator (ITSE) algorithm is employed to align the speech with their corresponding annotated text symbols. Then, a speech encoder is used to convert the speech into speech embeddings. Finally, we compare the speech embeddings distances of different text symbols to obtain ATPC. Experimental results on Mandarin show that ATPC enhances E2E-ASR performance in contextual biasing and holds promise for dialects or languages lacking artificial pronunciation lexicons.

Basic Information

  • Paper ID: 2501.00804
  • Title: Automatic Text Pronunciation Correlation Generation and Application for Contextual Biasing
  • Authors: Gaofeng Cheng, Haitian Lu, Chengxu Yang, Xuyang Wang, Ta Li, Yonghong Yan
  • Categories: eess.AS (Audio and Speech Processing), cs.CL (Computational Linguistics)
  • Publication Date: January 1, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2501.00804

Abstract

Effectively distinguishing pronunciation correlations between different written texts is an important problem in linguistic acoustics. Traditionally, such pronunciation correlations are obtained through manually designed pronunciation dictionaries. This paper proposes a data-driven approach to automatically acquire these pronunciation correlations, termed Automatic Text Pronunciation Correlation (ATPC). The supervision required by this method is consistent with that needed to train end-to-end automatic speech recognition (E2E-ASR) systems, namely speech and corresponding text annotations. First, an Iterative Training Timestamp Estimator (ITSE) algorithm is employed to align speech with corresponding annotated text symbols. Subsequently, a speech encoder converts speech into speech embeddings. Finally, ATPC is obtained by comparing the speech embedding distances of different text symbols. Experimental results on Mandarin Chinese demonstrate that ATPC enhances E2E-ASR performance in contextual biasing and offers promise for dialects or languages lacking manual pronunciation dictionaries.

Research Background and Motivation

Problem Definition

The core problem addressed by this research is how to automatically acquire pronunciation correlations between text symbols, an important challenge in linguistic acoustics. Traditional methods rely on manually designed pronunciation dictionaries to establish such correlations, but these dictionaries are language-specific, labor-intensive to build, and often miss dialectal variants and specialized terminology.

Problem Significance

Pronunciation correlations play critical roles in multiple language processing tasks:

  1. Automatic Speech Recognition (ASR): Accurate pronunciation modeling is crucial for recognition accuracy
  2. Text-to-Speech (TTS): Requires accurate pronunciation information to generate natural speech
  3. Contextual Biasing Recognition: Requires fine-grained understanding of pronunciation correlations to handle specific vocabulary

Limitations of Existing Methods

  1. Dependence on Manual Dictionaries: Traditional methods require extensive manually constructed pronunciation dictionaries
  2. Language Specificity: Each language requires specialized dictionary design
  3. Labor Intensive: The manual construction process is time-consuming and resource-intensive
  4. Insufficient Coverage: Difficult to encompass dialectal variants and specialized terminology

Research Motivation

While E2E-ASR models have achieved significant progress in speech-to-text modeling, they still fall short in effectively modeling text-to-text pronunciation correlations, particularly in contextual biasing scenarios requiring fine-grained pronunciation understanding.

Core Contributions

  1. Proposes ATPC Method: First data-driven approach for automatic text pronunciation correlation generation without requiring manual pronunciation dictionaries
  2. Unified Supervision Framework: Uses the same supervision signals as E2E-ASR (speech-text pairs), reducing additional annotation costs
  3. Three-Stage Generation Pipeline: Designs a complete ATPC generation pipeline including alignment, embedding extraction, and correlation calculation
  4. Experimental Validation: Validates ATPC effectiveness on Mandarin Chinese datasets for contextual biasing tasks
  5. Open-Source Resources: Provides Chinese ATPC matrix as a public resource

Methodology Details

Task Definition

Input: Speech signal and corresponding text annotation
Output: Pronunciation correlation matrix between text symbols
Constraint: No requirement for additional pronunciation dictionaries or expert knowledge

Model Architecture

ATPC generation comprises three main stages:

1. ITSE-based Text-Speech Alignment

  • Purpose: Obtain precise start and end timestamps for each character
  • Method: Uses Iterative Training Timestamp Estimator (ITSE) algorithm
  • Advantages:
    • Unlike CTC, provides precise start and end timestamps
    • Unlike GMM-HMM alignment, requires no pronunciation dictionary
    • Performs token-level alignment based on E2E-ASR

2. Speech Embedding Extraction and Segmentation

  • Embedding Extraction: Uses multilingual speech representation models to extract sentence-level embeddings
  • Model Selection: Experiments with different layers of XLSR-53 and IPA-finetuned versions
  • Segmentation Strategy: Segments embeddings according to alignment results rather than audio segmentation
  • Frequency Setting: 50Hz extraction frequency (one frame per 20ms)
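
The segmentation strategy above can be sketched as follows. This is a minimal illustration, not the paper's code: the alignment format `(char, start_ms, end_ms)`, the function name `segment_embeddings`, and the rounding of boundaries to whole frames are all assumptions. Only the 20 ms frame shift comes from the text.

```python
import numpy as np

FRAME_SHIFT_MS = 20  # 50 Hz embeddings: one frame per 20 ms


def segment_embeddings(utt_emb, alignment, frame_shift_ms=FRAME_SHIFT_MS):
    """Slice utterance-level embeddings into per-character segments.

    utt_emb:   array of shape (T, D), one row per 20 ms frame
    alignment: list of (char, start_ms, end_ms) tuples (hypothetical format)
    """
    n_frames = utt_emb.shape[0]
    segments = []
    for char, start_ms, end_ms in alignment:
        # Round the start down and the end up so no aligned audio is dropped.
        first = min(start_ms // frame_shift_ms, n_frames - 1)
        last = -(-end_ms // frame_shift_ms)  # ceiling division on integers
        last = max(first + 1, min(last, n_frames))  # keep at least one frame
        segments.append((char, utt_emb[first:last]))
    return segments


# Two seconds of placeholder embeddings (100 frames, 4 dimensions).
utt = np.arange(100 * 4, dtype=float).reshape(100, 4)
ali = [("刮", 100, 340), ("瓜", 340, 600)]
segs = segment_embeddings(utt, ali)
for char, emb in segs:
    print(char, emb.shape)
```

Because the slices are taken from embeddings computed over the whole utterance, each segment still reflects its surrounding context, which is the stated reason for segmenting embeddings rather than audio.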

3. Pronunciation Correlation Calculation

  • Distance Metric: Employs Dynamic Time Warping (DTW) algorithm
  • Embedding Set Construction: Randomly selects E=100 embeddings for each character
  • Filtering Strategy: Removes characters appearing fewer than 3 times
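  • Distance Calculation:
$$\mathrm{Dist}(c_j, c_k) = \frac{1}{M N} \sum_{m=1}^{M} \sum_{n=1}^{N} \mathrm{DTW}\left(V_j^m, W_k^n\right)$$

where c_j and c_k are the j-th and k-th characters, V_j^m is the m-th embedding segment of c_j, W_k^n is the n-th embedding segment of c_k, and M and N are the numbers of embedding segments for c_j and c_k respectively.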
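
The correlation stage can be sketched in NumPy as below. This is an illustrative sketch under stated assumptions: the helper names are hypothetical, the DTW uses the classic three-step recursion with a cosine local cost and no length normalization (the summary does not say which variant the paper uses), and the toy embeddings are made up.

```python
import random

import numpy as np


def build_embedding_sets(segments, max_per_char=100, min_count=3, seed=0):
    """Group segments by character, drop rare characters, sample up to E=100."""
    by_char = {}
    for char, emb in segments:
        by_char.setdefault(char, []).append(emb)
    rng = random.Random(seed)
    return {
        c: rng.sample(embs, min(len(embs), max_per_char))
        for c, embs in by_char.items()
        if len(embs) >= min_count  # characters seen fewer than 3 times are removed
    }


def cosine_cost(a, b):
    """Pairwise cosine distance between rows of a (Ta, D) and b (Tb, D)."""
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a_n @ b_n.T


def dtw(a, b):
    """Cumulative DTW cost between two segments (unnormalized: an assumption)."""
    cost = cosine_cost(a, b)
    ta, tb = cost.shape
    acc = np.full((ta + 1, tb + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, ta + 1):
        for j in range(1, tb + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    return acc[ta, tb]


def char_distance(set_j, set_k):
    """Dist(c_j, c_k): mean DTW cost over all M x N segment pairs."""
    total = sum(dtw(v, w) for v in set_j for w in set_k)
    return total / (len(set_j) * len(set_k))


# Toy segments of different lengths; "a" frames point along one axis, "b" along another.
set_a = [np.tile([1.0, 0.0], (3, 1)), np.tile([1.0, 0.0], (5, 1))]
set_b = [np.tile([0.0, 1.0], (4, 1))]
print(char_distance(set_a, set_a), char_distance(set_a, set_b))
```

DTW is what allows segments of 3, 4, and 5 frames to be compared directly; the averaging over all pairs then smooths out variation between individual utterances of the same character.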

Technical Innovations

  1. Dictionary-Free Alignment: ITSE algorithm enables precise alignment without pronunciation dictionaries
  2. Embedding Segmentation Strategy: Performs segmentation in embedding space rather than audio space, preserving contextual information
  3. DTW Distance Metric: Effectively handles distance calculation between embeddings of different lengths
  4. Multilingual Pretraining: Leverages cross-lingual representation capabilities of multilingual models

Experimental Setup

Datasets

  1. BABEL Subset: Used for training speech representation models
    • Contains multilingual conversational telephone speech from 23 languages
    • Languages include: Cantonese, Assamese, Bengali, Pashto, etc.
  2. Aishell-2 Training Set: Used for training ITSE and generating ATPC
    • Mandarin Chinese speech corpus
    • Validates cross-lingual performance
  3. Aishell-1 Contextual Biasing Dataset: Used for evaluating ATPC effectiveness
    • Development set: 1,334 sentences, 600 hot words
    • Test set: 235 sentences, 161 hot words

Evaluation Metrics

  1. Pronunciation Discrimination Ability:
    • DTW distance between homophones and non-homophones
    • Relative Disparity
  2. Contextual Biasing Performance:
    • Character Error Rate (CER)
    • Biased Character Error Rate (B-CER)
    • Unbiased Character Error Rate (U-CER)
    • Hot word Recall/Precision/F1 score (R/P/F)
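
As a rough illustration of the contextual biasing metrics, the sketch below computes CER via Levenshtein distance and hot word recall, precision, and F1 by counting hot word occurrences. The exact scoring procedure used in the paper is not given in this summary, so the counting rule in `hotword_prf` and the example sentences are assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


def cer(refs, hyps):
    """Character error rate in percent over a list of utterances."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    chars = sum(len(r) for r in refs)
    return 100.0 * errors / chars


def hotword_prf(refs, hyps, hot_words):
    """Hot word recall, precision, and F1 from occurrence counts (assumed rule)."""
    tp = n_ref = n_hyp = 0
    for ref, hyp in zip(refs, hyps):
        for w in hot_words:
            in_ref, in_hyp = ref.count(w), hyp.count(w)
            n_ref += in_ref
            n_hyp += in_hyp
            tp += min(in_ref, in_hyp)
    recall = tp / n_ref if n_ref else 0.0
    precision = tp / n_hyp if n_hyp else 0.0
    f1 = 2 * recall * precision / (recall + precision) if tp else 0.0
    return recall, precision, f1


refs = ["今天天气很好", "我喜欢吃西瓜"]
hyps = ["今天天汽很好", "我喜欢吃西瓜"]  # one substitution in the first utterance
print(cer(refs, hyps))
print(hotword_prf(refs, hyps, ["天气", "西瓜"]))
```

B-CER and U-CER would restrict the same error count to characters inside and outside hot words respectively; that split is omitted here because it depends on alignment details the summary does not specify.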

Comparison Methods

  1. Shallow Fusion: WFST-based contextual decoding graph method
  2. Deep Biasing: Context Phrase Prediction Network (CPPN) based on AED-CTC structure
  3. Manual Dictionary: Method using hand-crafted pronunciation dictionaries

Implementation Details

  • Backbone Model: XLSR-53, finetuned on BABEL IPA recognition task
  • Embedding Layer Selection: Layer 15 embeddings show best performance
  • Distance Function: Cosine distance outperforms Euclidean distance
  • Threshold Setting: Contextual biasing threshold of 1.07
  • Matrix Scale: 3711×3711 ATPC matrix
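
This summary does not describe how the 1.07 threshold is applied during decoding. One plausible reading, shown purely as an assumption, is that characters whose ATPC distance falls below the threshold are treated as pronunciation-similar candidates for hot word matching. The function name and the toy 4×4 matrix values below are invented for illustration and are not taken from the released ATPC matrix.

```python
import numpy as np

THRESHOLD = 1.07  # value reported under Implementation Details


def similar_characters(atpc, vocab, char, threshold=THRESHOLD):
    """Return characters whose distance to `char` is below `threshold`."""
    row = atpc[vocab.index(char)]
    return [c for c, d in zip(vocab, row) if c != char and d < threshold]


vocab = ["刮", "瓜", "爱", "途"]
# Made-up symmetric distances: the homophones 刮 and 瓜 are close, others are far.
toy_atpc = np.array([
    [0.00, 0.85, 1.60, 1.75],
    [0.85, 0.00, 1.55, 1.70],
    [1.60, 1.55, 0.00, 1.90],
    [1.75, 1.70, 1.90, 0.00],
])
print(similar_characters(toy_atpc, vocab, "刮"))
print(similar_characters(toy_atpc, vocab, "爱"))
```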

Experimental Results

Main Results

Pronunciation Discrimination Ability Assessment

Model        | Euclidean Distance (Homophones / Non-homophones) | Cosine Distance (Homophones / Non-homophones) | Relative Disparity (Euclidean / Cosine)
XLSR-layer15 | 105.67 / 131.66                                  | 0.183 / 0.258                                 | 19.7% / 29.1%
IPA-layer15  | 394.47 / 499.87                                  | 0.136 / 0.191                                 | 21.1% / 28.8%

Key Findings:

  • At layer 15, the IPA-finetuned model outperforms XLSR-53 under Euclidean distance (21.1% vs. 19.7% relative disparity); under cosine distance the two are nearly tied (28.8% vs. 29.1%)
  • Layer 15 embeddings show best performance in most cases
  • Cosine distance consistently outperforms Euclidean distance
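
The relative disparity figures can be reproduced from the reported mean distances if the metric is taken to be (non-homophone − homophone) / non-homophone. That formula is inferred from the numbers in the table, not quoted from the paper, so treat it as an assumption.

```python
def relative_disparity(homophone, non_homophone):
    """Gap between non-homophone and homophone distance, in percent."""
    return 100.0 * (non_homophone - homophone) / non_homophone


# (homophone, non-homophone) mean distances as reported in the table.
reported = {
    ("XLSR-layer15", "euclidean"): (105.67, 131.66),
    ("XLSR-layer15", "cosine"): (0.183, 0.258),
    ("IPA-layer15", "euclidean"): (394.47, 499.87),
    ("IPA-layer15", "cosine"): (0.136, 0.191),
}

disparities = {
    key: round(relative_disparity(hom, non), 1)
    for key, (hom, non) in reported.items()
}
for key, value in disparities.items():
    print(key, value)
```

A larger value means homophones sit proportionally closer together than non-homophones, which is the property the ATPC matrix needs.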

Contextual Biasing Performance

Method                  | CER % (U-CER / B-CER) | F1 % (Recall / Precision)
Baseline                | 13.8 (7.3 / 41.8)     | 44 (28 / 99)
ATPC                    | 12.0 (7.3 / 32.4)     | 68 (53 / 96)
C-g + ATPC              | 10.3 (7.7 / 21.5)     | 80 (70 / 94)
C-g + Manual Dictionary | 8.9 (7.4 / 15.3)      | 86 (77 / 98)

Performance Improvements:

  • 13.0% relative CER reduction for ATPC compared to baseline (13.8 → 12.0)
  • 22.5% relative B-CER reduction (41.8 → 32.4)
  • 25-point absolute improvement in hot word recall (28 → 53)
  • 24-point absolute improvement in F1 score (44 → 68)
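
The figures above can be checked directly against the results table. The short calculation below confirms the relative CER and B-CER reductions for ATPC versus the baseline, and verifies that each reported F1 score is the harmonic mean of its recall and precision. Note that the recall and F1 gains correspond to absolute point differences in the table (28 to 53 and 44 to 68). The function names are illustrative.

```python
def relative_reduction(before, after):
    """Relative reduction in percent."""
    return 100.0 * (before - after) / before


def f1(recall, precision):
    """Harmonic mean of recall and precision."""
    return 2 * recall * precision / (recall + precision)


cer_reduction = relative_reduction(13.8, 12.0)
bcer_reduction = relative_reduction(41.8, 32.4)
print(round(cer_reduction, 1), round(bcer_reduction, 1))

# (recall, precision, reported F1) for each row of the results table.
rows = {
    "Baseline": (28, 99, 44),
    "ATPC": (53, 96, 68),
    "C-g + ATPC": (70, 94, 80),
    "C-g + Manual Dictionary": (77, 98, 86),
}
for name, (recall, precision, reported) in rows.items():
    print(name, round(f1(recall, precision)), reported)
```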

Ablation Studies

Comparison of Different Layer Embeddings

Experiments demonstrate that layer 15 embeddings achieve optimal performance in pronunciation discrimination tasks, likely because this layer achieves the best balance between acoustic features, speech characteristics, lexical identity, and lexical semantic information.

Distance Function Comparison

Cosine distance outperforms Euclidean distance across all configurations, with significant improvements in relative disparity (e.g., IPA-layer15 improves from 21.1% to 28.8%).

Case Analysis

ATPC Matrix Visualization

Visualizing the ATPC matrix shows:

  • Lower DTW distance between homophones "刮" (gua1) and "瓜" (gua1)
  • Higher DTW distance between non-homophones "爱" (ai4) and "途" (tu2)
  • The overall matrix reflects pronunciation correlations among Mandarin Chinese characters

Experimental Findings

  1. Cross-lingual Transfer Capability: Models pretrained on multilingual data effectively transfer to Mandarin Chinese
  2. Layer Representation Differences: Different layers encode different types of information, with middle layers more suitable for pronunciation modeling
  3. Distance Metric Importance: Cosine distance is more effective for capturing pronunciation similarity
  4. Practical Validation: ATPC as a plug-and-play module effectively enhances ASR performance

Related Work

Pronunciation Modeling Research

Traditional pronunciation modeling primarily relies on:

  1. GMM-HMM Systems: Require detailed pronunciation dictionaries and phoneme alignment
  2. Deep Learning Methods: Still depend on manually constructed pronunciation resources
  3. End-to-End Systems: While reducing dependence on intermediate representations, still fall short in pronunciation correlation modeling

Contextual Biasing Methods

  1. Shallow Fusion: Integrates contextual information during decoding
  2. Deep Biasing: Integrates context-aware mechanisms within the model
  3. This Work's Contribution: Provides a new approach to pronunciation correlation modeling

Speech Representation Learning

  1. Self-Supervised Learning: Models like wav2vec and XLSR provide powerful speech representations
  2. Multilingual Models: Provide foundation for cross-lingual pronunciation modeling
  3. Layer Analysis: Different layers capture information at different abstraction levels

Conclusions and Discussion

Main Conclusions

  1. Method Effectiveness: ATPC successfully achieves automatic pronunciation correlation generation without manual dictionaries
  2. Performance Improvements: Achieves a 13.0% relative CER reduction and a 22.5% relative B-CER reduction over the baseline in contextual biasing
  3. Practical Value: Provides solutions for languages/dialects lacking pronunciation resources
  4. Plug-and-Play: Easy to integrate into existing ASR systems as a module

Limitations

  1. Performance Gap: C-g + ATPC still trails C-g + Manual Dictionary (CER 10.3 vs. 8.9; B-CER 21.5 vs. 15.3)
  2. Data Dependence: Requires sufficient training data to ensure correlation quality
  3. Computational Complexity: Overhead from DTW computation and large-scale matrix storage
  4. Language Specificity: Primarily validated on Mandarin Chinese; generalization to other languages remains to be verified
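
To give a sense of scale for the computational cost, the back-of-the-envelope estimate below counts DTW calls and matrix storage for the reported 3711×3711 matrix with E=100. It assumes every character has the full 100 embeddings (an upper bound), that distances are symmetric so each unordered character pair is computed once, and that the matrix is stored densely as 32-bit floats. None of these implementation details are confirmed by the summary.

```python
VOCAB_SIZE = 3711      # ATPC matrix dimension reported in the paper
EMBEDDINGS_PER_CHAR = 100  # E, the number of sampled embeddings per character
BYTES_PER_FLOAT32 = 4  # assumed storage precision


def estimate_cost(vocab_size, per_char, bytes_per_value):
    """Return (character pairs, DTW calls, dense matrix bytes)."""
    char_pairs = vocab_size * (vocab_size - 1) // 2  # unordered, distinct pairs
    dtw_calls = char_pairs * per_char * per_char     # M x N calls per pair
    matrix_bytes = vocab_size * vocab_size * bytes_per_value
    return char_pairs, dtw_calls, matrix_bytes


char_pairs, dtw_calls, matrix_bytes = estimate_cost(
    VOCAB_SIZE, EMBEDDINGS_PER_CHAR, BYTES_PER_FLOAT32
)
print(f"character pairs: {char_pairs:,}")
print(f"DTW calls (upper bound): {dtw_calls:,}")
print(f"dense matrix size: {matrix_bytes / 1e6:.1f} MB")
```

Under these assumptions the matrix itself is small (tens of megabytes), and the dominant cost is the one-time generation step, which runs into tens of billions of DTW calls.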

Future Directions

  1. Multilingual Extension: Generate and apply ATPC across more languages and dialects
  2. OOV Handling: Address challenges in handling out-of-vocabulary characters or words
  3. Data Scale: Leverage larger datasets to enhance ATPC robustness
  4. Resource Standardization: Advance ATPC standardization as a public speech resource with continuous updates

In-Depth Evaluation

Strengths

  1. Strong Innovation: First completely data-driven approach for pronunciation correlation generation
  2. High Practical Value: Addresses real problems in resource-scarce languages
  3. Complete Methodology: Provides end-to-end solution
  4. Comprehensive Experiments: Validates method effectiveness from multiple perspectives
  5. Open-Source Contribution: Provides reproducible implementation and public resources

Weaknesses

  1. Insufficient Theoretical Analysis: Lacks in-depth theoretical explanation for why the method works
  2. Evaluation Limitations: Primarily evaluated on Mandarin Chinese; multilingual generalization not fully verified
  3. Computational Efficiency: High time complexity of DTW computation
  4. Missing Error Analysis: Lacks in-depth analysis of failure cases and error patterns

Impact

  1. Academic Contribution: Provides new research direction for pronunciation modeling
  2. Practical Application: Significant value for ASR systems in resource-scarce languages
  3. Technology Promotion: Simple and easy-to-implement method facilitates widespread adoption
  4. Resource Sharing: Open-source ATPC matrix provides valuable resource to the community

Applicable Scenarios

  1. Resource-Scarce Languages: Languages or dialects lacking pronunciation dictionaries
  2. Rapid Deployment: Scenarios requiring quick ASR system construction
  3. Contextual Biasing: Applications requiring handling of specialized vocabulary or hot words
  4. Multilingual Systems: Building unified multilingual speech processing systems

References

The paper cites 26 important references, covering:

  • Classical work in speech recognition and TTS
  • Latest advances in end-to-end ASR
  • Related research on contextual biasing
  • Cutting-edge achievements in speech representation learning
  • Important contributions to multilingual speech processing

Overall Assessment: This is a research work with significant practical value, proposing an innovative data-driven method to address the practical problem of pronunciation correlation modeling. While there is room for improvement in theoretical depth and multilingual validation, the simplicity and practicality of the method make it promising for real-world applications.