CTC compressor can be an effective approach to integrate audio encoders to decoder-only models, which has gained growing interest for different speech applications. In this work, we propose a novel CTC compressor based joint speech and text training (CJST) framework for decoder-only ASR. CJST matches speech and text modalities from both directions by exploring a simple modality adaptor and several features of the CTC compressor, including sequence compression, on-the-fly forced peaky alignment and CTC class embeddings. Experimental results on the Librispeech and TED-LIUM2 corpora show that the proposed CJST achieves an effective text injection without the need of duration handling, leading to the best performance for both in-domain and cross-domain scenarios. We also provide a comprehensive study on CTC compressor, covering various compression modes, edge case handling and behavior under both clean and noisy data conditions, which reveals the most robust setting to use CTC compressor for decoder-only models.
CJST: CTC Compressor based Joint Speech and Text Training for Decoder-Only ASR
- Paper ID: 2411.07607
- Title: CJST: CTC Compressor based Joint Speech and Text Training for Decoder-Only ASR
- Authors: Wei Zhou, Junteng Jia, Leda Sari, Jay Mahadeokar, Ozlem Kalinli (Meta AI)
- Categories: eess.AS cs.LG cs.SD
- Publication Date: November 2024 (arXiv preprint)
- Paper Link: https://arxiv.org/abs/2411.07607
CTC compressors have emerged as an effective method for integrating audio encoders into decoder-only models, gaining increasing attention across various speech applications. This paper proposes a novel CTC compressor-based joint speech and text training (CJST) framework for decoder-only ASR. CJST matches the speech and text modalities from both directions by exploring a simple modality adaptor and several properties of the CTC compressor, including sequence compression, on-the-fly forced peaky alignment, and CTC class embeddings. Experimental results on the Librispeech and TED-LIUM2 corpora show that the proposed CJST achieves effective text injection without requiring duration handling, attaining the best performance in both in-domain and cross-domain scenarios.
With the tremendous success of large language models (LLMs), decoder-only architectures have been widely adopted in various speech applications. However, how to effectively integrate speech information into decoder-only models and how to conduct joint speech-text training to enhance ASR performance remain challenging problems.
- Integration Challenge: Effectively integrating continuous acoustic embeddings into decoder-only models requires appropriate adapter methods
- Modality Matching: Speech and text modalities exhibit significant differences in sequence length and representation space, requiring effective alignment mechanisms
- Text Injection: Production-level ASR models need a way to exploit text-only data effectively without relying on an external language model
- Simple Adapters: Traditional temporal reduction layers plus linear projection methods lack content-aware compression capabilities
- RNN-T Methods: Existing joint training methods primarily target RNN-T models and require complex duration handling
- CTC Compressor Sensitivity: Existing CTC compressor methods show unstable performance on noisy data
- Proposes CJST Framework: A novel joint speech-text training framework based on CTC compressors, achieving bidirectional modality matching
- Extends CTC Compressor: Comprehensive investigation of various compression modes, edge case handling, and behavior on clean/noisy data
- Duration-Free Processing: Achieves effective text injection through on-the-fly forced peaky alignment and CTC class embeddings, without requiring complex duration modeling
- Performance Improvement: Achieves the best performance in both in-domain and cross-domain scenarios, with approximately 6% relative improvement over baselines
This paper investigates automatic speech recognition with decoder-only architectures, taking speech feature sequences as input and producing the corresponding text transcriptions as output. It also considers how to use both paired speech-text data and text-only data for joint training.
The paper investigates four CTC compressor compression modes:
- Blank Prediction Removal: Based on greedy CTC predictions, removes all blank frames
- Same Prediction Averaging: Averages adjacent frames with identical predictions
- Blank Probability Removal: Removes all frames where blank probability exceeds a predefined threshold
- Combination Mode: First applies blank probability removal, then applies same prediction averaging
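The four modes above can be sketched in NumPy as follows. This is a minimal illustration and not the authors' implementation: the blank index `BLANK = 0`, the function names, and the toy inputs `h` (encoder outputs, shape `[T, D]`) and `probs` (per-frame CTC posteriors, shape `[T, V]`) are all assumptions made for the example.

```python
import numpy as np

BLANK = 0  # assumed index of the CTC blank class


def remove_blank_predictions(h, probs):
    """Drop frames whose greedy (argmax) CTC prediction is blank."""
    keep = probs.argmax(axis=-1) != BLANK
    return h[keep]


def average_same_predictions(h, probs):
    """Average each run of adjacent frames that share the same greedy prediction."""
    pred = probs.argmax(axis=-1)
    if len(pred) == 0:
        return h
    # Frame indices at which a new run of identical predictions starts.
    starts = np.flatnonzero(np.r_[True, pred[1:] != pred[:-1]])
    ends = np.r_[starts[1:], len(pred)]
    return np.stack([h[s:e].mean(axis=0) for s, e in zip(starts, ends)])


def remove_blank_probability(h, probs, threshold=0.95):
    """Drop frames whose blank probability exceeds the threshold."""
    keep = probs[:, BLANK] <= threshold
    return h[keep]


def combined(h, probs, threshold=0.95):
    """Blank probability removal followed by same prediction averaging."""
    keep = probs[:, BLANK] <= threshold
    return average_same_predictions(h[keep], probs[keep])


# Toy example: 6 frames, 2-dim embeddings, 3 CTC classes (class 0 is blank).
h = np.arange(12, dtype=float).reshape(6, 2)
probs = np.array([
    [0.99, 0.005, 0.005],  # confident blank
    [0.10, 0.80, 0.10],    # class 1
    [0.20, 0.70, 0.10],    # class 1 again
    [0.97, 0.02, 0.01],    # confident blank
    [0.60, 0.10, 0.30],    # greedy blank, but blank probability below 0.95
    [0.05, 0.05, 0.90],    # class 2
])
```

Note how the fifth frame separates the modes: its greedy prediction is blank, so blank prediction removal discards it, while blank probability removal with a 0.95 threshold keeps it. A high threshold therefore removes only frames the CTC head is very sure about.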
To address the issue of CTC compressors potentially producing empty outputs, two solutions are proposed:
- Empty Skip: Skips such utterances during training and outputs EOS directly during inference
- Empty Fallback: Averages all encoder outputs into a single frame, then proceeds with normal training and inference
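The empty fallback scheme is simple enough to show directly. The sketch below assumes the uncompressed encoder outputs `h` have shape `[T, D]` and the compressor output `h_compressed` has shape `[T', D]`; the function name is hypothetical.

```python
import numpy as np


def empty_fallback(h, h_compressed):
    """Return the compressed frames, or a single mean frame if none survived."""
    if h_compressed.shape[0] == 0:
        # Average all encoder outputs into one frame so the decoder
        # still receives a non-empty acoustic prompt.
        return h.mean(axis=0, keepdims=True)
    return h_compressed
```

With this in place, training and inference see at least one acoustic frame for every utterance, so no special case is needed downstream.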
The paper also explores sharing the CTC class embeddings with the text embeddings, so that the CTC objective function brings the audio encoder outputs closer to the text embeddings.
For paired speech-text data:
- Conducts regular ASR training through model forward propagation
- Uses the compressed acoustic embeddings h' and CTC probabilities for forced peaky alignment
- Trains the modality adaptor via an MSE loss that aligns h' with the pseudo-acoustic embeddings h'_text
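One plausible reading of the forced alignment step is a Viterbi search over the compressed frames in which every target label is emitted on exactly one frame and all remaining frames are blank. The sketch below implements that reading only; the exact algorithm used in the paper is not described here, and `BLANK = 0` and the function name are assumptions.

```python
import numpy as np

BLANK = 0  # assumed index of the CTC blank class


def forced_peaky_alignment(log_probs, labels):
    """Best frame-level alignment where each label occupies exactly one frame.

    log_probs: [T, V] per-frame CTC log-probabilities.
    labels:    target label ids (no blanks), with len(labels) <= T.
    Returns a list of T ids: the labels in order, padded with blanks.
    """
    T, U = log_probs.shape[0], len(labels)
    if T < U:
        raise ValueError("need at least one frame per label")
    # score[t, u]: best log-probability of t frames having emitted u labels.
    score = np.full((T + 1, U + 1), -np.inf)
    score[0, 0] = 0.0
    took_label = np.zeros((T + 1, U + 1), dtype=bool)
    for t in range(1, T + 1):
        for u in range(min(t, U) + 1):
            stay = score[t - 1, u] + log_probs[t - 1, BLANK]
            move = -np.inf
            if u > 0:
                move = score[t - 1, u - 1] + log_probs[t - 1, labels[u - 1]]
            if move > stay:
                score[t, u] = move
                took_label[t, u] = True
            else:
                score[t, u] = stay
    # Backtrace from the state where all frames and all labels are consumed.
    alignment, u = [], U
    for t in range(T, 0, -1):
        if took_label[t, u]:
            alignment.append(labels[u - 1])
            u -= 1
        else:
            alignment.append(BLANK)
    return alignment[::-1]


example_log_probs = np.log(np.array([
    [0.1, 0.8, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
]))
example_alignment = forced_peaky_alignment(example_log_probs, [1, 2])
```

In the toy example the middle frame is most likely blank, so the search places label 1 on the first frame and label 2 on the last. An alignment of this kind gives one target position per compressed frame, which is what a frame-wise MSE between h' and h'_text needs.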
For text-only data:
- Randomly inserts blank symbols based on recorded length ratio R_len(h', y)
- Generates pseudo-acoustic prompts h'_text through the CTC class embeddings and the modality adaptor
- Trains the decoder model using the ASR objective function
- Applies 20% random masking to h'_text to maintain learning difficulty
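The blank insertion and masking steps for text-only data might look like the sketch below. Several details are assumptions for illustration: the length ratio is taken to be (compressed frames) / (text tokens), blanks are placed uniformly at random, and masking is represented as a boolean vector without fixing what a masked frame is replaced with. The function names are hypothetical.

```python
import random

BLANK = 0  # assumed index of the CTC blank class


def insert_blanks(tokens, len_ratio, rng):
    """Randomly insert blanks until the sequence has about len_ratio * len(tokens) items."""
    target_len = max(len(tokens), round(len_ratio * len(tokens)))
    seq = list(tokens)
    for _ in range(target_len - len(tokens)):
        # randint is inclusive on both ends, so blanks may also be appended.
        seq.insert(rng.randint(0, len(seq)), BLANK)
    return seq


def mask_positions(length, mask_prob, rng):
    """Choose which pseudo-acoustic frames to mask, each independently with mask_prob."""
    return [rng.random() < mask_prob for _ in range(length)]


rng = random.Random(0)
padded = insert_blanks([5, 6, 7], len_ratio=2.0, rng=rng)
mask = mask_positions(len(padded), mask_prob=0.2, rng=rng)
```

The padded sequence would then be mapped through the CTC class embeddings and the modality adaptor to form h'_text, with the masked positions corrupted so the decoder cannot simply copy the text.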
A single Conformer layer serves as the modality adaptor, with one attention head, a convolution kernel size of 3, and no dimension expansion in the feed-forward module.
- Librispeech: 960 hours of clean speech data
- Internal Data: 2M hours of acoustically diverse data, including speed perturbation, simulated reverberation, and random background noise
- Text Data: LM training text data from Librispeech and TED-LIUM2
- Decoder: 12-layer LLaMA decoder, 768 hidden dimensions, 12 attention heads
- Audio Encoder: 24-layer Conformer, 512 hidden dimensions, 8 attention heads
- Vocabulary: 4k SentencePiece units per dataset
- Audio encoder pre-training: 200k steps
- Full model training: 200k steps on Librispeech, 500k steps on internal data
- Joint training: speech and text loss weights both set to 1.0
- Auxiliary CTC loss weight: 0.5
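Read together, the weights above suggest an overall objective of roughly the following form. This is a sketch under assumptions: that all terms are summed into a single loss, and that the MSE term for the modality adaptor carries a weight \lambda_{\text{MSE}}, whose value is not given here and is therefore left symbolic.

```latex
\mathcal{L}
  = 1.0 \cdot \mathcal{L}_{\text{ASR}}^{\text{speech}}
  + 1.0 \cdot \mathcal{L}_{\text{ASR}}^{\text{text}}
  + 0.5 \cdot \mathcal{L}_{\text{CTC}}
  + \lambda_{\text{MSE}} \cdot \mathcal{L}_{\text{MSE}}
```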
Word Error Rate (WER) is used as the primary evaluation metric, reported on test sets.
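For reference, WER is the word-level edit distance between hypothesis and reference, divided by the number of reference words. The sketch below is a plain dynamic-programming version with whitespace tokenization; the scoring tool and text normalization used in the paper are not specified here, so those choices are assumptions.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
            ))
        prev = cur
    return prev[-1] / len(ref)


example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

In the example, one substitution ("sat" to "sit") and one deletion ("the") against six reference words give a WER of 2/6, or about 33.3%.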
- All CTC compressor methods outperform simple adapter methods
- Blank probability removal (threshold 0.95) achieves best performance: test-clean 2.17%, test-other 4.94%
- Embedding sharing helps in some cases but lacks consistency
- Greedy prediction-based methods perform poorly on noisy data
- Blank probability removal (threshold 0.95) is most robust: 12.85% WER
- Empty fallback scheme outperforms empty skip scheme
Results on Librispeech:
- Baseline adapter: test-clean 3.38%, test-other 5.63%
- LM-like text injection: test-clean 2.54%, test-other 5.26%
- CJST: test-clean 2.09%, test-other 4.71%
Using in-domain and cross-domain text data:
- CJST achieves best performance across all scenarios
- Cross-domain TED-LIUM2 test set: reduced from 11.45% to 10.14%
- Approximately 6% relative improvement over baseline
- Blank probability removal is the most robust compression mode
- LM-like training is already quite effective, serving as a strong baseline
- CJST provides further improvements across all scenarios
- CTC compressor is sensitive to data quality, requiring appropriate configuration
- Early work uses simple adapters to integrate audio encoders
- Recent research explores discrete audio token approaches
- This paper focuses on ASR tasks with continuous representations
- Originally used for attention-based speech translation
- Extended to speech translation in decoder-only models
- This paper is the first to systematically investigate its application in ASR
- Traditional methods primarily target RNN-T models
- Include JOIST, textogram, MAESTRO, and other methods
- This paper is the first to propose an effective solution for decoder-only ASR
- CJST Framework is Effective: Achieves effective text injection through bidirectional modality matching
- CTC Compressor Configuration is Critical: Blank probability removal (high threshold) is most robust
- Duration-Free Processing: Avoids complex duration modeling through forced alignment and CTC embeddings
- Consistent Improvement: Achieves significant improvements in both in-domain and cross-domain scenarios
- Computational Overhead: Online forced alignment increases training computational cost
- Data Dependency: CTC compressor performance is highly dependent on data quality
- Parameter Sensitivity: Requires careful tuning of hyperparameters such as blank probability threshold
- Evaluation Scope: Primarily evaluated on English data; multilingual generalization remains unknown
- Explore more efficient online alignment methods
- Investigate performance in multilingual and low-resource scenarios
- Explore hybrid approaches that combine the method with discrete audio tokens
- Optimize robustness of CTC compressor
- Methodological Innovation: First application of CTC compressor to joint speech-text training for decoder-only ASR
- Systematic Investigation: Comprehensive experimental analysis of CTC compressor
- Practical Value: Duration-free processing reduces implementation complexity
- Sufficient Experimentation: Validates method effectiveness across multiple datasets and scenarios
- Clear Writing: Well-structured paper with detailed technical descriptions
- Insufficient Theoretical Analysis: Lacks in-depth theoretical analysis of why CJST is effective
- Computational Cost: Lacks detailed analysis of training and inference computational overhead
- Hyperparameter Sensitivity: Method involves multiple hyperparameters with complex tuning requirements
- Evaluation Limitations: Primarily evaluated on English data; lacks multilingual validation
- Academic Contribution: Provides new insights for text injection in decoder-only ASR
- Practical Value: Relatively simple method, easy to deploy in production environments
- Reproducibility: Provides detailed implementation details and hyperparameter settings
- Inspirational Value: Offers valuable insights for further research on CTC compressors
- Production-Level ASR: Suitable for scenarios where external language models cannot be used
- Cross-Domain Adaptation: Particularly suitable for applications requiring rapid domain adaptation
- Resource-Constrained Settings: More efficient than complex duration modeling methods
- Joint Training: Suitable for scenarios with abundant text data but relatively limited speech data
The paper cites 32 relevant references, covering important works in large language models, decoder-only architectures, CTC methods, speech recognition, and joint training, providing a solid theoretical foundation for the research.
Overall Assessment: This is a high-quality technical paper proposing an innovative CJST framework that addresses the important problem of joint speech-text training in decoder-only ASR. The experimental design is comprehensive, results are convincing, and the work has significant academic and practical value for the field.