Speech Recognition With LLMs Adapted to Disordered Speech Using Reinforcement Learning
Nagpal, Venugopalan, Tobin et al.
We introduce a large language model (LLM) capable of processing speech inputs and show that tuning it further with reinforcement learning on human preference (RLHF) enables it to adapt better to disordered speech than traditional fine-tuning. Our method replaces low-frequency text tokens in an LLM's vocabulary with audio tokens and enables the model to recognize speech by fine-tuning it on speech with transcripts. We then use RL with rewards based on syntactic and semantic accuracy measures, generalizing the LLM further to recognize disordered speech. While the resulting LLM does not outperform existing systems for speech recognition, we find that tuning with reinforcement learning using custom rewards leads to substantially better performance than supervised fine-tuning of the language model, specifically when adapting to speech in a different setting. This presents a compelling alternative tuning strategy for speech recognition using large language models.
This paper proposes a large language model (LLM) capable of processing speech input and demonstrates that further tuning with reinforcement learning from human feedback (RLHF) adapts the model to disordered speech better than traditional supervised fine-tuning. The method replaces low-frequency text tokens in the LLM vocabulary with audio tokens, and the model learns to recognize speech by fine-tuning on speech paired with transcripts. Reinforcement learning with rewards based on syntactic and semantic accuracy metrics is then used to generalize the LLM further to disordered speech. While the resulting model does not surpass existing speech recognition systems, the study finds that reinforcement learning with custom rewards performs substantially better than supervised fine-tuning of the language model when adapting to speech in a different setting.
Multimodal Capability Extension: Enhancing LLMs' audio processing capabilities while maintaining language understanding is crucial for speech-controlled automation applications
Accessibility Technology: Speech recognition systems that can incorporate visual and textual context hold particular social value for individuals with speech disorders
Low-Resource Scenario Adaptation: Model adaptation in low-resource scenarios such as disordered speech represents an important technical challenge
Complex Architecture Modifications: Most existing work requires modifying LLM architecture or using speech encoders to extract embeddings
Vocabulary Expansion Costs: Some approaches handle audio by expanding the LLM vocabulary, increasing computational costs
Limited Evaluation Metrics: Traditional ASR systems primarily rely on syntactic metrics like WER, with insufficient evaluation of semantic preservation
Difficulty in Disordered Speech Adaptation: Traditional fine-tuning methods show limited effectiveness in adapting to disordered speech
Proposed an LLM speech recognition method that needs no architecture modification: Audio tokens are mapped onto low-frequency text tokens in the existing vocabulary, so neither the vocabulary size nor the model architecture changes
Introduced RLHF-based ASR domain adaptation strategy: Using combined rewards of WER and meaning preservation (MP) scores for reinforcement learning optimization
Achieved substantial improvements in disordered speech recognition: Compared to supervised fine-tuning, the RLHF method performed substantially better on the Euphonia dataset
Provided new perspectives on semantic preservation evaluation: Comprehensive evaluation combining syntactic accuracy (WER) and semantic accuracy (MP)
Input: Raw audio signals
Output: Corresponding text transcriptions
Constraints: Maintain the original LLM architecture unchanged while adapting to disordered speech domain
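The vocabulary-reuse idea described above (audio tokens taking over the slots of low-frequency text tokens, leaving the architecture unchanged) can be illustrated with a minimal sketch. The function names, the toy frequency counts, and the tie-breaking rule are assumptions made for illustration only; the paper's actual token selection and audio tokenizer are not specified here.

```python
def build_audio_token_map(token_counts, num_audio_tokens):
    """Assign each audio token index to one of the rarest text token IDs.

    token_counts: dict mapping text token ID -> corpus frequency (toy data here).
    Hypothetical helper, not the paper's implementation.
    """
    if num_audio_tokens > len(token_counts):
        raise ValueError("more audio tokens than vocabulary entries")
    # Rarest tokens first; ties broken by token ID so the result is deterministic.
    rare_ids = sorted(token_counts, key=lambda tid: (token_counts[tid], tid))
    return {audio_idx: text_id
            for audio_idx, text_id in enumerate(rare_ids[:num_audio_tokens])}


def encode_audio(audio_tokens, audio_map):
    """Rewrite a sequence of audio token indices as existing vocabulary IDs."""
    return [audio_map[a] for a in audio_tokens]


# Toy vocabulary of six text tokens with made-up frequencies.
token_counts = {0: 900, 1: 3, 2: 450, 3: 1, 4: 7, 5: 120}
audio_map = build_audio_token_map(token_counts, num_audio_tokens=3)
llm_input_ids = encode_audio([2, 0, 0, 1], audio_map)
```

Because the audio sequence is expressed entirely in IDs the model already has, the embedding table and output layer keep their original shapes; only the meaning of the reused rows changes during fine-tuning.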
γ=1.00: Highest MP scores, with a slight WER increase that is not statistically significant (p=0.54)
Severity-Level Analysis:
The RLHF model shows MP score improvements across all severity levels, with more pronounced improvements on moderate and severe disordered speech.
RLHF substantially outperforms supervised fine-tuning: The RLHF method achieves substantial improvements over continued supervised fine-tuning in disordered speech adaptation tasks
Effectiveness of multi-objective rewards: Reward functions combining WER and MP achieve good balance between syntactic and semantic accuracy
Importance of semantic preservation: In disordered speech recognition, semantic preservation is more important than strict word matching
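To make the interplay between the two reward terms concrete, here is a small sketch that computes word error rate (WER) with the standard word-level edit distance and combines it with a meaning preservation (MP) score. The linear form `gamma * mp_score - wer` is an assumption chosen for illustration, not the paper's reward formula, and the MP score is assumed to be supplied externally as a number in [0, 1].

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def combined_reward(reference, hypothesis, mp_score, gamma=1.0):
    """Hypothetical reward: rises with the MP score, falls with WER."""
    return gamma * mp_score - word_error_rate(reference, hypothesis)


wer = word_error_rate("turn on the light", "turn the light")
reward = combined_reward("turn on the light", "turn the light", mp_score=1.0)
```

In this toy case the hypothesis drops one of four reference words, so it is penalized on WER even though a listener would likely judge the meaning preserved. A larger `gamma` shifts the balance toward meaning preservation.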
The paper cites 35 relevant references covering important work in LLM multimodal extension, speech recognition, reinforcement learning, and other domains, providing a solid theoretical foundation for the research.
Overall Assessment: This paper is significant for both its technical innovation and its social value. The proposed LLM speech recognition method, which requires no architecture modification, and the RLHF domain adaptation strategy provide new perspectives for related research. While absolute performance still trails existing speech recognition systems, the substantial improvement over supervised fine-tuning on disordered speech demonstrates the practical value of the method.