
Speech Recognition With LLMs Adapted to Disordered Speech Using Reinforcement Learning

Nagpal, Venugopalan, Tobin et al.
We introduce a large language model (LLM) capable of processing speech inputs and show that tuning it further with reinforcement learning on human preference (RLHF) enables it to adapt better to disordered speech than traditional fine-tuning. Our method replaces low-frequency text tokens in an LLM's vocabulary with audio tokens and enables the model to recognize speech by fine-tuning it on speech with transcripts. We then use RL with rewards based on syntactic and semantic accuracy measures generalizing the LLM further to recognize disordered speech. While the resulting LLM does not outperform existing systems for speech recognition, we find that tuning with reinforcement learning using custom rewards leads to substantially better performance than supervised fine-tuning of the language model, specifically when adapting to speech in a different setting. This presents a compelling alternative tuning strategy for speech recognition using large language models.

Basic Information

  • Paper ID: 2501.00039
  • Title: Speech Recognition With LLMs Adapted to Disordered Speech Using Reinforcement Learning
  • Authors: Chirag Nagpal, Subhashini Venugopalan, Jimmy Tobin, Marilyn Ladewig, Katherine Heller, Katrin Tomanek (Google Research)
  • Classification: eess.AS cs.CL cs.LG cs.SD
  • Publication Date: December 25, 2024 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2501.00039

Abstract

This paper proposes a large language model (LLM) capable of processing speech input and demonstrates that further fine-tuning through reinforcement learning from human feedback (RLHF) can better adapt to disordered speech compared to traditional supervised fine-tuning. The method replaces low-frequency text tokens in the LLM vocabulary with audio tokens, enabling speech recognition through fine-tuning on speech transcription data. Subsequently, reinforcement learning with rewards based on syntactic and semantic accuracy metrics is employed to further generalize the LLM for recognizing disordered speech. While the resulting model does not surpass existing systems in speech recognition, the study finds that reinforcement learning fine-tuning with custom rewards significantly outperforms supervised fine-tuning of language models when adapting to speech in different settings.

Research Background and Motivation

Problem Definition

This research addresses two core challenges:

  1. How to enable existing LLMs to process speech input and perform speech recognition
  2. How to effectively adapt LLM-based ASR systems to disordered speech recognition tasks

Significance

  • Multimodal Capability Extension: Enhancing LLMs' audio processing capabilities while maintaining language understanding is crucial for speech-controlled automation applications
  • Accessibility Technology: Speech recognition systems that can incorporate visual and textual context hold particular social value for individuals with speech disorders
  • Low-Resource Scenario Adaptation: Model adaptation in low-resource scenarios such as disordered speech represents an important technical challenge

Limitations of Existing Approaches

  1. Complex Architecture Modifications: Most existing work requires modifying LLM architecture or using speech encoders to extract embeddings
  2. Vocabulary Expansion Costs: Some approaches handle audio by expanding the LLM vocabulary, increasing computational costs
  3. Limited Evaluation Metrics: Traditional ASR systems primarily rely on syntactic metrics like WER, with insufficient evaluation of semantic preservation
  4. Difficulty in Disordered Speech Adaptation: Traditional fine-tuning methods show limited effectiveness in adapting to disordered speech

Core Contributions

  1. Proposed an LLM speech recognition method that needs no architecture modification: Audio tokens are mapped onto low-frequency text tokens already in the vocabulary, so the model itself is left unchanged
  2. Introduced an RLHF-based ASR domain adaptation strategy: Reinforcement learning is driven by a combined reward built from WER and meaning preservation (MP) scores
  3. Achieved substantial improvements in disordered speech recognition: On the Euphonia test set, RLHF reaches a WER of 41.0 to 42.6, compared with 57.1 for continued supervised fine-tuning
  4. Provided a new perspective on semantic preservation evaluation: Models are evaluated on both syntactic accuracy (WER) and semantic accuracy (MP)

Methodology Details

Task Definition

Input: Raw audio signals
Output: Corresponding text transcriptions
Constraints: Keep the original LLM architecture unchanged while adapting to the disordered speech domain

Model Architecture

Phase One: Building LLM Speech Recognition Capability

Audio Tokenization and Discretization:

  • Uses the USM speech encoder (trained similarly to w2v-BERT) to generate tokens at a rate of 25 Hz
  • Extracts embeddings from an intermediate layer (layer 16) and groups them into 1024 clusters
  • Maps audio embeddings to nearest cluster center IDs
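
The discretization step above can be sketched as a nearest-centroid lookup. This is only an illustration in plain Python: the function name `nearest_cluster_ids`, the use of squared Euclidean distance, and the toy 2-dimensional vectors are assumptions, not details taken from the paper.

```python
def nearest_cluster_ids(frames, centroids):
    """Map each embedding frame to the index of its closest centroid.

    Assumes squared Euclidean distance; the paper's exact distance
    measure is not stated in this summary.
    """
    ids = []
    for frame in frames:
        # Squared distance from this frame to every cluster center.
        dists = [
            sum((f - c) ** 2 for f, c in zip(frame, centroid))
            for centroid in centroids
        ]
        ids.append(dists.index(min(dists)))
    return ids


# Toy example: 3 centroids instead of 1024, 2-dim embeddings.
centroids = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
frames = [[0.1, -0.1], [4.0, 6.0], [0.9, 1.2]]
cluster_ids = nearest_cluster_ids(frames, centroids)
```

At 25 Hz, one second of audio would yield 25 such cluster IDs.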

Vocabulary Remapping:

  • Maps the 1024 audio cluster IDs to the 1024 lowest-frequency text tokens, which sit at the end of the LLM vocabulary
  • Motivation for selecting low-frequency tokens: These are typically multilingual or Unicode characters that can be repurposed as audio tokens
  • Uses standard supervised fine-tuning on ASR data with discretized audio tokens as input and text transcriptions as output
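
A minimal sketch of the remapping, assuming the repurposed tokens occupy the final 1024 IDs of the vocabulary. The exact vocabulary size (256,000), the constant names, and the function `audio_to_token_id` are hypothetical and chosen only for illustration.

```python
VOCAB_SIZE = 256_000      # assumption: "256k vocabulary" taken as exactly 256,000
NUM_AUDIO_TOKENS = 1024   # number of audio clusters


def audio_to_token_id(cluster_id, vocab_size=VOCAB_SIZE,
                      num_audio=NUM_AUDIO_TOKENS):
    """Map an audio cluster ID onto one of the last `num_audio` token IDs."""
    if not 0 <= cluster_id < num_audio:
        raise ValueError(f"cluster_id {cluster_id} outside [0, {num_audio})")
    return vocab_size - num_audio + cluster_id


def encode_audio(cluster_ids):
    """Turn a sequence of cluster IDs into LLM input token IDs."""
    return [audio_to_token_id(c) for c in cluster_ids]


token_ids = encode_audio([0, 17, 1023])
```

Because the mapping reuses existing IDs, the embedding table keeps its size and the model needs no new parameters.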

Phase Two: Domain Adaptation via RLHF

Reward Function Design:

R(x,y;y*) = γ · MP(y,y*) + ln(1 - WER(y,y*))

Where:

  • x: Original input
  • y: Predicted transcription
  • y*: Ground truth transcription
  • γ: Hyperparameter balancing WER and MP scores
  • MP: Meaning preservation score
  • WER: Word error rate
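
As a worked illustration of the reward formula, the sketch below evaluates R for given MP and WER values. Two points are assumptions rather than details from the paper: MP is treated as a score in [0, 1], and 1 - WER is floored at a small `eps`, since WER can reach or exceed 1 and the logarithm would otherwise be undefined. The name `combined_reward` is hypothetical.

```python
import math


def combined_reward(mp, wer, gamma, eps=1e-3):
    """R = gamma * MP + ln(1 - WER), with a floor on (1 - WER).

    The `eps` floor is an assumption added here to keep the log finite
    when WER >= 1; the summary does not say how the paper handles it.
    """
    accuracy = max(1.0 - wer, eps)
    return gamma * mp + math.log(accuracy)


# gamma = 0 reduces the reward to the WER term alone.
wer_only = combined_reward(mp=0.9, wer=0.2, gamma=0.0)
# gamma = 1 adds the meaning preservation score on top.
with_mp = combined_reward(mp=0.9, wer=0.2, gamma=1.0)
```

A perfect transcript (WER = 0) contributes ln(1) = 0 from the WER term, so the reward is bounded above by γ · MP.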

Semantic Preservation Reward Model:

  • Trains Gemma 2B as a binary classifier that judges whether a prediction preserves the meaning of the ground truth
  • Uses cross-entropy loss on 2,840 prediction-ground truth transcription pairs
  • Achieves 0.87 AUC on test set (compared to 0.89 AUC in reference 16)

Reinforcement Learning Optimization:

  • Uses PPO (Proximal Policy Optimization)
  • Employs gradient clipping and KL regularization
  • Selects optimal checkpoint through experiments with different γ values
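
For readers unfamiliar with PPO, the sketch below shows two standard ingredients of PPO-based RLHF in general form: a KL penalty against a reference model folded into the reward, and PPO's clipped surrogate objective. This is textbook PPO under stated assumptions, not the paper's implementation. In particular, the clipping shown is PPO's probability-ratio clipping, which may differ from the gradient clipping mentioned above, and `beta`, `clip_eps`, and both function names are hypothetical.

```python
import math


def kl_shaped_reward(task_reward, logp_policy, logp_reference, beta=0.1):
    """Subtract a KL penalty, estimated from per-token log-probs of the
    sampled sequence, from the task reward."""
    kl_estimate = sum(p - q for p, q in zip(logp_policy, logp_reference))
    return task_reward - beta * kl_estimate


def ppo_clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO clipped surrogate for a single action (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)


shaped = kl_shaped_reward(1.0, [-1.0, -2.0], [-1.5, -2.5], beta=0.1)
objective = ppo_clipped_objective(math.log(2.0), 0.0, advantage=1.0)
```

With a positive advantage the clip caps the gain from moving far away from the old policy, while the KL term discourages drift from the supervised starting point.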

Technical Innovations

  1. Audio processing without architectural changes: Avoids complex architectural modifications by reusing existing vocabulary entries
  2. Multi-objective reward function: Combines syntactic (WER) and semantic (MP) accuracy to prevent reward hacking
  3. Progressive training strategy: Supervised fine-tuning on mixed data first, then RLHF for domain adaptation
  4. Semantic preservation evaluation: Introduces semantic evaluation metrics based on human preferences

Experimental Setup

Datasets

  1. LibriSpeech:
    • 1000 hours of standard speech data
    • Clean single-speaker recordings from English audiobooks
    • Uses dev-clean split for validation
  2. Euphonia:
    • Over 1 million disordered speech utterances (~1k hours)
    • From 1,246 speakers with different speech disorders
    • Training set: 900k+ utterances; Test set: 5,699 utterances (200 speakers); Validation set: 343 utterances (24 speakers)
    • Includes severity labels annotated by speech-language pathologists

Evaluation Metrics

  • WER (Word Error Rate): Syntactic accuracy metric
  • MP (Meaning Preservation): Semantic preservation score using LLM to judge whether predicted transcription preserves original meaning
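
WER is the word-level edit distance between hypothesis and reference, divided by the number of reference words. The sketch below is a plain dynamic-programming implementation; lowercasing and stripping punctuation before scoring is an assumption about the normalization, and the function names are hypothetical. The example strings are taken from the Case Analysis section.

```python
import string


def normalize(text):
    """Lowercase, drop punctuation, split on whitespace (assumed normalization)."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


wer_a = word_error_rate("dancing is so much fun", "that's so much fun.")
wer_b = word_error_rate("dancing is so much fun", "dancing so much fun.")
```

Under this normalization the two hypotheses score 2/5 and 1/5, matching the 0.40 and 0.20 reported for that utterance.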

Comparison Methods

  • LibriSpeech Only: Training only on LibriSpeech
  • 30:70 mixture: 30% Euphonia + 70% LibriSpeech mixed training
  • Continued SFT: Continued supervised fine-tuning on disordered speech
  • RLHF variants: Reinforcement learning methods with different γ values
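
One way to realize a 30:70 mixture is to pick the source corpus per example with probability 0.30 for Euphonia. The sketch below assumes this per-example sampling interpretation, which the summary does not confirm; `sample_mixture` and its parameters are hypothetical.

```python
import random


def sample_mixture(euphonia, librispeech, n, euphonia_fraction=0.30, seed=0):
    """Draw n training examples, choosing Euphonia with the given probability."""
    rng = random.Random(seed)
    batch = []
    for _ in range(n):
        # Pick the corpus first, then a random example from it.
        source = euphonia if rng.random() < euphonia_fraction else librispeech
        batch.append(rng.choice(source))
    return batch


# Toy corpora with one labeled item each, to make the ratio easy to inspect.
batch = sample_mixture(["euphonia_utt"], ["librispeech_utt"], n=10_000)
euphonia_share = batch.count("euphonia_utt") / len(batch)
```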

Implementation Details

  • Base Model: Gemma 2B (256k vocabulary)
  • Learning Rate: 5×10^-6 with cosine decay
  • Optimizer: Adam
  • Input Dropout: 5×10^-2
  • Audio Clustering: 1024 clusters learned on LibriSpeech
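
The cosine decay schedule can be written in a few lines. This sketch assumes no warmup and decay to zero over the full run, neither of which is specified in the summary; `cosine_decay_lr` and `final_lr` are hypothetical names.

```python
import math


def cosine_decay_lr(step, total_steps, peak_lr=5e-6, final_lr=0.0):
    """Cosine decay from peak_lr at step 0 to final_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (peak_lr - final_lr) * cosine


lr_start = cosine_decay_lr(0, 1000)
lr_mid = cosine_decay_lr(500, 1000)
lr_end = cosine_decay_lr(1000, 1000)
```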

Experimental Results

Main Results

Supervised Fine-tuning Phase:

| Data Mixture Ratio | Euphonia Test WER ↓ | Euphonia Test MP ↑ | LibriSpeech Dev WER ↓ |
|---|---|---|---|
| LibriSpeech Only | 70.9 | 39.0 | 17.1 |
| 30:70 mixture | 50.4 | 48.2 | 17.2 |

The 30:70 mixture lowers Euphonia test WER from 70.9 to 50.4 and raises MP from 39.0 to 48.2, while LibriSpeech dev WER stays essentially unchanged (17.1 vs. 17.2).

RLHF Adaptation Results:

| Fine-tuning Strategy | Euphonia Test WER ↓ | Euphonia Test MP ↑ | LibriSpeech Dev WER ↓ |
|---|---|---|---|
| Base SFT model | 50.4 | 48.2 | 17.2 |
| Continued SFT | 57.1 | 42.8 | 22.9 |
| RLHF (γ=0.00) | 41.0 | 50.4 | 20.2 |
| RLHF (γ=1.00) | 42.6 | 55.7 | 22.0 |

Ablation Studies

Impact of Different γ Values:

  • γ=0.00 (WER only): Lowest WER but lower MP scores
  • γ=0.25-0.50: Balance point between WER and MP
  • γ=1.00: Highest MP scores, with a slight WER increase that is not statistically significant (p=0.54)

Severity-Level Analysis: The RLHF model shows MP score improvements across all severity levels, with more pronounced improvements on moderate and severe disordered speech.

Case Analysis

| Ground Truth | Severity | RLHF (γ=0.0) | WER | RLHF (γ=1.0) | WER |
|---|---|---|---|---|---|
| "not so good today" | MILD | "not so good to the." | 0.5 | "not so good to day." | 0.5 |
| "every one of my family listens to music" | MODERATE | "every once in my frame and listen to music" | 0.62 | "everybody in my family listens to music" | 0.38 |
| "dancing is so much fun" | MODERATE | "that's so much fun." | 0.40 | "dancing so much fun." | 0.20 |

Human Evaluation

In human evaluation of 220 samples:

  • Average human-rated meaning preservation: 29.10% for the γ=0.0 model, 40.45% for the γ=1.0 model
  • Correlation with model evaluation: Spearman correlation coefficients of 0.684 and 0.639 respectively, both statistically significant

Related Work

LLM-based ASR Research

  1. Architecture Modification Methods: Such as AudioPaLM, which implement speech processing by modifying LLM architecture
  2. Post-processing Methods: Early work primarily used LLMs to correct ASR system outputs
  3. End-to-End Methods: Recent work directly fine-tunes LLMs for speech recognition

Semantic Distance Metrics

  1. Limitations of Traditional Metrics: Syntactic metrics like WER cannot fully reflect semantic preservation
  2. BERTScore Extensions: Using pre-trained models to compute semantic similarity
  3. Human Preference Learning: Training semantic preservation judgment models based on expert annotations

Conclusions and Discussion

Main Conclusions

  1. RLHF significantly outperforms supervised fine-tuning: The RLHF method achieves substantial improvements over continued supervised fine-tuning in disordered speech adaptation tasks
  2. Effectiveness of multi-objective rewards: Reward functions combining WER and MP achieve good balance between syntactic and semantic accuracy
  3. Importance of semantic preservation: In disordered speech recognition, semantic preservation is more important than strict word matching

Limitations

  1. Overall performance constraints: This LLM method does not surpass existing specialized ASR systems
  2. Computational resource requirements: RLHF training requires additional computational resources and training time
  3. Language limitations: Experiments are conducted only in English; multilingual applicability remains unverified
  4. Model scale limitations: Experiments only on Gemma 2B; effectiveness on larger models remains unknown

Future Directions

  1. Verification on larger models: Validate method effectiveness on larger-scale LLMs
  2. Multilingual extension: Extend the method to disordered speech recognition in other languages
  3. Audio discretization improvements: Develop better audio token discretization strategies
  4. Multi-reward signal fusion: Explore possibilities of combining more reward signals

In-Depth Evaluation

Strengths

  1. Strong methodological innovation: The audio processing method without LLM architecture modification has practical value
  2. Well-designed experiments: The progressive training strategy from supervised fine-tuning to RLHF is reasonable
  3. Comprehensive evaluation system: Combines syntactic and semantic metrics with human evaluation verification
  4. Significant social value: Research on disordered speech has important social significance

Weaknesses

  1. Limited absolute performance: Relative improvements are large, but the error rate stays high, with a best Euphonia test WER of 41.0
  2. Computational efficiency concerns: RLHF method has higher computational costs compared to direct fine-tuning
  3. Insufficient generalization verification: Validation on only two datasets; generalization requires further verification
  4. Missing theoretical analysis: Lacks theoretical explanation for why RLHF is more effective in this task

Impact

  1. Technical contribution: Provides new insights for LLM application in speech recognition tasks
  2. Application value: Provides valuable technical pathways for accessibility technology development
  3. Research inspiration: Demonstrates the potential of RLHF in specialized domain adaptation

Applicable Scenarios

  1. Disordered speech assistance: Can be applied to assistive communication systems for individuals with speech disorders
  2. Multimodal dialogue systems: Suitable for application scenarios requiring simultaneous speech and text processing
  3. Low-resource speech recognition: Provides reference value for specialized speech domains with scarce training data

References

The paper cites 35 relevant references covering important work in LLM multimodal extension, speech recognition, reinforcement learning, and other domains, providing a solid theoretical foundation for the research.


Overall Assessment: This paper matters for both its technical contribution and its social value. The proposed LLM speech recognition method, which requires no architectural changes, and the RLHF domain adaptation strategy offer new perspectives for related research. Absolute performance leaves room for improvement, but the substantial gains over supervised fine-tuning on disordered speech recognition demonstrate the method's practical value.