Speech Recognition With LLMs Adapted to Disordered Speech Using Reinforcement Learning

Nagpal, Venugopalan, Tobin et al.

We introduce a large language model (LLM) capable of processing speech inputs and show that tuning it further with reinforcement learning on human preference (RLHF) enables it to adapt better to disordered speech than traditional fine-tuning. Our method replaces low-frequency text tokens in an LLM's vocabulary with audio tokens and enables the model to recognize speech by fine-tuning it on speech with transcripts. We then use RL with rewards based on syntactic and semantic accuracy measures generalizing the LLM further to recognize disordered speech. While the resulting LLM does not outperform existing systems for speech recognition, we find that tuning with reinforcement learning using custom rewards leads to substantially better performance than supervised fine-tuning of the language model, specifically when adapting to speech in a different setting. This presents a compelling alternative tuning strategy for speech recognition using large language models.

Basic Information

Paper ID: 2501.00039
Title: Speech Recognition With LLMs Adapted to Disordered Speech Using Reinforcement Learning
Authors: Chirag Nagpal, Subhashini Venugopalan, Jimmy Tobin, Marilyn Ladewig, Katherine Heller, Katrin Tomanek (Google Research)
Classification: eess.AS cs.CL cs.LG cs.SD
Publication Date: December 25, 2024 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2501.00039

Abstract

This paper proposes a large language model (LLM) capable of processing speech input and demonstrates that further tuning it with reinforcement learning on human preference (RLHF) adapts it to disordered speech better than traditional supervised fine-tuning does. The method replaces low-frequency text tokens in the LLM vocabulary with audio tokens, and the model learns speech recognition through fine-tuning on speech with transcripts. Reinforcement learning with rewards based on syntactic and semantic accuracy metrics is then used to generalize the LLM further to disordered speech. While the resulting model does not surpass existing speech recognition systems, the study finds that reinforcement learning with custom rewards performs substantially better than supervised fine-tuning of the language model, specifically when adapting to speech in a different setting.

Research Background and Motivation

Problem Definition

This research addresses two core challenges:

How to enable existing LLMs to process speech input and perform speech recognition
How to effectively adapt LLM-based ASR systems to disordered speech recognition tasks

Significance

Multimodal Capability Extension: Enhancing LLMs' audio processing capabilities while maintaining language understanding is crucial for speech-controlled automation applications
Accessibility Technology: Speech recognition systems that can incorporate visual and textual context hold particular social value for individuals with speech disorders
Low-Resource Scenario Adaptation: Model adaptation in low-resource scenarios such as disordered speech represents an important technical challenge

Limitations of Existing Approaches

Complex Architecture Modifications: Most existing work requires modifying LLM architecture or using speech encoders to extract embeddings
Vocabulary Expansion Costs: Some approaches handle audio by expanding the LLM vocabulary, increasing computational costs
Limited Evaluation Metrics: Traditional ASR systems primarily rely on syntactic metrics like WER, with insufficient evaluation of semantic preservation
Difficulty in Disordered Speech Adaptation: Traditional fine-tuning methods show limited effectiveness in adapting to disordered speech

Core Contributions

LLM speech recognition without architecture modification: Audio tokens are mapped onto low-frequency text tokens in the existing vocabulary, so the model architecture stays unchanged
RLHF-based domain adaptation for ASR: Reinforcement learning optimizes a reward that combines WER with a meaning preservation (MP) score
Improved disordered speech recognition: On the Euphonia test set, RLHF reaches a WER of 41.0 (γ=0.00) or 42.6 (γ=1.00), versus 57.1 for continued supervised fine-tuning
A semantic view of evaluation: Results are reported with both syntactic accuracy (WER) and semantic accuracy (MP)

Methodology Details

Task Definition

Input: Raw audio signals
Output: Corresponding text transcriptions
Constraints: Keep the original LLM architecture unchanged while adapting to the disordered speech domain

Model Architecture

Phase One: Building LLM Speech Recognition Capability

Audio Tokenization and Discretization:

Uses the USM speech encoder (trained similarly to w2v-BERT) to produce audio tokens at a rate of 25 Hz
Extracts embeddings from an intermediate layer (layer 16) and groups them into 1024 clusters
Maps audio embeddings to nearest cluster center IDs
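
The nearest-cluster assignment described above can be sketched in a few lines. This is a minimal illustration only: it assumes Euclidean distance and uses random toy centroids of dimension 8, whereas the paper's embeddings come from the USM encoder. The function name quantize_to_cluster_ids is hypothetical.

```python
import math
import random

def quantize_to_cluster_ids(embeddings, centroids):
    """Return, for each frame embedding, the index of its nearest centroid."""
    ids = []
    for frame in embeddings:
        # Euclidean distance is an assumption; the summary only says "nearest".
        nearest = min(range(len(centroids)),
                      key=lambda k: math.dist(frame, centroids[k]))
        ids.append(nearest)
    return ids

rng = random.Random(0)
DIM = 8  # toy embedding size, not the real encoder dimension
centroids = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(1024)]

# Frames placed almost exactly on centroids 3, 500 and 1023 map back to them.
frames = [[v + 1e-6 for v in centroids[k]] for k in (3, 500, 1023)]
audio_ids = quantize_to_cluster_ids(frames, centroids)
```

At 25 Hz, each second of audio yields 25 such cluster IDs.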

Vocabulary Remapping:

Maps the 1024 audio cluster IDs to the last 1024 (lowest-frequency) text tokens in the LLM vocabulary
Motivation for selecting low-frequency tokens: These are typically multilingual or Unicode characters that can be repurposed as audio tokens
Uses standard supervised fine-tuning on ASR data with discretized audio tokens as input and text transcriptions as output
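
A minimal sketch of the remapping, assuming the vocabulary holds exactly 256,000 entries and that the lowest-frequency tokens occupy the final IDs. The offset scheme (cluster 0 maps to the first ID of that final block) and the helper names are illustrative assumptions, not details confirmed by the paper.

```python
VOCAB_SIZE = 256_000        # assumed exact size of the "256k" vocabulary
NUM_AUDIO_CLUSTERS = 1024   # number of audio clusters from the paper

def audio_id_to_token_id(cluster_id: int) -> int:
    """Map an audio cluster ID onto one of the last 1024 vocabulary IDs."""
    if not 0 <= cluster_id < NUM_AUDIO_CLUSTERS:
        raise ValueError(f"cluster_id out of range: {cluster_id}")
    # Assumed ordering: cluster 0 takes the first ID of the final block.
    return VOCAB_SIZE - NUM_AUDIO_CLUSTERS + cluster_id

def audio_ids_to_token_ids(cluster_ids):
    """Convert a sequence of cluster IDs into LLM input token IDs."""
    return [audio_id_to_token_id(c) for c in cluster_ids]

token_ids = audio_ids_to_token_ids([0, 17, 1023])
```

Because the audio tokens reuse existing vocabulary slots, the embedding table and output layer keep their original shapes, which is what lets the approach avoid architecture changes.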

Phase Two: Domain Adaptation via RLHF

Reward Function Design:

R(x,y;y*) = γ · MP(y,y*) + ln(1 - WER(y,y*))

Where:

x: Original input
y: Predicted transcription
y*: Ground truth transcription
γ: Hyperparameter balancing WER and MP scores
MP: Meaning preservation score
WER: Word error rate
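
The reward can be written as a small function. The sketch below assumes MP is a score in [0, 1] and clips WER just below 1, because ln(1 - WER) is undefined once WER reaches 1 (WER can exceed 1 when the hypothesis has many insertions). The clipping and the eps value are assumptions added for numerical safety, not part of the paper's stated formula.

```python
import math

def reward(wer: float, mp: float, gamma: float, eps: float = 1e-6) -> float:
    """R = gamma * MP + ln(1 - WER), with WER clipped into [0, 1 - eps]."""
    # Clipping is an assumption: the formula itself is undefined for WER >= 1.
    clipped_wer = min(max(wer, 0.0), 1.0 - eps)
    return gamma * mp + math.log(1.0 - clipped_wer)

perfect = reward(wer=0.0, mp=1.0, gamma=1.0)    # ln(1) = 0, so reward = gamma * MP
wer_only = reward(wer=0.5, mp=0.9, gamma=0.0)   # gamma = 0 ignores the MP term
```

With γ=0 the reward depends on WER alone, which corresponds to the RLHF (γ=0.00) setting reported in the results.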

Semantic Preservation Reward Model:

Trains Gemma 2B as a binary classifier that predicts whether a transcription preserves the meaning of the ground truth
Uses cross-entropy loss on 2,840 pairs of predicted and ground truth transcriptions
Achieves 0.87 AUC on the test set (compared to 0.89 AUC in reference 16)

Reinforcement Learning Optimization:

Uses PPO (Proximal Policy Optimization)
Employs gradient clipping and KL regularization
Selects optimal checkpoint through experiments with different γ values
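
For readers unfamiliar with PPO, the sketch below shows the standard clipped surrogate objective and one common way of applying a KL penalty against a reference model. These are textbook forms; the summary does not give the paper's exact configuration, so clip_eps, beta, and both function names are hypothetical.

```python
import math

def ppo_clipped_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A), r = exp(logp_new - logp_old)."""
    terms = []
    for new, old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(new - old)
        clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Taking the minimum keeps the update pessimistic for large policy shifts.
        terms.append(min(ratio * adv, clipped_ratio * adv))
    return sum(terms) / len(terms)

def kl_penalized_reward(task_reward, logp_policy, logp_ref, beta=0.1):
    """Subtract beta times a sampled KL estimate (sum of token log-ratios)."""
    kl_estimate = sum(p - r for p, r in zip(logp_policy, logp_ref))
    return task_reward - beta * kl_estimate

unchanged = ppo_clipped_objective([0.0, 0.0], [0.0, 0.0], [1.0, 3.0])
capped = ppo_clipped_objective([math.log(2.0)], [0.0], [1.0])
```

When the policy has not moved, the objective is just the mean advantage; when the probability ratio doubles, the positive-advantage term is capped at 1 + clip_eps.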

Technical Innovations

Audio processing without architecture changes: Avoids complex architectural modifications by reusing the existing vocabulary
Multi-objective reward function: Combines syntactic (WER) and semantic (MP) accuracy to prevent reward hacking
Progressive training strategy: Supervised fine-tuning on mixed data first, then RLHF for domain adaptation
Semantic preservation evaluation: Introduces semantic evaluation metrics based on human preferences

Experimental Setup

Datasets

LibriSpeech:
- 1000 hours of standard speech data
- Clean single-speaker recordings from English audiobooks
- Uses dev-clean split for validation
Euphonia:
- Over 1 million disordered speech utterances (~1k hours)
- From 1,246 speakers with different speech disorders
- Training set: 900k+ utterances; Test set: 5,699 utterances (200 speakers); Validation set: 343 utterances (24 speakers)
- Includes severity labels annotated by speech-language pathologists

Evaluation Metrics

WER (Word Error Rate): Syntactic accuracy metric
MP (Meaning Preservation): Semantic preservation score that uses an LLM to judge whether the predicted transcription preserves the original meaning

Comparison Methods

LibriSpeech Only: Training only on LibriSpeech
30:70 mixture: 30% Euphonia + 70% LibriSpeech mixed training
Continued SFT: Continued supervised fine-tuning on disordered speech
RLHF variants: Reinforcement learning methods with different γ values
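
One way to realize a 30:70 mixture is to choose the source corpus for each training example with a fixed probability. The sketch below assumes per-example sampling; the summary does not say whether the ratio is enforced per example, per batch, or through dataset weighting, and sample_mixture and its arguments are hypothetical names.

```python
import random

def sample_mixture(euphonia, librispeech, n, euphonia_fraction=0.3, seed=0):
    """Draw n examples, taking each from Euphonia with probability euphonia_fraction."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        corpus = euphonia if rng.random() < euphonia_fraction else librispeech
        examples.append(rng.choice(corpus))
    return examples

# Toy stand-ins for the two corpora: (corpus name, utterance index) pairs.
euphonia = [("euphonia", i) for i in range(100)]
librispeech = [("librispeech", i) for i in range(100)]

mixed = sample_mixture(euphonia, librispeech, n=10_000)
euphonia_share = sum(1 for name, _ in mixed if name == "euphonia") / len(mixed)
```

Over many draws the Euphonia share approaches 0.3, while any single batch can deviate from that ratio.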

Implementation Details

Base Model: Gemma 2B (256k vocabulary)
Learning Rate: 5×10^-6 with cosine decay
Optimizer: Adam
Input Dropout: 5×10^-2
Audio Clustering: 1024 clusters learned on LibriSpeech

Experimental Results

Main Results

Supervised Fine-tuning Phase:

Data Mixture Ratio	Euphonia Test WER↓	Euphonia Test MP↑	LibriSpeech Dev WER↓
LibriSpeech Only	70.9	39.0	17.1
30:70 mixture	50.4	48.2	17.2

With the 30:70 mixture, Euphonia test WER falls from 70.9 to 50.4 and MP rises from 39.0 to 48.2, while LibriSpeech dev WER is essentially unchanged (17.1 vs. 17.2).

RLHF Adaptation Results:

Fine-tuning Strategy	Euphonia Test WER↓	Euphonia Test MP↑	LibriSpeech Dev WER↓
Base SFT model	50.4	48.2	17.2
Continued SFT	57.1	42.8	22.9
RLHF (γ=0.00)	41.0	50.4	20.2
RLHF (γ=1.00)	42.6	55.7	22.0

Ablation Studies

Impact of Different γ Values:

γ=0.00 (WER only): Lowest WER but lower MP scores
γ=0.25-0.50: Balance point between WER and MP
γ=1.00: Highest MP scores, with a slight WER increase that is not statistically significant (p=0.54)

Severity-Level Analysis: The RLHF model shows MP score improvements across all severity levels, with more pronounced improvements on moderate and severe disordered speech.

Case Analysis

Ground Truth	Severity	RLHF (γ=0.0) Output	WER (γ=0.0)	RLHF (γ=1.0) Output	WER (γ=1.0)
"not so good today"	MILD	"not so good to the."	0.5	"not so good to day."	0.5
"every one of my family listens to music"	MODERATE	"every once in my frame and listen to music"	0.62	"everybody in my family listens to music"	0.38
"dancing is so much fun"	MODERATE	"that's so much fun."	0.40	"dancing so much fun."	0.20
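
The WER values in this table can be reproduced with a word-level edit distance. The sketch below assumes that text is lowercased and stripped of punctuation before scoring; that normalization is an assumption, although it does reproduce the table's numbers. The function names are illustrative.

```python
import string

def normalize(text):
    """Lowercase, drop ASCII punctuation, and split into words (assumed normalization)."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()

def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance divided by reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)

mild = wer("not so good today", "not so good to day.")
moderate = wer("every one of my family listens to music",
               "everybody in my family listens to music")
```

For the second example, the three edits ("every" to "everybody", "one" to "in", and the deleted "of") over eight reference words give 0.375, which the table rounds to 0.38.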

Human Evaluation

In human evaluation of 220 samples:

Average semantic preservation evaluation: 29.10% for γ=0.0 model, 40.45% for γ=1.0 model
Correlation with model evaluation: Spearman correlation coefficients of 0.684 and 0.639 respectively, both statistically significant
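
A Spearman coefficient is the Pearson correlation of ranks, with tied values sharing their average rank. The sketch below shows that computation on made-up toy scores; the data are hypothetical and are not the paper's 220 samples.

```python
import math

def average_ranks(values):
    """1-based ranks, where tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Pearson correlation computed on the average ranks of xs and ys."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: binary human judgments vs. model MP scores.
human = [0, 1, 1, 0, 1]
model = [0.2, 0.9, 0.6, 0.4, 0.7]
rho = spearman(human, model)
```

Rank-based correlation suits this comparison because the human labels and the model scores need not be on the same scale.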

Related Work

LLM-based ASR Research

Architecture Modification Methods: Such as AudioPaLM, which implement speech processing by modifying LLM architecture
Post-processing Methods: Early work primarily used LLMs to correct ASR system outputs
End-to-End Methods: Recent work directly fine-tunes LLMs for speech recognition

Semantic Distance Metrics

Limitations of Traditional Metrics: Syntactic metrics like WER cannot fully reflect semantic preservation
BERTScore Extensions: Using pre-trained models to compute semantic similarity
Human Preference Learning: Training semantic preservation judgment models based on expert annotations

Conclusions and Discussion

Main Conclusions

RLHF substantially outperforms supervised fine-tuning: On the Euphonia test set, RLHF reaches a WER of 41.0 (γ=0.00) or 42.6 (γ=1.00), whereas continued supervised fine-tuning raises WER from the base SFT model's 50.4 to 57.1
Effectiveness of multi-objective rewards: Reward functions combining WER and MP achieve good balance between syntactic and semantic accuracy
Importance of semantic preservation: In disordered speech recognition, semantic preservation is more important than strict word matching

Limitations

Overall performance constraints: This LLM method does not surpass existing specialized ASR systems
Computational resource requirements: RLHF training requires additional computational resources and training time
Language limitations: Experiments are conducted only in English; multilingual applicability remains unverified
Model scale limitations: Experiments only on Gemma 2B; effectiveness on larger models remains unknown

Future Directions

Verification on larger models: Validate method effectiveness on larger-scale LLMs
Multilingual extension: Extend the method to disordered speech recognition in other languages
Audio discretization improvements: Develop better audio token discretization strategies
Multi-reward signal fusion: Explore possibilities of combining more reward signals

In-Depth Evaluation

Strengths

Strong methodological innovation: The audio processing method without LLM architecture modification has practical value
Well-designed experiments: The progressive training strategy from supervised fine-tuning to RLHF is reasonable
Comprehensive evaluation system: Combines syntactic and semantic metrics with human evaluation verification
Significant social value: Research on disordered speech has important social significance

Weaknesses

Limited absolute performance: Although the relative gains are large, the best Euphonia test WER is still 41.0, which leaves considerable room for improvement
Computational efficiency concerns: RLHF method has higher computational costs compared to direct fine-tuning
Insufficient generalization verification: Validation on only two datasets; generalization requires further verification
Missing theoretical analysis: Lacks theoretical explanation for why RLHF is more effective in this task

Impact

Technical contribution: Provides new insights for LLM application in speech recognition tasks
Application value: Provides valuable technical pathways for accessibility technology development
Research inspiration: Demonstrates the potential of RLHF in specialized domain adaptation

Applicable Scenarios

Disordered speech assistance: Can be applied to assistive communication systems for individuals with speech disorders
Multimodal dialogue systems: Suitable for application scenarios requiring simultaneous speech and text processing
Low-resource speech recognition: Provides reference value for specialized speech domains with scarce training data

References

The paper cites 35 relevant references covering important work in LLM multimodal extension, speech recognition, reinforcement learning, and other domains, providing a solid theoretical foundation for the research.

Overall Assessment: This paper is significant for both its technical approach and its social value. The proposed LLM speech recognition method, which requires no architecture changes, and the RLHF domain adaptation strategy offer new perspectives for related research. Although absolute performance leaves room for improvement, the substantial gains on disordered speech recognition, an important application scenario, demonstrate the practical value of the method.