End-to-end Automatic Speech Recognition and Speech Translation: Integration of Speech Foundational Models and LLMs

Luu, Bojar

Speech Translation (ST) is a machine translation task that involves converting speech signals from one language to the corresponding text in another language; this task has two different approaches, namely the traditional cascade and the more recent end-to-end. This paper explores a combined end-to-end architecture of pre-trained speech encoders and Large Language Models (LLMs) for performing both Automatic Speech Recognition (ASR) and ST simultaneously. Experiments with the English-to-German language pair show that our best model not only can achieve better translation results than SeamlessM4T, a large foundational end-to-end, multi-modal translation model, but can also match the performance of a cascaded system with Whisper and NLLB, with up to a score gain of 8% in $\text{COMET}^{\text{DA}}_{22}$ metric.

Basic Information

Paper ID: 2510.10329
Title: End-to-end Automatic Speech Recognition and Speech Translation: Integration of Speech Foundational Models and LLMs
Authors: Nam Luu, Ondřej Bojar (Charles University)
Category: cs.CL
Publication Date: October 11, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.10329v1

Research Background and Motivation

Problem Definition

This research addresses efficiency and performance challenges in the Speech Translation (ST) task. Speech translation converts speech in one language into text in another language, using either a cascade approach (ASR→MT) or an end-to-end approach.

Research Significance

Simplified Architecture: End-to-end methods can avoid intermediate ASR steps, simplifying the overall system architecture
Error Propagation: Cascade systems suffer from error propagation issues, where errors in the ASR stage affect subsequent translation quality
LLM Potential: Large language models demonstrate strong capabilities in natural language tasks, but their application in multimodal tasks requires further exploration

Limitations of Existing Methods

Data Scarcity: Parallel training data for speech translation is relatively scarce, particularly for low-resource languages
Model Efficiency: Existing end-to-end models face challenges in inference speed and model size
Performance Gap: End-to-end models still struggle to match cascade system performance in certain scenarios

Research Motivation

The work combines the high-quality audio representations of pre-trained speech encoders with the language processing abilities of LLMs to build an end-to-end architecture that performs ASR and ST at the same time.

Core Contributions

Proposed an end-to-end architecture integrating speech foundational models and LLMs, capable of simultaneously executing automatic speech recognition and speech translation tasks
Designed effective modality adaptation mechanisms, including two length adapters: CTC folding and convolutional downsampling
Achieved better translation performance than SeamlessM4T on the English-to-German language pair, approaching the performance of the Whisper+NLLB cascade system
Provided detailed experimental analysis, comparing the effects of different LLM and speech encoder combinations

Methodology Details

Task Definition

Input: Speech signals in the source language
Output: Simultaneously generate transcription text in the source language and translation text in the target language
Constraint: A single model is trained end to end; the transcript is generated as part of the same output sequence as the translation rather than by a separate ASR stage

Model Architecture

The overall architecture comprises four main components:

1. Speech Encoder

HuBERT: Uses the hubert-large-ls960-ft variant, pre-trained on 60,000 hours of Libri-Light data and fine-tuned on 960 hours of LibriSpeech data
Whisper Encoder: Uses the encoder portion of whisper-large-v3-turbo to extract audio hidden features

2. Length Adapter

Since speech feature sequences may exceed the maximum length supported by LLMs, compression is necessary:

CTC Folding (for HuBERT):
- Utilizes labels predicted by the CTC layer
- Averages vectors corresponding to repeated labels
- Effectively compresses sequence length while preserving semantic information
Convolutional Downsampling (for Whisper):
- Uses convolutional layers with kernel size=5, stride=5
- Directly performs 5x downsampling on feature sequences
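
The two length adapters described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `ctc_fold` and `strided_conv_length` are hypothetical, the sketch keeps every run of identical labels (whether runs of the CTC blank label are dropped is not stated here), and the convolution is assumed to use no padding and no dilation.

```python
from itertools import groupby


def ctc_fold(frames, labels):
    """Average consecutive frames that share the same predicted CTC label.

    frames: list of equal-length feature vectors (lists of floats), one per time step.
    labels: list of predicted CTC label ids, one per time step.
    Returns one averaged vector per run of identical labels.
    """
    if len(frames) != len(labels):
        raise ValueError("frames and labels must have the same length")
    folded = []
    start = 0
    for _, run in groupby(labels):
        run_len = len(list(run))
        segment = frames[start:start + run_len]
        # Element-wise mean over the frames in this run.
        folded.append([sum(col) / run_len for col in zip(*segment)])
        start += run_len
    return folded


def strided_conv_length(num_frames, kernel_size=5, stride=5):
    """Output length of a 1-D convolution (assumption: no padding, no dilation)."""
    if num_frames < kernel_size:
        return 0
    return (num_frames - kernel_size) // stride + 1


# Toy example: six 2-dimensional frames with predicted labels 7 7 0 0 0 9.
frames = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0], [0.0, 6.0], [5.0, 5.0]]
labels = [7, 7, 0, 0, 0, 9]
folded = ctc_fold(frames, labels)  # three vectors, one per run of labels
```

With kernel size 5 and stride 5, `strided_conv_length(1500)` gives 300, so a 1500-frame input (the length Whisper's encoder typically produces for a 30-second window) shrinks by a factor of five. CTC folding, by contrast, yields a data-dependent length: one vector per run of identical predicted labels.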

3. Projection Layer

Single-layer feedforward network
Maps the hidden dimension of the speech encoder to the embedding dimension of the LLM
Ensures speech representations can be effectively integrated into the LLM's embedding space
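
As a rough illustration of the projection step, the sketch below applies a single linear map to every (already length-compressed) audio frame. The dimensions are toy values, the random initialisation and the bias term are assumptions, and `make_linear` and `project` are hypothetical helper names rather than anything from the paper.

```python
import random


def make_linear(in_dim, out_dim, seed=0):
    """Create a randomly initialised weight matrix (out_dim x in_dim) and a zero bias."""
    rng = random.Random(seed)
    scale = in_dim ** -0.5
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def project(frames, weight, bias):
    """Map each encoder frame to the LLM embedding dimension: y = W x + b."""
    return [
        [sum(w * x for w, x in zip(row, frame)) + b for row, b in zip(weight, bias)]
        for frame in frames
    ]


ENCODER_DIM, LLM_DIM = 4, 6  # toy sizes; real models use much larger dimensions
weight, bias = make_linear(ENCODER_DIM, LLM_DIM)
audio_frames = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8], [0.0, 0.0, 0.0, 0.0]]
audio_embeddings = project(audio_frames, weight, bias)  # 3 vectors of size LLM_DIM
```

The sequence length is unchanged by this step; only the feature dimension changes, so the projected vectors can sit alongside ordinary token embeddings in the LLM input.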

4. Large Language Models

Experiments were conducted with four different pre-trained LLMs:

Gemma 7B (gemma-7b)
Gemma 2 9B (gemma-2-9b)
Llama 2 7B (Llama-2-7b-hf)
Mistral 7B v0.1 (Mistral-7B-v0.1)

Technical Innovations

Unified Multi-task Learning Framework: Enables simultaneous training and inference of ASR and ST through special delimiter tokens
Modality Adaptation Strategy: Designs specialized length compression methods for different speech encoders
Efficient Fine-tuning: Employs QLoRA (Quantized Low-Rank Adaptation) technology for parameter-efficient fine-tuning
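
For reference, the standard LoRA update that QLoRA builds on can be written as below. The notation is the usual one from the LoRA literature and is not taken from this paper: $W_0$ is the frozen pre-trained weight (stored in 4-bit precision under QLoRA), $B$ and $A$ are the trainable low-rank factors, $r$ is the rank, and $\alpha$ is the scaling hyperparameter.

$$
h = W_0 x + \frac{\alpha}{r} B A x, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)
$$

Only $A$ and $B$ receive gradients, which is what makes the fine-tuning parameter-efficient. With the rank and alpha values listed under Implementation Details (both 8), the scaling factor $\alpha / r$ equals 1.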

Training Strategy

Data Format

<bos> <>audio<> {audio features} <>transcript<> {transcript} <>translation<> {translation} <eos>

Loss Computation

Cross-entropy loss is computed only for tokens following <>transcript<>, so the audio positions and the tags before them do not contribute to the loss
The model is trained with a next-token-prediction objective
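
The masking rule can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' training code: the tag string is copied verbatim from the data format above, `[a1]` and `[a2]` are hypothetical placeholders for audio feature positions, and for simplicity the mask is indexed by target token (in a real next-token setup, the logits at position t-1 are scored against token t).

```python
import math

TRANSCRIPT_TAG = "<>transcript<>"  # copied verbatim from the data format line


def loss_mask(tokens, tag=TRANSCRIPT_TAG):
    """Return 1 for positions after the first `tag`, 0 for the tag and everything before it."""
    start = tokens.index(tag) + 1
    return [1 if i >= start else 0 for i in range(len(tokens))]


def masked_cross_entropy(logits, targets, mask):
    """Mean negative log-likelihood over positions where mask == 1.

    logits: one list of unnormalised scores per position.
    targets: gold token id per position.
    """
    total, count = 0.0, 0
    for scores, target, keep in zip(logits, targets, mask):
        if not keep:
            continue
        # Numerically stable log-softmax: log(sum(exp(s))) computed around the max score.
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[target]
        count += 1
    return total / count if count else 0.0


tokens = ["<bos>", "<>audio<>", "[a1]", "[a2]", "<>transcript<>",
          "hello", "<>translation<>", "hallo", "<eos>"]
mask = loss_mask(tokens)  # zeros up to and including the transcript tag, ones afterwards
```

Note that under this rule the transcript, the translation tag, the translation, and the end-of-sequence token are all supervised, which is what lets a single forward pass learn both ASR and ST.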

Inference Format

<bos> <>audio<> {audio features} <>transcript<>

The model autoregressively generates transcription and translation text.
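
Because both outputs arrive in one generated string, a small post-processing step is needed to separate them. The sketch below assumes the generation has already been decoded to text and uses the tag strings exactly as they appear in the formats above; `split_generation` is a hypothetical helper name.

```python
TRANSLATION_TAG = "<>translation<>"
EOS = "<eos>"


def split_generation(generated):
    """Split text generated after the transcript tag into (transcript, translation)."""
    # Discard anything produced after the end-of-sequence marker.
    text = generated.split(EOS, 1)[0]
    transcript, found, translation = text.partition(TRANSLATION_TAG)
    if not found:
        # The model never emitted the translation tag; treat everything as transcript.
        return transcript.strip(), None
    return transcript.strip(), translation.strip()


asr, st = split_generation(" good morning <>translation<> guten Morgen <eos>")
```

Here `asr` is the English transcript and `st` the German translation; a missing translation tag is surfaced as `None` rather than silently returning an empty string.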

Experimental Setup

Datasets

Training Data: MuST-C v1.0 English-German subset, approximately 400 hours of audio data
Test Data:
- MuST-C tst-COMMON v2.0 and v3.0
- IWSLT'21 and '22 offline track test sets
- LibriSpeech test-clean and test-other (for ASR evaluation)

Evaluation Metrics

Speech Translation: BLEU, $\text{COMET}^{\text{DA}}_{22}$, $\text{COMET}^{\text{KIWI-DA}}_{22}$
Speech Recognition: WER (Word Error Rate)
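
As a reminder of how WER is defined, the sketch below computes it as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. Text normalisation such as casing and punctuation handling is not specified here and is left out; the function name is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) against six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

In this example the result is 2/6, or about 33.3% when expressed as a percentage as in the ASR results.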

Baseline Methods

Cascade System: Whisper (whisper-large-v3-turbo) + NLLB (nllb-200-3.3B)
End-to-end Baseline: SeamlessM4T (seamless-m4t-v2-large)

Implementation Details

Fine-tuning Method: 4-bit QLoRA, bfloat16 precision
LoRA Parameters: rank=8, alpha=8
Batch Size: 1 for HuBERT models, 2 for Whisper models
Optimizer: AdamW, learning rate 1e-4, cosine scheduler
Training Steps: 500,000 steps for HuBERT models, 100,000 steps for Whisper models
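
To make the learning-rate setting concrete, here is a minimal sketch of a cosine schedule with the peak rate of 1e-4 quoted above. The warmup length and the final learning rate are not given here, so `warmup_steps=0` and `min_lr=0.0` are assumptions, and `cosine_lr` is a hypothetical helper rather than the scheduler implementation used in the paper.

```python
import math


def cosine_lr(step, total_steps, peak_lr=1e-4, min_lr=0.0, warmup_steps=0):
    """Learning rate at `step` under optional linear warmup followed by cosine decay."""
    if warmup_steps and step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(max(progress, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 100_000  # step budget quoted above for the Whisper-based models
schedule = [cosine_lr(s, TOTAL_STEPS) for s in (0, 50_000, 100_000)]
```

Under these assumptions the rate starts at 1e-4, passes through 5e-5 at the halfway point, and reaches zero at the final step.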

Experimental Results

Main Results

ASR Performance (WER %)

Model	MuST-C v2	MuST-C v3	IWSLT'22	LibriSpeech clean	LibriSpeech other
Whisper	6.7	7.7	11.8	4.1	7.2
Whisper enc. + Gemma 2 9B	8.2	8.1	22.6	8.0	13.7
HuBERT + Gemma 2 9B	11.1	12.5	21.9	8.4	13.1

Speech Translation Performance (BLEU Scores)

Model	MuST-C v2	MuST-C v3	IWSLT'21	IWSLT'22
Whisper + NLLB	39.84/31.06	40.30/31.60	43.84/-	41.86/30.48
SeamlessM4T	32.62/22.98	33.36/23.59	35.97/-	34.08/22.68
Whisper enc. + Gemma 2 9B	41.33/31.98	41.16/31.72	40.76/-	39.64/29.18

COMET Performance

The best model (Whisper enc. + Gemma 2 9B) on the $\text{COMET}^{\text{DA}}_{22}$ metric:

MuST-C v2: 84.22 (vs 83.00 cascade system)
MuST-C v3: 83.65 (vs 82.49 cascade system)
vs SeamlessM4T: a gain of up to about 8% on this metric
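
The size of the gap to the cascade system can be worked out directly from the two score pairs listed above. The snippet below is plain arithmetic on those numbers; the SeamlessM4T scores for this metric are not listed in this summary, so the corresponding gain cannot be recomputed here.

```python
# Scores copied from the two lines above (best end-to-end model vs cascade system).
comet_da = {
    "MuST-C v2": {"end_to_end": 84.22, "cascade": 83.00},
    "MuST-C v3": {"end_to_end": 83.65, "cascade": 82.49},
}


def gains(scores):
    """Return (absolute gain in points, relative gain in percent) over the cascade."""
    absolute = scores["end_to_end"] - scores["cascade"]
    relative = 100.0 * absolute / scores["cascade"]
    return round(absolute, 2), round(relative, 2)


summary = {name: gains(scores) for name, scores in comet_da.items()}
```

This gives a gain of 1.22 points (about 1.47% relative) on MuST-C v2 and 1.16 points (about 1.41% relative) on MuST-C v3.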

Ablation Study Findings

LLM Selection: Gemma 2 9B demonstrates the best performance across all tests
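Encoder Comparison: The Whisper encoder generally outperforms HuBERT, although in the ASR table above HuBERT + Gemma 2 9B has a slightly lower WER on IWSLT'22 (21.9 vs 22.6) and LibriSpeech test-other (13.1 vs 13.7)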
Adapter Effectiveness: Both CTC folding and convolutional downsampling effectively compress sequence length

Experimental Findings

End-to-end vs Cascade: The best end-to-end model exceeds the cascade system on both MuST-C test sets but trails it in BLEU on the IWSLT'21 and IWSLT'22 test sets
Model Scale: The largest LLM tested (Gemma 2 9B) yields the best performance
Speech Representation: The quality of pre-trained speech encoders directly impacts final performance

Speech Translation Research Directions

Cascade Methods: Traditional ASR+MT pipeline, still the mainstream approach
End-to-end Methods: Direct conversion from speech to target language text, avoiding intermediate representations
Multimodal LLMs: Recent research extending LLMs to other modalities such as speech

Positioning of This Work

Unified Framework: Addresses both ASR and ST tasks simultaneously, rather than optimizing for a single task
Modular Design: Allows flexible replacement of different speech encoder and LLM components
Practicality: Provides an end-to-end solution while maintaining competitive performance

Conclusions and Discussion

Main Conclusions

An end-to-end architecture integrating pre-trained speech encoders and LLMs achieves competitive performance on English-German speech translation tasks
The best model not only surpasses SeamlessM4T but also approaches the performance of the Whisper+NLLB cascade system
The model can simultaneously execute ASR and ST tasks, providing a unified solution

Limitations

Data Constraints: Validated only on the high-resource English-German language pair; effectiveness on low-resource languages remains unknown
Computational Efficiency: Slower inference speed and larger model size compared to baseline models
ASR Performance: Still lags behind specialized Whisper models in speech recognition tasks
Training Data: The MuST-C training set is relatively small (approximately 400 hours), which may limit what the model can achieve

Future Directions

Language Pair Expansion: Validate effectiveness across more language directions
Model Compression: Reduce model size through techniques such as knowledge distillation
Adapter Improvements: Explore more advanced modality adaptation methods such as Q-Former
Reinforcement Learning: Integrate RL techniques to further optimize performance

In-depth Evaluation

Strengths

Innovative Architecture: Effectively combines the advantages of speech foundational models and LLMs
Comprehensive Experiments: Systematic comparison of various encoder and LLM combinations
Practical Value: Provides a unified end-to-end solution
Technical Details: Detailed description of modality adaptation and training strategies
Openness: Uses open-source models, facilitating reproducibility

Weaknesses

Language Coverage: Validation only on a single English-German language pair, limited generalizability
Computational Cost: Lacks detailed analysis of training and inference computational overhead
Error Analysis: Insufficient in-depth analysis of model failure cases
Theoretical Analysis: Lacks theoretical explanation for why this architecture is effective
Data Sensitivity: Insufficient analysis of model sensitivity to training data scale

Impact

Academic Contribution: Provides a new end-to-end solution for the speech translation field
Practical Value: Applicable to real-world multilingual speech processing systems
Reproducibility: Uses open-source components, facilitating subsequent research
Inspirational Value: Provides valuable exploration for multimodal LLM applications

Applicable Scenarios

Multilingual Conferences: Real-time speech translation and transcription
Educational Platforms: Automatic subtitles and translation for multilingual online courses
Customer Service: Cross-lingual speech interaction systems
Media Processing: Automatic transcription and translation of audio content

References

The paper cites important works in speech translation, large language models, and multimodal learning, including:

Whisper (Radford et al., 2022): Powerful speech recognition foundational model
SeamlessM4T (Seamless Communication et al., 2023): Multimodal translation model baseline
MuST-C (Cattoni et al., 2021): Standard speech translation dataset
QLoRA (Dettmers et al., 2023): Parameter-efficient fine-tuning technique

This paper proposes a promising end-to-end solution in the speech translation field. While there remains room for improvement in certain aspects, it provides valuable exploration and empirical results for multimodal LLM applications.