End-to-end Automatic Speech Recognition and Speech Translation: Integration of Speech Foundational Models and LLMs

Luu, Bojar

Basic Information

  • Paper ID: 2510.10329
  • Title: End-to-end Automatic Speech Recognition and Speech Translation: Integration of Speech Foundational Models and LLMs
  • Authors: Nam Luu, Ondřej Bojar (Charles University)
  • Category: cs.CL
  • Publication Date: October 11, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.10329v1

Abstract

Speech Translation (ST) is a machine translation task that involves converting speech signals from one language to the corresponding text in another language; this task has two different approaches, namely the traditional cascade and the more recent end-to-end. This paper explores a combined end-to-end architecture of pre-trained speech encoders and Large Language Models (LLMs) for performing both Automatic Speech Recognition (ASR) and ST simultaneously. Experiments with the English-to-German language pair show that our best model not only can achieve better translation results than SeamlessM4T, a large foundational end-to-end, multi-modal translation model, but can also match the performance of a cascaded system with Whisper and NLLB, with up to a score gain of 8% in $\text{COMET}^{\text{DA}}_{22}$ metric.

Research Background and Motivation

Problem Definition

This research addresses efficiency and performance challenges in the Speech Translation (ST) task, which converts speech in one language into text in another. Two approaches exist: the traditional cascade (ASR→MT) and the more recent end-to-end approach.

Research Significance

  1. Simplified Architecture: End-to-end methods can avoid intermediate ASR steps, simplifying the overall system architecture
  2. Error Propagation: Cascade systems suffer from error propagation issues, where errors in the ASR stage affect subsequent translation quality
  3. LLM Potential: Large language models demonstrate strong capabilities in natural language tasks, but their application in multimodal tasks requires further exploration

Limitations of Existing Methods

  1. Data Scarcity: Parallel training data for speech translation is relatively scarce, particularly for low-resource languages
  2. Model Efficiency: Existing end-to-end models face challenges in inference speed and model size
  3. Performance Gap: End-to-end models still struggle to match cascade system performance in certain scenarios

Research Motivation

The work combines the high-quality audio representations of pre-trained speech encoders with the language processing abilities of LLMs to build an end-to-end architecture that performs ASR and ST simultaneously.

Core Contributions

  1. Proposed an end-to-end architecture integrating speech foundational models and LLMs, capable of simultaneously executing automatic speech recognition and speech translation tasks
  2. Designed effective modality adaptation mechanisms, including two length adapters: CTC folding and convolutional downsampling
  3. Achieved superior translation performance compared to SeamlessM4T on English-German language pairs, approaching the performance of the Whisper+NLLB cascade system
  4. Provided detailed experimental analysis, comparing the effects of different LLM and speech encoder combinations

Methodology Details

Task Definition

  • Input: Speech signals in the source language
  • Output: Simultaneously generate transcription text in the source language and translation text in the target language
  • Constraint: A single model trained end to end with one next-token-prediction objective; no separately trained ASR or MT stage

Model Architecture

The overall architecture comprises four main components:

1. Speech Encoder

  • HuBERT: Uses the hubert-large-ls960-ft variant, pre-trained on 60,000 hours of Libri-Light data and fine-tuned on 960 hours of LibriSpeech data
  • Whisper Encoder: Uses the encoder portion of whisper-large-v3-turbo to extract audio hidden features

2. Length Adapter

Since speech feature sequences may exceed the maximum length supported by LLMs, compression is necessary:

  • CTC Folding (for HuBERT):
    • Utilizes labels predicted by the CTC layer
    • Averages vectors corresponding to repeated labels
    • Effectively compresses sequence length while preserving semantic information
  • Convolutional Downsampling (for Whisper):
    • Uses convolutional layers with kernel size=5, stride=5
    • Directly performs 5x downsampling on feature sequences
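
As a rough illustration of CTC folding, the sketch below averages consecutive encoder frames that share the same predicted CTC label. It is a minimal pure-Python version under stated assumptions: `ctc_fold` is a hypothetical helper name, frames are plain lists of floats, and blank-labelled frames are kept as their own group (the summary does not say whether the paper drops them).

```python
from itertools import groupby


def ctc_fold(frames, labels):
    """Average consecutive frames that share the same CTC label.

    frames: list of equal-length feature vectors, one per encoder frame
    labels: list of per-frame label ids (e.g. argmax of the CTC head)
    """
    if len(frames) != len(labels):
        raise ValueError("frames and labels must have the same length")
    folded = []
    start = 0
    for _, run in groupby(labels):
        run_len = len(list(run))
        segment = frames[start:start + run_len]
        # Element-wise mean over all frames in this run of identical labels.
        folded.append([sum(column) / run_len for column in zip(*segment)])
        start += run_len
    return folded


# Six frames with labels 7,7 | 0 | 3,3,3 fold into three vectors.
frames = [[1.0, 0.0], [3.0, 2.0], [5.0, 5.0], [0.0, 4.0], [2.0, 6.0], [4.0, 8.0]]
labels = [7, 7, 0, 3, 3, 3]
folded = ctc_fold(frames, labels)
```

The output length equals the number of label runs, so the compression ratio depends on the utterance, unlike the fixed 5x ratio of the convolutional adapter.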

3. Projection Layer

  • Single-layer feedforward network
  • Maps the hidden dimension of the speech encoder to the embedding dimension of the LLM
  • Ensures speech representations can be effectively integrated into the LLM's embedding space
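
The shape bookkeeping for the stride-5 adapter and the projection layer can be sketched as follows. This is an illustrative sketch, not the authors' code: the 1500-frame input is an assumption (the number of positions Whisper's encoder typically produces for a 30-second window), the dimensions `enc_dim` and `llm_dim` are toy values, and the weights are random.

```python
import random


def conv_output_length(num_frames, kernel_size=5, stride=5, padding=0):
    """Output length of a 1-D convolution without dilation."""
    return (num_frames + 2 * padding - kernel_size) // stride + 1


def project(frames, weight, bias):
    """Single linear layer applied to every frame: y = W x + b."""
    return [
        [sum(w * x for w, x in zip(row, frame)) + b for row, b in zip(weight, bias)]
        for frame in frames
    ]


random.seed(0)
enc_dim, llm_dim = 4, 6                 # toy sizes, not the real hidden dimensions
num_frames = conv_output_length(1500)   # assumed 1500 encoder frames before downsampling

# Stand-in for the downsampled encoder features (num_frames x enc_dim).
frames = [[random.random() for _ in range(enc_dim)] for _ in range(num_frames)]
weight = [[random.gauss(0.0, 0.02) for _ in range(enc_dim)] for _ in range(llm_dim)]
bias = [0.0] * llm_dim

# Result has one llm_dim-sized vector per downsampled frame.
projected = project(frames, weight, bias)
```

With kernel size and stride both set to 5, the windows do not overlap, so each output frame summarizes exactly five input frames.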

4. Large Language Models

Experiments were conducted with four different pre-trained LLMs:

  • Gemma 7B (gemma-7b)
  • Gemma 2 9B (gemma-2-9b)
  • Llama 2 7B (Llama-2-7b-hf)
  • Mistral 7B v0.1 (Mistral-7B-v0.1)

Technical Innovations

  1. Unified Multi-task Learning Framework: Enables simultaneous training and inference of ASR and ST through special delimiter tokens
  2. Modality Adaptation Strategy: Designs specialized length compression methods for different speech encoders
  3. Efficient Fine-tuning: Employs QLoRA (Quantized Low-Rank Adaptation) technology for parameter-efficient fine-tuning

Training Strategy

Data Format

<bos> <>audio<> {audio features} <>transcript<> {transcript} <>translation<> {translation} <eos>

Loss Computation

  • Cross-entropy loss is computed only for tokens following <>transcript<>
  • Trained using next-token-prediction approach
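
A minimal sketch of how such a training sequence and its loss mask might be assembled is given below. The delimiter strings are copied as they appear in this summary; `build_example` and the `<audio_frame>` placeholder are hypothetical, and in a real model the audio positions would hold projected encoder vectors rather than tokens. The sketch assumes the delimiter `<>transcript<>` itself is not supervised, only what follows it.

```python
AUDIO = "<>audio<>"
TRANSCRIPT = "<>transcript<>"
TRANSLATION = "<>translation<>"


def build_example(num_audio_frames, transcript_tokens, translation_tokens):
    """Return the training sequence and a 0/1 mask marking supervised positions."""
    sequence = (
        ["<bos>", AUDIO]
        + ["<audio_frame>"] * num_audio_frames   # placeholder for audio embeddings
        + [TRANSCRIPT]
        + list(transcript_tokens)
        + [TRANSLATION]
        + list(translation_tokens)
        + ["<eos>"]
    )
    boundary = sequence.index(TRANSCRIPT)
    # Loss applies only to positions strictly after the transcript delimiter.
    loss_mask = [1 if i > boundary else 0 for i in range(len(sequence))]
    return sequence, loss_mask


sequence, loss_mask = build_example(3, ["hello", "world"], ["hallo", "welt"])
supervised = [tok for tok, keep in zip(sequence, loss_mask) if keep]
```

Under this masking, the translation delimiter and `<eos>` are supervised, which is what lets the model decide on its own when to switch from transcript to translation and when to stop.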

Inference Format

<bos> <>audio<> {audio features} <>transcript<>

The model autoregressively generates transcription and translation text.
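
Because both outputs arrive in one generated string, a caller has to split them apart afterwards. The helper below is one possible way to do that, assuming the decoded text still contains the delimiter strings shown in this summary; `split_generation` is a hypothetical name, not part of the paper.

```python
def split_generation(generated_text, translation_marker="<>translation<>", eos="<eos>"):
    """Split decoded output into (transcript, translation).

    Returns None for the translation if the model never emitted the marker.
    """
    text = generated_text.split(eos, 1)[0]   # discard anything after end-of-sequence
    transcript, marker, translation = text.partition(translation_marker)
    if not marker:
        return transcript.strip(), None
    return transcript.strip(), translation.strip()


transcript, translation = split_generation(" hello world <>translation<> hallo welt <eos>")
```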

Experimental Setup

Datasets

  • Training Data: MuST-C v1.0 English-German subset, approximately 400 hours of audio data
  • Test Data:
    • MuST-C tst-COMMON v2.0 and v3.0
    • IWSLT'21 and '22 offline track test sets
    • LibriSpeech test-clean and test-other (for ASR evaluation)

Evaluation Metrics

  • Speech Translation: BLEU, $\text{COMET}^{\text{DA}}_{22}$, $\text{COMET}^{\text{KIWI-DA}}_{22}$
  • Speech Recognition: WER (Word Error Rate)
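
For reference, WER is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The sketch below computes it with a standard dynamic-programming edit distance; it splits on whitespace only, and any text normalization the paper applies before scoring is not stated in this summary and is not modelled here.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```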

Baseline Methods

  • Cascade System: Whisper (whisper-large-v3-turbo) + NLLB (nllb-200-3.3B)
  • End-to-end Baseline: SeamlessM4T (seamless-m4t-v2-large)

Implementation Details

  • Fine-tuning Method: 4-bit QLoRA, bfloat16 precision
  • LoRA Parameters: rank=8, alpha=8
  • Batch Size: 1 for HuBERT models, 2 for Whisper models
  • Optimizer: AdamW, learning rate 1e-4, cosine scheduler
  • Training Steps: 500,000 steps for HuBERT models, 100,000 steps for Whisper models
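
To make the schedule concrete, the sketch below evaluates a cosine learning-rate curve for the stated peak of 1e-4 and the LoRA scaling factor alpha/rank. It assumes no warmup and decay to zero over the full run, neither of which is specified in this summary; `cosine_lr` is a hypothetical helper, not a library function.

```python
import math


def cosine_lr(step, total_steps, peak_lr=1e-4, min_lr=0.0):
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# LoRA scales its low-rank update by alpha / rank; with rank=8 and alpha=8 this is 1.
lora_scale = 8 / 8

whisper_steps = 100_000   # training steps reported for the Whisper-based models
lr_start = cosine_lr(0, whisper_steps)
lr_middle = cosine_lr(whisper_steps // 2, whisper_steps)
lr_end = cosine_lr(whisper_steps, whisper_steps)
```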

Experimental Results

Main Results

ASR Performance (WER %)

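| Model | MuST-C v2 | MuST-C v3 | IWSLT'22 | LibriSpeech clean | LibriSpeech other |
|---|---|---|---|---|---|
| Whisper | 6.7 | 7.7 | 11.8 | 4.1 | 7.2 |
| Whisper enc. + Gemma 2 9B | 8.2 | 8.1 | 22.6 | 8.0 | 13.7 |
| HuBERT + Gemma 2 9B | 11.1 | 12.5 | 21.9 | 8.4 | 13.1 |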

Speech Translation Performance (BLEU Scores)

| Model | MuST-C v2 | MuST-C v3 | IWSLT'21 | IWSLT'22 |
|---|---|---|---|---|
| Whisper + NLLB | 39.84/31.06 | 40.30/31.60 | 43.84/- | 41.86/30.48 |
| SeamlessM4T | 32.62/22.98 | 33.36/23.59 | 35.97/- | 34.08/22.68 |
| Whisper enc. + Gemma 2 9B | 41.33/31.98 | 41.16/31.72 | 40.76/- | 39.64/29.18 |
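
The gaps between systems are easier to read as differences. The short script below recomputes them from the first BLEU figure of each pair in the table above; the dictionary layout and the `bleu_gap` helper are illustrative only.

```python
# First BLEU figure of each pair, copied from the table above.
bleu = {
    "Whisper + NLLB":            {"MuST-C v2": 39.84, "MuST-C v3": 40.30, "IWSLT'21": 43.84, "IWSLT'22": 41.86},
    "SeamlessM4T":               {"MuST-C v2": 32.62, "MuST-C v3": 33.36, "IWSLT'21": 35.97, "IWSLT'22": 34.08},
    "Whisper enc. + Gemma 2 9B": {"MuST-C v2": 41.33, "MuST-C v3": 41.16, "IWSLT'21": 40.76, "IWSLT'22": 39.64},
}


def bleu_gap(system, baseline):
    """BLEU difference (system minus baseline) per test set, rounded to 2 decimals."""
    return {
        test_set: round(bleu[system][test_set] - bleu[baseline][test_set], 2)
        for test_set in bleu[system]
    }


vs_cascade = bleu_gap("Whisper enc. + Gemma 2 9B", "Whisper + NLLB")
vs_seamless = bleu_gap("Whisper enc. + Gemma 2 9B", "SeamlessM4T")
```

On these figures the end-to-end model leads the cascade on both MuST-C sets and trails it on both IWSLT sets, while leading SeamlessM4T on all four.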

COMET Performance

The best model (Whisper enc. + Gemma 2 9B) on the $\text{COMET}^{\text{DA}}_{22}$ metric:

  • MuST-C v2: 84.22 (vs 83.00 cascade system)
  • MuST-C v3: 83.65 (vs 82.49 cascade system)
  • Up to 8% score gain compared to SeamlessM4T

Ablation Study Findings

  1. LLM Selection: Among the four LLMs tested, Gemma 2 9B performs best across all test sets
  2. Encoder Comparison: Whisper encoder consistently outperforms HuBERT
  3. Adapter Effectiveness: Both CTC folding and convolutional downsampling effectively compress sequence length

Experimental Findings

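  1. End-to-end vs Cascade: The best end-to-end model exceeds the cascade system in BLEU on MuST-C v2 and v3 (41.33 vs 39.84 and 41.16 vs 40.30) but trails it on IWSLT'21 and IWSLT'22 (40.76 vs 43.84 and 39.64 vs 41.86)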
  2. Model Scale: Larger LLMs (Gemma 2 9B) yield better performance
  3. Speech Representation: The quality of pre-trained speech encoders directly impacts final performance

Speech Translation Research Directions

  1. Cascade Methods: Traditional ASR+MT pipeline, still the mainstream approach
  2. End-to-end Methods: Direct conversion from speech to target language text, avoiding intermediate representations
  3. Multimodal LLMs: Recent research extending LLMs to other modalities such as speech

Advantages of This Work

  1. Unified Framework: Addresses both ASR and ST tasks simultaneously, rather than optimizing for a single task
  2. Modular Design: Allows flexible replacement of different speech encoder and LLM components
  3. Practicality: Provides an end-to-end solution while maintaining competitive performance

Conclusions and Discussion

Main Conclusions

  1. An end-to-end architecture integrating pre-trained speech encoders and LLMs achieves competitive performance on English-German speech translation tasks
  2. The best model not only surpasses SeamlessM4T but also approaches the performance of the Whisper+NLLB cascade system
  3. The model can simultaneously execute ASR and ST tasks, providing a unified solution

Limitations

  1. Data Constraints: Validation only on high-resource English-German language pairs; effectiveness on low-resource languages remains unknown
  2. Computational Efficiency: Slower inference speed and larger model size compared to baseline models
  3. ASR Performance: Still lags behind specialized Whisper models in speech recognition tasks
  4. Training Data: The MuST-C dataset is relatively small (400 hours), potentially limiting model potential

Future Directions

  1. Language Pair Expansion: Validate effectiveness across more language directions
  2. Model Compression: Reduce model size through techniques such as knowledge distillation
  3. Adapter Improvements: Explore more advanced modality adaptation methods such as Q-Former
  4. Reinforcement Learning: Integrate RL techniques to further optimize performance

In-depth Evaluation

Strengths

  1. Innovative Architecture: Effectively combines the advantages of speech foundational models and LLMs
  2. Comprehensive Experiments: Systematic comparison of various encoder and LLM combinations
  3. Practical Value: Provides a unified end-to-end solution
  4. Technical Details: Detailed description of modality adaptation and training strategies
  5. Openness: Uses open-source models, facilitating reproducibility

Weaknesses

  1. Language Coverage: Validation only on a single English-German language pair, limited generalizability
  2. Computational Cost: Lacks detailed analysis of training and inference computational overhead
  3. Error Analysis: Insufficient in-depth analysis of model failure cases
  4. Theoretical Analysis: Lacks theoretical explanation for why this architecture is effective
  5. Data Sensitivity: Insufficient analysis of model sensitivity to training data scale

Impact

  1. Academic Contribution: Provides a new end-to-end solution for the speech translation field
  2. Practical Value: Applicable to real-world multilingual speech processing systems
  3. Reproducibility: Uses open-source components, facilitating subsequent research
  4. Inspirational Value: Provides valuable exploration for multimodal LLM applications

Applicable Scenarios

  1. Multilingual Conferences: Real-time speech translation and transcription
  2. Educational Platforms: Automatic subtitles and translation for multilingual online courses
  3. Customer Service: Cross-lingual speech interaction systems
  4. Media Processing: Automatic transcription and translation of audio content

References

The paper cites important works in speech translation, large language models, and multimodal learning, including:

  • Whisper (Radford et al., 2022): Powerful speech recognition foundational model
  • SeamlessM4T (Seamless Communication et al., 2023): Multimodal translation model baseline
  • MuST-C (Cattoni et al., 2021): Standard speech translation dataset
  • QLoRA (Dettmers et al., 2023): Parameter-efficient fine-tuning technique

This paper proposes a promising end-to-end solution in the speech translation field. While there remains room for improvement in certain aspects, it provides valuable exploration and empirical results for multimodal LLM applications.