End-to-end Automatic Speech Recognition and Speech Translation: Integration of Speech Foundational Models and LLMs
Luu, Bojar
Speech Translation (ST) is a machine translation task that converts speech signals in one language into the corresponding text in another language. The task has two main approaches: the traditional cascade and the more recent end-to-end approach. This paper explores a combined end-to-end architecture of pre-trained speech encoders and Large Language Models (LLMs) for performing both Automatic Speech Recognition (ASR) and ST simultaneously. Experiments with the English-to-German language pair show that our best model not only achieves better translation results than SeamlessM4T, a large foundational end-to-end, multi-modal translation model, but also matches the performance of a cascaded system with Whisper and NLLB, with a score gain of up to 8% in the $\text{COMET}^{\text{DA}}_{22}$ metric.
This research aims to address efficiency and performance challenges in the Speech Translation (ST) task. ST converts speech signals in one language into text in another language, using either the traditional cascade approach (ASR→MT) or the more recent end-to-end approach.
Simplified Architecture: End-to-end methods can avoid intermediate ASR steps, simplifying the overall system architecture
Error Propagation: Cascade systems suffer from error propagation issues, where errors in the ASR stage affect subsequent translation quality
LLM Potential: Large language models demonstrate strong capabilities in natural language tasks, but their application in multimodal tasks requires further exploration
The core idea is to combine the high-quality audio representations of pre-trained speech encoders with the language processing abilities of LLMs, yielding an end-to-end architecture that performs ASR and ST simultaneously.
Proposed an end-to-end architecture integrating speech foundational models and LLMs, capable of simultaneously executing automatic speech recognition and speech translation tasks
Designed effective modality adaptation mechanisms, including two length adapters: CTC folding and convolutional downsampling
Achieved better translation performance than SeamlessM4T on the English-to-German language pair, matching the performance of the Whisper+NLLB cascade system
Provided detailed experimental analysis, comparing the effects of different LLM and speech encoder combinations
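The two length adapters named in the contributions can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes one common form of CTC folding (consecutive frames that share the same CTC argmax label are averaged into one vector, and blank runs are optionally dropped), and it models convolutional downsampling only through the standard output-length formula for a strided 1-D convolution. The function names, the blank index, and the kernel, stride and padding defaults are all hypothetical.

```python
from itertools import groupby
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank label (hypothetical choice)


def ctc_fold(
    frames: Sequence[Sequence[float]],
    labels: Sequence[int],
    blank: int = BLANK,
    drop_blank: bool = True,
) -> List[List[float]]:
    """Average each run of consecutive frames that share a CTC label.

    `frames` holds one feature vector per encoder time step and `labels`
    holds the per-frame CTC argmax prediction for the same steps.
    """
    if len(frames) != len(labels):
        raise ValueError("frames and labels must have the same length")

    folded: List[List[float]] = []
    start = 0
    for label, run in groupby(labels):
        length = len(list(run))
        segment = frames[start:start + length]
        start += length
        if drop_blank and label == blank:
            continue  # blank runs carry no token, so this variant skips them
        dim = len(segment[0])
        folded.append([sum(f[d] for f in segment) / length for d in range(dim)])
    return folded


def conv_output_length(
    n_frames: int, kernel: int = 3, stride: int = 2, padding: int = 1
) -> int:
    """Sequence length after one strided 1-D convolution (dilation 1)."""
    return (n_frames + 2 * padding - kernel) // stride + 1


if __name__ == "__main__":
    frames = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0],
              [3.0, 6.0], [6.0, 6.0], [9.0, 6.0]]
    labels = [4, 4, 0, 7, 7, 7]
    print(ctc_fold(frames, labels))   # two vectors, one per non-blank run
    print(conv_output_length(conv_output_length(100)))  # two stacked layers
```

The practical difference is that CTC folding produces a content-dependent output length (one vector per predicted token run), while convolutional downsampling shrinks every input by a fixed factor regardless of what was said.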
The paper cites important works in speech translation, large language models, and multimodal learning, including:
Whisper (Radford et al., 2022): Powerful speech recognition foundational model
SeamlessM4T (Seamless Communication et al., 2023): Multimodal translation model baseline
MuST-C (Cattoni et al., 2021): Standard speech translation dataset
QLoRA (Dettmers et al., 2023): Parameter-efficient fine-tuning technique
This paper proposes a promising end-to-end solution in the speech translation field. While there remains room for improvement in certain aspects, it provides valuable exploration and empirical results for multimodal LLM applications.