DPO-Tuned Large Language Models for Segmentation in Simultaneous Speech Translation
Yang, Nakamura
Simultaneous speech translation requires accurate segmentation to balance translation quality and latency. Recent studies such as SHAS have introduced pretrained segmentation models that outperform heuristic rules. However, these models are still constrained by supervised learning objectives and do not incorporate human preference alignment, which is crucial for natural real-time interpretation. In this work, we propose a segmentation framework based on large language models (LLMs) trained with Direct Preference Optimization (DPO). By leveraging preference alignment, our method enables LLMs to predict natural segmentation points that better meet the demands of real-time translation. We evaluate the system on the ACL 60/60 corpus across three language pairs (English to Japanese, Chinese, and German), using SeamlessM4T v2 as the translation backbone. Experimental results show that our DPO-tuned LLM achieves higher segmentation accuracy than SHAS and yields consistent improvements in translation quality (BLEU, COMET) as well as latency (Average Lagging). Furthermore, existing IWSLT baselines allow direct comparison with our system. These findings highlight the potential of preference-tuned LLMs to surpass existing pretrained segmentation models and advance adaptive, human-aligned simultaneous interpretation.
Simultaneous speech translation requires accurate segmentation to balance translation quality and latency. While pre-trained segmentation models such as SHAS outperform heuristic rules, they remain constrained by supervised learning objectives and lack alignment with human preferences. This paper proposes a segmentation framework based on large language models trained with Direct Preference Optimization (DPO), enabling LLMs to predict more natural segmentation points through preference alignment. Evaluation on three language pairs from the ACL 60/60 corpus, with SeamlessM4T v2 as the translation backbone, shows that DPO-tuned LLMs surpass SHAS in segmentation accuracy, with consistent improvements in both translation quality (BLEU, COMET) and latency (Average Lagging).
The fundamental challenge in simultaneous speech translation (SimulST) is minimizing latency while maintaining translation quality, requiring the system to accurately determine when to segment the input stream and output translations. Improper segmentation results in incomplete or redundant translation units, severely impacting accuracy and user experience.
Segmentation is considered a core component of practical SimulST systems, particularly in streaming SimulST, where improper boundaries significantly degrade translation quality and latency. Traditional heuristic rules (such as punctuation prediction and fixed-length chunking), while simple and efficient, often fail to adapt to diverse linguistic structures and speaking styles.
Heuristic approaches: Fixed wait-k strategies and similar methods have limited adaptability to linguistic variation
Pre-trained models: Models like SHAS, while more robust than heuristic methods, remain constrained by supervised learning objectives and rely solely on acoustic features
Lack of human preference alignment: Existing methods do not incorporate alignment with machine translation performance, which is crucial for natural and timely translation
Large language models have demonstrated superior generalization capabilities in speech and translation tasks, but their potential in SimulST segmentation remains underexplored. Direct Preference Optimization (DPO) offers a promising direction for aligning models with human feedback, enabling preference-guided decision-making beyond supervised training.
Proposes a DPO-optimized LLM segmentation framework: The first application of preference optimization to SimulST segmentation
Conducts a comprehensive experimental evaluation: Evaluation on three language pairs using the ACL 60/60 dataset with SeamlessM4T v2 as the translation backbone
Demonstrates the superiority of preference-tuned LLMs: Improvements in both translation quality and latency compared to the pre-trained segmentation model SHAS
Provides a complete end-to-end system: Integrates the segmentation module with the translation system for real-time simultaneous speech translation
The segmentation task in SimulST is defined as predicting sentence boundaries in the incoming speech stream, with the objective of balancing translation quality and latency. Given a streaming input speech sequence x, the model produces a sequence of segmentation decisions {s₁, s₂, ..., s_T}, where each s_t (1 ≤ t ≤ T) represents a predicted boundary position. Unlike binary classification approaches, this work formulates segmentation as a next-boundary prediction problem.
The framework employs Qwen2.5-Omni-3B as the segmentation backbone model, operating in a streaming fashion with a sliding window over the speech input. The model directly processes chunk-level acoustic features rather than token-level ASR transcriptions, incrementally predicting the next segmentation point given the current speech context.
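The streaming, next-boundary formulation described above can be illustrated with a minimal sketch. It is not the authors' implementation: `predict_next_boundary` is a hypothetical stand-in for the LLM call, `max_window` is an assumed cap on the sliding window, and the forced cut when the window fills is an assumption added here to bound latency.

```python
from typing import Callable, Iterable, Iterator, List, Optional, Sequence

# Hypothetical predictor interface: given the buffered frames, return the
# offset (within the buffer) of the next boundary, or None if undecided.
BoundaryPredictor = Callable[[Sequence[float]], Optional[int]]


def stream_segments(
    chunks: Iterable[Sequence[float]],
    predict_next_boundary: BoundaryPredictor,
    max_window: int,
) -> Iterator[List[float]]:
    """Yield segments incrementally as chunks of the input stream arrive."""
    buffer: List[float] = []
    for chunk in chunks:
        buffer.extend(chunk)
        while True:
            boundary = predict_next_boundary(buffer)
            # Assumption: force a cut once the window is full, so latency stays bounded.
            if boundary is None and len(buffer) >= max_window:
                boundary = max_window
            if boundary is None or boundary <= 0 or boundary > len(buffer):
                break  # wait for more input
            yield buffer[:boundary]
            buffer = buffer[boundary:]
    if buffer:
        yield buffer  # flush whatever remains at end of stream


def toy_predictor(buffer: Sequence[float]) -> Optional[int]:
    """Toy rule for the demo only: cut right after the first zero-valued frame."""
    for i, value in enumerate(buffer):
        if value == 0.0:
            return i + 1
    return None


demo_chunks = [[1.0, 1.0], [0.0, 1.0], [1.0, 0.0], [1.0]]
demo_segments = list(stream_segments(demo_chunks, toy_predictor, max_window=10))
```

In a real system each yielded segment would be passed to the translation backbone as soon as it is emitted, which is what ties boundary placement to latency.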
Direct Preference Optimization is employed to fine-tune the LLM:
Given an input utterance x, multiple candidate segmentations are generated, with each segmentation y represented as a sequence of boundary indices on the input stream. Preference pairs (y_pref, y_dispref) are constructed, where y_pref represents the preferred segmentation yielding better translation quality and lower latency.
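For reference, the standard DPO objective from Rafailov et al., 2023 can be written in the notation above as follows. This is the general form of the loss, not an equation taken from this summary; π_ref (a frozen reference policy) and σ (the logistic sigmoid) belong to the standard formulation and are assumed here.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\, y_{\mathrm{pref}},\, y_{\mathrm{dispref}})}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_{\mathrm{pref}} \mid x)}{\pi_{\mathrm{ref}}(y_{\mathrm{pref}} \mid x)}
        - \beta \log \frac{\pi_\theta(y_{\mathrm{dispref}} \mid x)}{\pi_{\mathrm{ref}}(y_{\mathrm{dispref}} \mid x)}
      \right)
    \right]
```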
In the DPO objective, π_θ represents the policy induced by the LLM and β is a scaling hyperparameter. Training proceeds for 5 epochs using standard learning rate scheduling.
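The pair construction and the per-pair loss can be sketched as below. This is an illustration under stated assumptions, not the paper's procedure: the summary does not say how quality and latency are combined, so Pareto dominance is assumed as the preference rule. The `Candidate` fields, the function names, and the default `beta=0.1` are also hypothetical.

```python
import math
from dataclasses import dataclass
from itertools import combinations
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Candidate:
    boundaries: Tuple[int, ...]  # boundary indices on the input stream
    quality: float               # translation-quality score, higher is better
    latency: float               # latency measurement, lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """Assumed rule: a is no worse on both axes and strictly better on one."""
    no_worse = a.quality >= b.quality and a.latency <= b.latency
    strictly_better = a.quality > b.quality or a.latency < b.latency
    return no_worse and strictly_better


def build_preference_pairs(
    candidates: Sequence[Candidate],
) -> List[Tuple[Candidate, Candidate]]:
    """Return (y_pref, y_dispref) pairs; incomparable candidates are skipped."""
    pairs = []
    for a, b in combinations(candidates, 2):
        if dominates(a, b):
            pairs.append((a, b))
        elif dominates(b, a):
            pairs.append((b, a))
    return pairs


def dpo_loss(
    logp_pref: float,
    logp_dispref: float,
    ref_logp_pref: float,
    ref_logp_dispref: float,
    beta: float = 0.1,  # illustrative value, not taken from the paper
) -> float:
    """Standard per-pair DPO loss, -log(sigmoid(margin)), from sequence log-probs."""
    margin = beta * ((logp_pref - ref_logp_pref) - (logp_dispref - ref_logp_dispref))
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2. Raising the policy's relative probability of the preferred segmentation lowers it.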
Analysis via latency-quality trade-off curves demonstrates that DPO-trained LLMs consistently outperform other segmentation strategies across the entire operating range, achieving higher BLEU scores at similar or lower latency.
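The latency axis of such curves is typically Average Lagging. A minimal sketch of the commonly used token-level definition follows. It assumes `delays[t]` is the number of source units consumed before target token t is emitted. Speech-level variants measure delays in time rather than tokens, and this summary does not state which evaluation tooling the paper used. The function and argument names are illustrative.

```python
from typing import Sequence


def average_lagging(delays: Sequence[float], src_len: float) -> float:
    """Token-level Average Lagging.

    delays[t]: source units read before emitting target token t (0-indexed).
    src_len:   total number of source units.
    """
    if not delays or src_len <= 0:
        raise ValueError("delays must be non-empty and src_len positive")
    gamma = len(delays) / src_len  # target-to-source length ratio
    total = 0.0
    steps = 0
    for t, delay in enumerate(delays):
        # Lag relative to an ideal translator that stays perfectly in sync.
        total += delay - t / gamma
        steps += 1
        if delay >= src_len:
            break  # stop at the first token emitted after the full source is read
    return total / steps


# A wait-3 policy on equal-length source and target lags by 3 units.
wait3_al = average_lagging([3, 4, 5, 5, 5], src_len=5)
```

Plotting BLEU against this value for several operating points of each segmenter gives the kind of trade-off curve described above.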
Large-scale multilingual multimodal translation systems such as SeamlessM4T provide strong backbones for speech translation tasks, demonstrating state-of-the-art performance across multiple languages.
DPO effectiveness: Preference optimization enables models to learn segmentation aligned with human preferences, producing more natural boundaries and better quality-latency trade-offs
Performance improvements: Consistent improvements across three language directions compared to SHAS at approximately 3 seconds latency
Practical value: Demonstrates the potential of preference-tuned LLMs in real-time simultaneous interpretation
The paper cites important works in related fields, including:
SHAS segmentation model (Tsiamas et al., 2022)
SeamlessM4T translation system (Meta AI, 2023-2024)
DPO optimization method (Rafailov et al., 2023)
ACL 60/60 evaluation benchmark (Salesky et al., 2023)
Overall Assessment: This is a technically innovative paper and the first to introduce preference optimization to SimulST segmentation. The methodology is sound and the experimental results are convincing. While there is room for improvement in theoretical analysis and computational efficiency, the paper makes valuable contributions to the field and opens new research directions.