DPO-Tuned Large Language Models for Segmentation in Simultaneous Speech Translation

Yang, Nakamura

Simultaneous speech translation requires accurate segmentation to balance translation quality and latency. Recent work such as SHAS has introduced pretrained segmentation models that outperform heuristic rules. However, these models are still constrained by supervised learning objectives and do not incorporate human preference alignment, which is crucial for natural real-time interpretation. In this work, we propose a segmentation framework based on large language models (LLMs) trained with Direct Preference Optimization (DPO). By leveraging preference alignment, our method enables LLMs to predict natural segmentation points that better meet the demands of real-time translation. We evaluate the system on the ACL 60/60 corpus across three language pairs (English to Japanese, Chinese, and German), using SeamlessM4T v2 as the translation backbone. Experimental results show that our DPO-tuned LLM achieves higher segmentation accuracy than SHAS and yields consistent improvements in translation quality (BLEU, COMET) as well as latency (Average Lagging). The system is also compared directly against IWSLT baselines. These findings highlight the potential of preference-tuned LLMs to surpass existing pretrained segmentation models and advance adaptive, human-aligned simultaneous interpretation.

Basic Information

Paper ID: 2510.12195
Title: DPO-Tuned Large Language Models for Segmentation in Simultaneous Speech Translation
Authors: Zeyu Yang (CUHK, Shenzhen), Satoshi Nakamura (CUHK, Shenzhen & NAIST, Japan)
Classification: cs.CL (Computational Linguistics)
Publication Date: October 14, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.12195

Abstract

Simultaneous speech translation requires accurate segmentation to balance translation quality and latency. While pre-trained segmentation models such as SHAS outperform heuristic rules, they remain constrained by supervised learning objectives and lack alignment with human preferences. This paper proposes a segmentation framework based on large language models (LLMs) trained with Direct Preference Optimization (DPO), enabling LLMs to predict more natural segmentation points through preference alignment. Evaluation on three language pairs using the ACL 60/60 corpus with SeamlessM4T v2 as the translation backbone shows that DPO-tuned LLMs surpass SHAS in segmentation accuracy, with consistent improvements in both translation quality (BLEU, COMET) and latency (Average Lagging).

Research Background and Motivation

Core Problem

The fundamental challenge in simultaneous speech translation (SimulST) is minimizing latency while maintaining translation quality, requiring the system to accurately determine when to segment the input stream and output translations. Improper segmentation results in incomplete or redundant translation units, severely impacting accuracy and user experience.

Problem Significance

Segmentation is considered a core component of practical SimulST systems, particularly in streaming SimulST, where improper boundaries significantly degrade translation quality and latency. Traditional heuristic rules (such as punctuation prediction and fixed-length chunking), while simple and efficient, often fail to adapt to diverse linguistic structures and speaking styles.

Limitations of Existing Methods

Heuristic approaches: Fixed wait-k strategies and similar methods have limited adaptability to linguistic variation
Pre-trained models: Models like SHAS, while more robust than heuristic methods, remain constrained by supervised learning objectives and rely solely on acoustic features
Lack of human preference alignment: Existing methods are not aligned with human preferences or with downstream machine translation performance, which is crucial for natural and timely translation

Research Motivation

Large language models have demonstrated superior generalization capabilities in speech and translation tasks, but their potential in SimulST segmentation remains underexplored. Direct Preference Optimization (DPO) offers a promising direction for aligning models with human feedback, enabling preference-guided decision-making beyond supervised training.

Core Contributions

Proposes a DPO-optimized LLM segmentation framework: First application of preference optimization to SimulST segmentation tasks
Constructs comprehensive experimental evaluation: Evaluation on three language pairs using the ACL 60/60 dataset with SeamlessM4T v2 as the translation backbone
Demonstrates superiority of preference-tuned LLMs: Improvements in both translation quality and latency compared to the pre-trained segmentation model SHAS
Provides complete end-to-end system: Integrates segmentation module with translation system for real-time simultaneous speech translation

Methodology

Task Definition

The segmentation task in SimulST is defined as predicting sentence boundaries in the incoming speech stream, with the objective of balancing translation quality and latency. Given a streaming input speech sequence x, the model produces a sequence of segmentation decisions {s_1, s_2, ..., s_T}, where each s_t (1 ≤ t ≤ T) represents a predicted boundary position. Unlike binary classification approaches, this work formulates segmentation as a next-boundary prediction problem.

Model Architecture

Base LLM

The framework employs Qwen2.5-Omni-3B as the segmentation backbone model, operating in a streaming fashion with a sliding window mechanism over the speech input. The model directly processes chunk-level acoustic features rather than token-level ASR transcriptions, incrementally predicting the next segmentation point given the current speech context.

Preference Pair Construction

To incorporate human alignment signals, preference pairs of candidate segmentations are constructed:

Generate candidate boundaries by combining multiple heuristic and pre-trained strategies (VAD, fixed-length segmentation, SHAS output)
Evaluate each candidate segmentation using translation quality (BLEU) and latency (average lagging)
Derive ranking signals from these metrics, with better-performing segmentations as preferred candidates
Obtain approximately 8,000 preference pairs for training
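
The pair-construction steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the Candidate container, the score function with its latency_weight, the min_margin filter, and all numbers are hypothetical. The source only states that rankings are derived from BLEU and average lagging.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Candidate:
    """One candidate segmentation of an utterance (hypothetical container)."""
    source: str         # strategy that produced it, e.g. "vad", "fixed", "shas"
    boundaries: tuple   # boundary positions in seconds
    bleu: float         # quality of the resulting translation (higher is better)
    latency_ms: float   # average lagging in milliseconds (lower is better)


def score(c: Candidate, latency_weight: float = 0.001) -> float:
    """Assumed scalar ranking signal: reward BLEU, penalize latency."""
    return c.bleu - latency_weight * c.latency_ms


def build_preference_pairs(candidates, min_margin: float = 0.5):
    """Return (preferred, dispreferred) pairs whose score gap is at least min_margin."""
    pairs = []
    for a, b in combinations(candidates, 2):
        gap = score(a) - score(b)
        if abs(gap) < min_margin:
            continue  # too close to call, so skip the ambiguous pair
        pairs.append((a, b) if gap > 0 else (b, a))
    return pairs


# Illustrative numbers only; they are not results from the paper.
candidates = [
    Candidate("fixed", (4.0, 8.0, 12.0), bleu=18.0, latency_ms=3000.0),
    Candidate("vad", (3.1, 7.4, 11.9), bleu=21.5, latency_ms=3030.0),
    Candidate("shas", (3.3, 7.9, 12.2), bleu=23.5, latency_ms=3100.0),
]
pairs = build_preference_pairs(candidates)
for preferred, dispreferred in pairs:
    print(f"{preferred.source} preferred over {dispreferred.source}")
```

How quality and latency are combined into a single ranking is a design choice the summary does not specify; a weighted difference is only one option, and a Pareto-style rule (prefer a candidate only if it is better on both metrics) would be another.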

DPO Training

Direct Preference Optimization is employed to fine-tune the LLM:

Given an input utterance x, multiple candidate segmentations are generated, with each segmentation y represented as a sequence of boundary indices on the input stream. Preference pairs (y_pref, y_dispref) are constructed, where y_pref represents the preferred segmentation yielding better translation quality and lower latency.

The DPO objective function is:

L(θ) = -E_{(x,y_pref,y_dispref)} [log σ(β · (log π_θ(y_pref | x) - log π_θ(y_dispref | x)))]

where π_θ represents the policy induced by the LLM and β is a scaling hyperparameter. Training proceeds for 5 epochs using standard learning rate scheduling.
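
As a rough sketch, the objective above can be computed from sequence log-probabilities as shown below. The function names, the default beta of 0.1, and the optional reference arguments are assumptions for illustration; the summary does not report the value of β. Note that the standard DPO formulation (Rafailov et al., 2023) measures each log-probability relative to a frozen reference policy; leaving the reference arguments unset reproduces the objective exactly as written above.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_pref, logp_dispref, beta=0.1,
             ref_logp_pref=None, ref_logp_dispref=None):
    """Mean DPO loss over a batch of sequence log-probabilities.

    logp_pref[i] and logp_dispref[i] are log pi_theta(y | x) for the preferred
    and dispreferred segmentation of example i. The optional ref_* lists hold
    the same quantities under a frozen reference policy (assumed extension).
    """
    n = len(logp_pref)
    if ref_logp_pref is None:
        ref_logp_pref = [0.0] * n
    if ref_logp_dispref is None:
        ref_logp_dispref = [0.0] * n

    total = 0.0
    for lp, ld, rp, rd in zip(logp_pref, logp_dispref,
                              ref_logp_pref, ref_logp_dispref):
        margin = (lp - rp) - (ld - rd)
        total += -log_sigmoid(beta * margin)
    return total / n


# The loss shrinks as the policy assigns more probability to the preferred output.
print(dpo_loss([-1.0], [-5.0]))
print(dpo_loss([-5.0], [-1.0]))
```

When the two log-probabilities are equal the margin is zero and the loss is log 2, which is the value the model starts from if it has no preference between the two segmentations.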

Technical Innovations

Preference alignment mechanism: First application of DPO to segmentation tasks, guiding model learning through preference signals (derived here from BLEU and latency rankings of candidate segmentations)
End-to-end optimization: Directly optimizes combined objectives of translation quality and latency, rather than relying solely on acoustic features
Streaming processing architecture: Designs a sliding window mechanism suitable for real-time processing
Multimodal fusion: Combines acoustic features and language model capabilities for segmentation decisions

Experimental Setup

Datasets

Training data: CoVoST2 corpus for constructing preference pairs for DPO training
Evaluation data: ACL 60/60 test set containing technical talks from ACL 2022
Language pairs: English→Japanese, English→Chinese, English→German

Evaluation Metrics

Translation quality: BLEU score (the abstract also reports COMET)
Latency: StreamLAAL (the streaming variant of Length-Adaptive Average Lagging), reflecting system latency under actual streaming conditions
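
For orientation, length-adaptive average lagging (LAAL) for a single utterance can be sketched as below. This is a minimal sketch of the commonly used non-streaming definition, not the evaluation code behind the reported numbers; the function name and argument layout are assumptions, and the streaming variant's handling of unsegmented audio is not shown.

```python
def laal_ms(delays_ms, source_duration_ms, reference_length):
    """Length-adaptive average lagging for one utterance, in milliseconds.

    delays_ms[i] is how much source audio (ms) had been consumed when the
    i-th target word was emitted. reference_length is the number of words in
    the reference translation.
    """
    hyp_length = len(delays_ms)
    if hyp_length == 0:
        raise ValueError("at least one emitted word is required")

    # Pace of an ideal translator that spreads its words evenly over the
    # source. Using max() keeps over-long hypotheses from lowering the score.
    step = source_duration_ms / max(hyp_length, reference_length)

    total, count = 0.0, 0
    for i, delay in enumerate(delays_ms):
        total += delay - i * step
        count += 1
        if delay >= source_duration_ms:
            break  # stop at the first word emitted after the full source
    return total / count


# A system that emits one word per second over a 4-second utterance
# lags the ideal pace by 1000 ms.
print(laal_ms([1000, 2000, 3000, 4000], 4000, 4))
```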

Baseline Methods

IWSLT baseline: Fixed-length chunking and VAD-based segmentation
SHAS: Re-implemented pre-trained segmentation model

Implementation Details

Model: Qwen2.5-Omni-3B as segmentation backbone
Training settings: 5 epochs, batch size 1, AdamW optimizer, learning rate 5×10⁻⁵
Hardware: 4 NVIDIA A100 GPUs
Inference settings: Sliding window size 4 seconds, hop size 2 seconds
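
The inference settings above imply overlapping windows over the incoming audio. A minimal sketch of that windowing is shown below; the function name is hypothetical, and clipping the final window to the end of the stream is an assumption, since the summary does not say how the tail is handled.

```python
def sliding_windows(total_sec: float, window_sec: float = 4.0, hop_sec: float = 2.0):
    """Yield (start, end) times in seconds for each window over the stream."""
    start = 0.0
    while start < total_sec:
        # Clip the last window so it never runs past the end of the audio.
        yield (start, min(start + window_sec, total_sec))
        if start + window_sec >= total_sec:
            break
        start += hop_sec


windows = list(sliding_windows(10.0))
for start, end in windows:
    print(f"window {start:.1f}s to {end:.1f}s")
```

With a 4-second window and a 2-second hop, consecutive windows share half of their audio, so each stretch of speech is seen twice with different left and right context.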

Experimental Results

Main Results

Method          En→De          En→Ja          En→Zh
Fixed           18.2 / ~3000   - / -          17.0 / 3000
VAD             21.8 / 3030    16.0 / 3010    20.5 / 3020
SHAS            23.6 / 3100    17.2 / 3050    22.0 / 3090
Ours (LLM+DPO)  25.5 / 3078    18.6 / 3120    23.4 / 3160

Note: Each cell is BLEU (↑) / latency in ms (↓)

Key Findings

Consistent improvements: Surpasses the heuristic baselines and SHAS in BLEU across all three translation directions
Significant quality gains: Average improvement of approximately 1.6 BLEU over SHAS (+1.9 En→De, +1.4 En→Ja, +1.4 En→Zh), with latency within 70 ms of SHAS (22 ms lower for En→De, 70 ms higher for En→Ja and En→Zh)
Language pair differences: En→De achieves highest BLEU, En→Zh shows moderate gains, En→Ja remains most challenging
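
The gains over SHAS can be recomputed directly from the SHAS and Ours rows of the results table. The snippet below is only a worked arithmetic check; the dictionary layout and variable names are illustrative.

```python
# (BLEU, latency in ms) per direction, copied from the results table.
shas = {"En-De": (23.6, 3100), "En-Ja": (17.2, 3050), "En-Zh": (22.0, 3090)}
ours = {"En-De": (25.5, 3078), "En-Ja": (18.6, 3120), "En-Zh": (23.4, 3160)}

# Per-direction differences of the DPO-tuned LLM relative to SHAS.
bleu_gain = {k: round(ours[k][0] - shas[k][0], 1) for k in shas}
latency_delta = {k: ours[k][1] - shas[k][1] for k in shas}

mean_bleu_gain = sum(bleu_gain.values()) / len(bleu_gain)
mean_latency_delta = sum(latency_delta.values()) / len(latency_delta)

print(bleu_gain)
print(latency_delta)
print(f"mean BLEU gain: {mean_bleu_gain:.2f}")
print(f"mean latency change: {mean_latency_delta:.1f} ms")
```

The mean BLEU gain comes out at about 1.57, and the mean latency change at about 39 ms, with En→De slightly faster than SHAS and the other two directions 70 ms slower.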

Latency-Quality Trade-off Analysis

Analysis via latency-quality trade-off curves demonstrates that DPO-trained LLMs consistently outperform other segmentation strategies across the entire operating range, achieving higher BLEU scores at similar or lower latency.

Related Work

Segmentation Method Development

Heuristic methods: Fixed wait-k strategies and similar approaches, limited in adapting to linguistic variation
Trainable methods: DiSeg introduces differentiable segmentation modules jointly trained with translation models through expectation training
Pre-trained models: Models like SHAS improve robustness through large-scale training

Multilingual Translation Systems

Large-scale multilingual multimodal translation systems such as SeamlessM4T provide strong backbones for speech translation tasks, demonstrating state-of-the-art performance across multiple languages.

Research Gap

To the authors' knowledge, no prior work has applied preference-based optimization to segmentation tasks in SimulST, and this work fills this gap.

Conclusions and Discussion

Main Conclusions

DPO effectiveness: Preference optimization enables models to learn segmentation aligned with human preferences, producing more natural boundaries and better quality-latency trade-offs
Performance improvements: Consistent improvements across three language directions compared to SHAS at approximately 3 seconds latency
Practical value: Demonstrates the potential of preference-tuned LLMs in real-time simultaneous interpretation

Limitations

Limited evaluation scope: Restricted to three language pairs; broader directional diversity needed to verify generalization
Computational overhead: The 3B parameter LLM introduces additional computational costs, potentially limiting deployment on resource-constrained devices
Stability issues: BLEU fluctuations observed at specific latency thresholds, indicating room for improvement in segmentation stability
Evaluation metric limitations: Reliance on BLEU and latency as automatic metrics; lacks human evaluation

Future Directions

Extend to more language pairs and domains
Optimize model efficiency for real-time deployment
Introduce human evaluation to validate automatic metrics
Explore more sophisticated preference modeling approaches

In-Depth Evaluation

Strengths

Strong novelty: First application of DPO to SimulST segmentation, opening new research directions
Sound methodology: The preference alignment approach addresses core limitations of existing methods and aligns with practical application requirements
Comprehensive experiments: Thorough evaluation across multiple language pairs with consistent and convincing results
High practical value: Provides a complete end-to-end system with potential for real-world deployment

Weaknesses

Insufficient theoretical analysis: Lacks in-depth theoretical analysis of why DPO is effective for segmentation tasks
Simplistic preference pair construction: Preference pairs based solely on BLEU and latency may lack comprehensiveness
Computational efficiency concerns: Real-time performance of the 3B parameter model may become a bottleneck for practical applications
Limited evaluation metrics: Primarily relies on automatic metrics without subjective quality assessment

Impact

Academic contribution: Introduces a new optimization paradigm to the SimulST segmentation field
Practical value: Provides improved segmentation solutions for real-time speech translation systems
Inspirational significance: Demonstrates the potential of preference learning in sequential decision-making tasks

Applicable Scenarios

Real-time conference interpretation: Simultaneous translation scenarios requiring low latency and high quality
Live caption generation: Applications with high segmentation quality requirements
Multilingual customer service systems: Requiring natural and fluent real-time translation interactions

References

The paper cites important works in related fields, including:

SHAS segmentation model (Tsiamas et al., 2022)
SeamlessM4T translation system (Meta AI, 2023-2024)
DPO optimization method (Rafailov et al., 2023)
ACL 60/60 evaluation benchmark (Salesky et al., 2023)

Overall Assessment: This is a technically innovative paper that introduces preference optimization to SimulST segmentation for the first time. The methodology is sound and the experimental results are convincing. While there is room for improvement in theoretical analysis and computational efficiency, the paper makes a valuable contribution to the field and opens new research directions.