DPO-Tuned Large Language Models for Segmentation in Simultaneous Speech Translation

Yang, Nakamura
Simultaneous speech translation requires accurate segmentation to balance translation quality and latency. Recent studies such as SHAS have introduced pretrained segmentation models, achieving stronger performance than heuristic rules. However, segmentation models such as SHAS, though pretrained and more robust than heuristic methods, are still constrained by supervised learning objectives and do not incorporate human preference alignment, which is crucial for natural real-time interpretation. In this work, we propose a segmentation framework based on large language models (LLMs) trained with Direct Preference Optimization (DPO). By leveraging preference alignment, our method enables LLMs to predict natural segmentation points that better meet the demands of real-time translation. We evaluate the system on the ACL 60/60 corpus across three language pairs (English-Japanese, Chinese, German), using SeamlessM4T v2 as the translation backbone. Experimental results show that our DPO-tuned LLM achieves higher segmentation accuracy than SHAS and yields consistent improvements in translation quality (BLEU, COMET) as well as latency (Average Lagging). Furthermore, our system benefits from IWSLT baselines for direct comparison. These findings highlight the potential of preference-tuned LLMs to surpass existing pretrained segmentation models and advance adaptive, human-aligned simultaneous interpretation.

Basic Information

  • Paper ID: 2510.12195
  • Title: DPO-Tuned Large Language Models for Segmentation in Simultaneous Speech Translation
  • Authors: Zeyu Yang (CUHK, Shenzhen), Satoshi Nakamura (CUHK, Shenzhen & NAIST, Japan)
  • Classification: cs.CL (Computation and Language)
  • Publication Date: October 14, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.12195

Abstract

Simultaneous speech translation requires accurate segmentation to balance translation quality and latency. While pre-trained segmentation models such as SHAS outperform heuristic rules, they remain constrained by supervised learning objectives and lack alignment with human preferences. This paper proposes a segmentation framework based on large language models (LLMs) trained with Direct Preference Optimization (DPO), enabling the LLM to predict more natural segmentation points through preference alignment. Evaluation on the ACL 60/60 corpus across three language pairs, with SeamlessM4T v2 as the translation backbone, shows that the DPO-tuned LLM surpasses SHAS in segmentation accuracy, with consistent improvements in both translation quality (BLEU, COMET) and latency (Average Lagging).

Research Background and Motivation

Core Problem

The fundamental challenge in simultaneous speech translation (SimulST) is minimizing latency while maintaining translation quality, requiring the system to accurately determine when to segment the input stream and output translations. Improper segmentation results in incomplete or redundant translation units, severely impacting accuracy and user experience.

Problem Significance

Segmentation is considered a core component of practical SimulST systems, particularly in streaming SimulST, where improper boundaries significantly degrade translation quality and latency. Traditional heuristic rules (such as punctuation prediction and fixed-length chunking), while simple and efficient, often fail to adapt to diverse linguistic structures and speaking styles.

Limitations of Existing Methods

  1. Heuristic approaches: Fixed wait-k strategies and similar methods have limited adaptability to linguistic variation
  2. Pre-trained models: Models like SHAS, while more robust than heuristic methods, remain constrained by supervised learning objectives and rely solely on acoustic features
  3. Lack of human preference alignment: Existing methods are not aligned with human preferences or with downstream machine translation performance, which is crucial for natural and timely translation

Research Motivation

Large language models have demonstrated superior generalization capabilities in speech and translation tasks, but their potential in SimulST segmentation remains underexplored. Direct Preference Optimization (DPO) offers a promising direction for aligning models with human feedback, enabling preference-guided decision-making beyond supervised training.

Core Contributions

  1. Proposes a DPO-optimized LLM segmentation framework: First application of preference optimization to SimulST segmentation tasks
  2. Constructs comprehensive experimental evaluation: Evaluation on three language pairs using the ACL 60/60 dataset with SeamlessM4T v2 as the translation backbone
  3. Demonstrates superiority of preference-tuned LLMs: Improvements in both translation quality and latency compared to the pre-trained segmentation model SHAS
  4. Provides complete end-to-end system: Integrates segmentation module with translation system for real-time simultaneous speech translation

Methodology

Task Definition

The segmentation task in SimulST is defined as predicting sentence boundaries in the incoming speech stream, with the objective of balancing translation quality and latency. Given a streaming input speech sequence x, the model produces a sequence of segmentation decisions {s₁, s₂, ..., sₙ}, where each sᵢ represents a predicted boundary position. Rather than classifying each position as boundary or non-boundary, this work formulates segmentation as a next-boundary prediction problem.
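To make the formulation concrete, the following minimal sketch shows one way to represent a segmentation as a sorted list of boundary positions and to query the next boundary from the current stream position. The function names and the use of seconds as the unit are assumptions for illustration; the summary does not state how boundary positions are encoded.

```python
def boundaries_to_segments(boundaries, stream_end):
    """Turn sorted boundary positions into (start, end) segments.

    `boundaries` and `stream_end` are assumed to be in seconds.
    """
    segments = []
    start = 0.0
    for b in boundaries:
        if not start < b <= stream_end:
            raise ValueError("boundaries must be increasing and inside the stream")
        segments.append((start, b))
        start = b
    # Whatever follows the last boundary forms the final segment.
    if start < stream_end:
        segments.append((start, stream_end))
    return segments


def next_boundary(boundaries, current_time):
    """Return the first boundary strictly after `current_time`, or None."""
    for b in boundaries:
        if b > current_time:
            return b
    return None
```

For example, boundaries at 2.5 s and 6.0 s in an 8-second stream give three segments, and the next boundary after 3.0 s is 6.0 s.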

Model Architecture

Base LLM

The framework employs Qwen2.5-Omni-3B as the segmentation backbone, operating in a streaming fashion with a sliding window over the speech input. The model directly processes chunk-level acoustic features rather than token-level ASR transcriptions, incrementally predicting the next segmentation point given the current speech context.

Preference Pair Construction

To incorporate human alignment signals, preference pairs of candidate segmentations are constructed:

  1. Generate candidate boundaries by combining multiple heuristic and pre-trained strategies (VAD, fixed-length segmentation, SHAS output)
  2. Evaluate each candidate segmentation using translation quality (BLEU) and latency (average lagging)
  3. Derive ranking signals from these metrics, with better-performing segmentations as preferred candidates
  4. Obtain approximately 8,000 preference pairs for training
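The steps above can be sketched as follows. This is an illustrative outline only: the `Candidate` class, the `utility` scalarization (BLEU minus a weighted latency in seconds), and the `latency_weight` and `min_margin` parameters are assumptions, since the summary does not say how BLEU and latency are combined into a single ranking signal.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Candidate:
    name: str           # which strategy produced it (e.g. VAD, fixed-length, SHAS)
    boundaries: tuple   # boundary positions for one utterance
    bleu: float         # translation quality of this segmentation
    latency_ms: float   # latency of this segmentation


def utility(c, latency_weight=1.0):
    # Assumed scalarization: reward BLEU, penalize latency (in seconds).
    return c.bleu - latency_weight * (c.latency_ms / 1000.0)


def build_preference_pairs(candidates, latency_weight=1.0, min_margin=0.0):
    """Return (preferred, dispreferred) pairs for one utterance."""
    pairs = []
    for a, b in combinations(candidates, 2):
        ua, ub = utility(a, latency_weight), utility(b, latency_weight)
        # Skip near-ties, which would give a noisy preference label.
        if abs(ua - ub) <= min_margin:
            continue
        pairs.append((a, b) if ua > ub else (b, a))
    return pairs
```

Repeating this over many utterances and pooling the pairs would yield a training set of the kind described; the threshold for discarding near-ties is a design choice, not something reported in the source.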

DPO Training

Direct Preference Optimization is employed to fine-tune the LLM:

Given an input utterance x, multiple candidate segmentations are generated, with each segmentation y represented as a sequence of boundary indices on the input stream. Preference pairs (y_pref, y_dispref) are constructed, where y_pref represents the preferred segmentation yielding better translation quality and lower latency.

The DPO objective function is:

L(θ) = -E_{(x,y_pref,y_dispref)} [log σ(β · (log π_θ(y_pref | x) - log π_θ(y_dispref | x)))]

where π_θ represents the policy induced by the LLM and β is a scaling hyperparameter. Training proceeds for 5 epochs using standard learning rate scheduling.
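As a worked illustration of the objective, the sketch below evaluates the loss for given sequence log-probabilities in plain Python. The value β = 0.1 is an assumed default, as the summary does not report it. The optional `ref_logp_*` arguments reflect the standard DPO formulation of Rafailov et al. (2023), which measures log-probabilities relative to a frozen reference policy; with both left at 0.0 the function reduces to the expression as written above.

```python
import math


def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)).
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_pref, logp_dispref, beta=0.1,
             ref_logp_pref=0.0, ref_logp_dispref=0.0):
    """Per-pair loss: -log sigmoid(beta * margin)."""
    margin = (logp_pref - ref_logp_pref) - (logp_dispref - ref_logp_dispref)
    return -log_sigmoid(beta * margin)


def mean_dpo_loss(pairs, beta=0.1):
    """Average loss over (logp_pref, logp_dispref) pairs, approximating the expectation."""
    return sum(dpo_loss(p, d, beta) for p, d in pairs) / len(pairs)
```

When the policy assigns equal probability to both segmentations the loss is log 2, and it decreases as the preferred segmentation becomes more likely than the dispreferred one.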

Technical Innovations

  1. Preference alignment mechanism: First application of DPO to segmentation tasks, guiding model learning through human preference signals
  2. End-to-end optimization: Directly optimizes combined objectives of translation quality and latency, rather than relying solely on acoustic features
  3. Streaming processing architecture: Designs a sliding window mechanism suitable for real-time processing
  4. Multimodal fusion: Combines acoustic features and language model capabilities for segmentation decisions

Experimental Setup

Datasets

  • Training data: CoVoST2 corpus for constructing preference pairs for DPO training
  • Evaluation data: ACL 60/60 test set containing technical talks from ACL 2022
  • Language pairs: English→Japanese, English→Chinese, English→German

Evaluation Metrics

  • Translation quality: BLEU score
  • Latency: Streaming LAAL (streaming Length-Adaptive Average Lagging), reflecting system latency under actual streaming conditions
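For readers unfamiliar with the latency metric, the sketch below computes length-adaptive average lagging for a single segment, following the commonly used definition in which each emitted target token's delay is compared against an ideal evenly paced schedule, normalized by the longer of the hypothesis and the reference. The function name and argument layout are assumptions; the streaming variant used in the paper also has to map an unsegmented stream back to reference segments, which is not shown here.

```python
def laal_ms(delays_ms, source_len_ms, ref_len):
    """Length-adaptive average lagging for one segment, in milliseconds.

    delays_ms[i] is the amount of source audio (ms) consumed when the
    i-th target token was emitted; ref_len is the reference length in tokens.
    """
    if not delays_ms:
        raise ValueError("at least one emitted target token is required")
    if source_len_ms <= 0 or ref_len <= 0:
        raise ValueError("source_len_ms and ref_len must be positive")

    hyp_len = len(delays_ms)
    # Ideal source time per target token; using max() avoids rewarding over-generation.
    step = source_len_ms / max(hyp_len, ref_len)

    # tau: index (1-based) of the first token emitted after the full source was read.
    tau = hyp_len
    for i, d in enumerate(delays_ms, start=1):
        if d >= source_len_ms:
            tau = i
            break

    lag_sum = sum(delays_ms[i] - i * step for i in range(tau))
    return lag_sum / tau
```

A system that emits one token per second of a three-second segment with a one-second delay lags by 1000 ms, whereas a system that waits for the whole segment before emitting anything lags by the full 3000 ms.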

Baseline Methods

  • IWSLT baseline: Fixed-length chunking and VAD-based segmentation
  • SHAS: Re-implemented pre-trained segmentation model

Implementation Details

  • Model: Qwen2.5-Omni-3B as segmentation backbone
  • Training settings: 5 epochs, batch size 1, AdamW optimizer, learning rate 5×10⁻⁵
  • Hardware: 4 NVIDIA A100 GPUs
  • Inference settings: Sliding window size 4 seconds, hop size 2 seconds
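The inference settings imply overlapping windows over the audio stream. A minimal sketch of how such windows could be enumerated is shown below; the function name and the handling of the final partial window are assumptions, while the 4-second window and 2-second hop come from the settings listed above.

```python
def sliding_windows(duration_s, window_s=4.0, hop_s=2.0):
    """Return (start, end) times in seconds of overlapping windows over a stream."""
    if window_s <= 0 or hop_s <= 0:
        raise ValueError("window_s and hop_s must be positive")
    windows = []
    start = 0.0
    while start < duration_s:
        # Clip the last window to the end of the audio.
        end = min(start + window_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:
            break
        start += hop_s
    return windows
```

With a 2-second hop and a 4-second window, each window shares half of its audio with its predecessor, so the model sees every boundary region in at least two contexts.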

Experimental Results

Main Results

Method            En→De         En→Ja         En→Zh
Fixed             18.2/~3000    -/-           17.0/3000
VAD               21.8/3030     16.0/3010     20.5/3020
SHAS              23.6/3100     17.2/3050     22.0/3090
Ours (LLM+DPO)    25.5/3078     18.6/3120     23.4/3160

Note: Format is BLEU(↑)/Latency(ms, ↓)

Key Findings

  1. Consistent improvements: Surpasses heuristic baselines and SHAS model across all three translation directions
  2. Significant quality gains: Average improvement of approximately 1.6 BLEU over SHAS (+1.9 on En→De, +1.4 on En→Ja, +1.4 on En→Zh), with latency staying within 70 ms of SHAS in every direction (−22 ms, +70 ms, +70 ms)
  3. Language pair differences: En→De achieves highest BLEU, En→Zh shows moderate gains, En→Ja remains most challenging
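The per-direction gaps between the proposed system and SHAS can be recomputed directly from the main results table. The short script below does this arithmetic; the dictionary layout and function names are illustrative, and the numbers are copied from the table.

```python
# (BLEU, latency in ms) per language pair, copied from the main results table.
RESULTS = {
    "SHAS": {"En-De": (23.6, 3100), "En-Ja": (17.2, 3050), "En-Zh": (22.0, 3090)},
    "Ours (LLM+DPO)": {"En-De": (25.5, 3078), "En-Ja": (18.6, 3120), "En-Zh": (23.4, 3160)},
}


def compare(system, baseline, results=RESULTS):
    """Return {pair: (BLEU gain, latency change in ms)} of system over baseline."""
    out = {}
    for pair, (bleu, latency) in results[system].items():
        base_bleu, base_latency = results[baseline][pair]
        out[pair] = (round(bleu - base_bleu, 1), latency - base_latency)
    return out


def mean(values):
    values = list(values)
    return sum(values) / len(values)


gaps = compare("Ours (LLM+DPO)", "SHAS")
mean_bleu_gain = mean(g for g, _ in gaps.values())
mean_latency_change = mean(d for _, d in gaps.values())
```

Here `gaps` holds the BLEU gain and latency change for each direction, and the two means summarize them across the three language pairs.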

Latency-Quality Trade-off Analysis

Analysis via latency-quality trade-off curves demonstrates that DPO-trained LLMs consistently outperform other segmentation strategies across the entire operating range, achieving higher BLEU scores at similar or lower latency.

Related Work

Segmentation Method Development

  • Heuristic methods: Fixed wait-k strategies and similar approaches, limited in adapting to linguistic variation
  • Trainable methods: DiSeg introduces differentiable segmentation modules jointly trained with translation models through expectation training
  • Pre-trained models: Models like SHAS improve robustness through large-scale training

Multilingual Translation Systems

Large-scale multilingual multimodal translation systems such as SeamlessM4T provide strong backbones for speech translation tasks, demonstrating state-of-the-art performance across multiple languages.

Research Gap

To the authors' knowledge, no prior work has applied preference-based optimization to segmentation tasks in SimulST, and this work fills this gap.

Conclusions and Discussion

Main Conclusions

  1. DPO effectiveness: Preference optimization enables models to learn segmentation aligned with human preferences, producing more natural boundaries and better quality-latency trade-offs
  2. Performance improvements: Consistent improvements across three language directions compared to SHAS at approximately 3 seconds latency
  3. Practical value: Demonstrates the potential of preference-tuned LLMs in real-time simultaneous interpretation

Limitations

  1. Limited evaluation scope: Restricted to three language pairs; broader directional diversity needed to verify generalization
  2. Computational overhead: The 3B parameter LLM introduces additional computational costs, potentially limiting deployment on resource-constrained devices
  3. Stability issues: BLEU fluctuations observed at specific latency thresholds, indicating room for improvement in segmentation stability
  4. Evaluation metric limitations: Reliance on BLEU and latency as automatic metrics; lacks human evaluation

Future Directions

  1. Extend to more language pairs and domains
  2. Optimize model efficiency for real-time deployment
  3. Introduce human evaluation to validate automatic metrics
  4. Explore more sophisticated preference modeling approaches

In-Depth Evaluation

Strengths

  1. Strong novelty: First application of DPO to SimulST segmentation, opening new research directions
  2. Sound methodology: The preference alignment approach addresses core limitations of existing methods and aligns with practical application requirements
  3. Comprehensive experiments: Thorough evaluation across multiple language pairs with consistent and convincing results
  4. High practical value: Provides a complete end-to-end system with potential for real-world deployment

Weaknesses

  1. Insufficient theoretical analysis: Lacks in-depth theoretical analysis of why DPO is effective for segmentation tasks
  2. Simplistic preference pair construction: Preference pairs based solely on BLEU and latency may lack comprehensiveness
  3. Computational efficiency concerns: Real-time performance of the 3B parameter model may become a bottleneck for practical applications
  4. Limited evaluation metrics: Primarily relies on automatic metrics without subjective quality assessment

Impact

  1. Academic contribution: Introduces a new optimization paradigm to the SimulST segmentation field
  2. Practical value: Provides improved segmentation solutions for real-time speech translation systems
  3. Inspirational significance: Demonstrates the potential of preference learning in sequential decision-making tasks

Applicable Scenarios

  1. Real-time conference interpretation: Simultaneous translation scenarios requiring low latency and high quality
  2. Live caption generation: Applications with high segmentation quality requirements
  3. Multilingual customer service systems: Requiring natural and fluent real-time translation interactions

References

The paper cites important works in related fields, including:

  • SHAS segmentation model (Tsiamas et al., 2022)
  • SeamlessM4T translation system (Meta AI, 2023-2024)
  • DPO optimization method (Rafailov et al., 2023)
  • ACL 60/60 evaluation benchmark (Salesky et al., 2023)

Overall Assessment: This technically innovative paper is the first to apply preference optimization to SimulST segmentation. The methodology is sound and the experimental results are convincing. While there is room for improvement in theoretical analysis and computational efficiency, the paper makes valuable contributions to the field and opens new research directions.