
SLIDE: Integrating Speech Language Model with LLM for Spontaneous Spoken Dialogue Generation

Lu, Cheng, Luo et al.
Recently, "textless" speech language models (SLMs) based on speech units have made huge progress in generating naturalistic speech, including non-verbal vocalizations. However, the generated speech samples often lack semantic coherence. In this paper, we propose SLM and LLM Integration for spontaneous spoken Dialogue gEneration (SLIDE). Specifically, we first utilize an LLM to generate the textual content of spoken dialogue. Next, we convert the textual dialogues into phoneme sequences and use a two-tower transformer-based duration predictor to predict the duration of each phoneme. Finally, an SLM conditioned on the spoken phoneme sequences is used to vocalize the textual dialogue. Experimental results on the Fisher dataset demonstrate that our system can generate naturalistic spoken dialogue while maintaining high semantic coherence.


Basic Information

  • Paper ID: 2501.00805
  • Title: SLIDE: Integrating Speech Language Model with LLM for Spontaneous Spoken Dialogue Generation
  • Authors: Haitian Lu, Gaofeng Cheng, Liuping Luo, Leying Zhang, Yanmin Qian, Pengyuan Zhang
  • Classification: eess.AS cs.CL cs.SD
  • Publication Date: January 1, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2501.00805

Abstract

Recent "textless" speech language models (SLMs) based on speech units have made significant progress in generating natural speech, including non-verbal vocalizations. However, the generated speech often lacks semantic coherence. This paper proposes SLIDE (SLM and LLM Integration for spontaneous spoken Dialogue gEneration). The method first uses an LLM to generate the textual content of a spoken dialogue. It then converts the text into phoneme sequences and uses a dual-tower transformer-based duration predictor to predict the duration of each phoneme. Finally, an SLM conditioned on the spoken phoneme sequences vocalizes the textual dialogue. Experiments on the Fisher dataset show that the system generates natural spoken dialogue while maintaining high semantic coherence.

Research Background and Motivation

Problem Definition

This research addresses the core tension in spontaneous spoken dialogue generation: how to ensure semantic coherence while maintaining speech naturalness. Spoken dialogue encompasses two key aspects:

  1. Semantic Aspect: The meaningfulness of dialogue content, which is crucial for conveying accurate and relevant information
  2. Naturalness Aspect: The fluency of turn-taking, including inter-pausal units (IPUs), overlaps, gaps, pauses, as well as natural dialogue events such as laughter and feedback

Limitations of Existing Approaches

  1. Traditional Cascaded Systems: While offering strong semantic coherence (benefiting from LLMs trained on hundreds of billions of words), they have limited ability to generate natural dialogue because:
    • They do not account for turn-taking events within any component
    • They struggle to generate natural dialogue containing laughter and feedback
    • The intermediate stage of encoding speech as text loses paralinguistic information
  2. SLM-based Methods (e.g., dGSLM): Effectively capture dialogue elements and turn-taking patterns but face semantic coherence challenges:
    • Speech units are very fine-grained (typically only 20 ms each), which makes them poorly suited to modeling semantic content over an extended context
    • This fine granularity also greatly increases the amount of training data required

Research Motivation

This paper proposes a hybrid approach that leverages text to capture semantic context while using speech units to preserve paralinguistic information (such as non-verbal vocalizations and turn-taking patterns), aiming to combine the advantages of traditional cascaded systems and SLM-based systems.

Core Contributions

  1. Integration of LLM into Spoken Dialogue Generation Framework: Leverages LLM to generate textual dialogue, fully utilizing the advanced text generation capabilities of LLMs
  2. Proposed Dual-Tower Transformer-based Phoneme Duration Prediction: Uses a dual-tower transformer model to predict the duration of each phoneme in written phoneme sequences, ensuring the preservation of turn-taking fluency
  3. Conditioned dGSLM on Spoken Phoneme Sequences: Conditions dGSLM with spoken phoneme sequences derived from textual dialogue, effectively incorporating natural dialogue events into generated speech while maintaining semantic coherence

Methodology Details

Task Definition

  • Input: Prompt dialogue audio
  • Output: Semantically coherent and natural spoken dialogue continuation
  • Constraints: Generated dialogue must satisfy both semantic coherence and naturalness (including turn-taking, non-verbal vocalizations, etc.)

Model Architecture

The SLIDE model comprises three main components:

1. LLM-Driven Textual Dialogue Generation

  • Uses a speech recognition model (Whisper-v3) to transcribe prompt dialogue audio into text
  • Leverages an LLM (GPT-4o) to generate dialogue continuation, guiding it to produce colloquial-style dialogue
  • The generated text excludes dialogue event markers (such as laughter or sighs) and keeps only verbal feedback (backchannels) like "yeah," "right," and "okay"

2. Dual-Tower Transformer-based Written Phoneme Sequence Duration Prediction

  • Obtains training data using forced alignment models on ground truth transcriptions in the training dataset
  • Introduces additional silence phonemes, repeating each phoneme according to durations determined by forced alignment
  • Training phase: Uses teacher forcing, with a loss function that combines an edge unit loss and an edge duration loss
  • Inference phase: Generates unconditionally, then applies a replacement mechanism so that the output stays aligned with the written phoneme sequence
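
The data preparation for the duration predictor (repeating each phoneme according to its aligned duration and filling unaligned stretches with silence) can be sketched as follows. This is a minimal illustration, not the authors' code: the 20 ms frame length is borrowed from the speech-unit granularity mentioned earlier, while the `SIL` label, the millisecond alignment format, and the function names are assumptions made for the example.

```python
from itertools import groupby
from typing import List, Tuple

# (phoneme, start_ms, end_ms) triples, as a forced aligner might return them.
# This format is an assumption for the example.
Alignment = List[Tuple[str, int, int]]


def to_spoken_sequence(alignment: Alignment, total_ms: int,
                       frame_ms: int = 20, sil: str = "SIL") -> List[str]:
    """Expand an alignment into one phoneme label per frame."""
    n_frames = total_ms // frame_ms
    frames = [sil] * n_frames  # frames not covered by any phoneme stay silent
    for phone, start_ms, end_ms in alignment:
        first = start_ms // frame_ms
        last = min(end_ms // frame_ms, n_frames)
        for t in range(first, last):
            frames[t] = phone
    return frames


def run_lengths(frames: List[str]) -> List[Tuple[str, int]]:
    """Collapse frame labels into (phoneme, n_frames) pairs.

    Adjacent identical labels are merged into a single run.
    """
    return [(phone, len(list(group))) for phone, group in groupby(frames)]


alignment = [("HH", 0, 40), ("AY", 40, 100), ("Y", 160, 200)]
spoken = to_spoken_sequence(alignment, total_ms=200)
print(spoken)
print(run_lengths(spoken))
```

The frame-level list plays the role of a spoken phoneme sequence, and the run lengths correspond to the per-phoneme durations a predictor would be trained to output.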

3. dGSLM Speech Dialogue Generation Conditioned on Spoken Phoneme Sequences

  • Training phase: Uses a HuBERT encoder to encode spoken dialogue into audio tokens; the spoken phoneme sequence concatenated with the audio tokens serves as both input and training target
  • Each dialogue sample is segmented into 80-second intervals containing 8000 discrete tokens (the first 4000 are the spoken phoneme sequence, the latter 4000 the audio tokens)
  • Inference phase: Adjusts the spoken phoneme sequence to a fixed length of 4000 tokens, then autoregressively generates the audio tokens
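
The token budget above follows from the frame rate: 80 seconds at one token per 20 ms gives 4000 tokens per half. The sketch below shows one way such a fixed-length input could be assembled. It is illustrative only: the summary does not say how sequences are adjusted to length, so the pad-or-truncate policy, the `PAD_ID` value, and the function names are assumptions, and the two-channel structure of dGSLM is ignored.

```python
from typing import List

SEGMENT_SEC = 80
FRAME_MS = 20  # one token per 20 ms, assumed for both halves
HALF_LEN = SEGMENT_SEC * 1000 // FRAME_MS  # 4000 tokens per half
PAD_ID = 0  # hypothetical padding token id


def fit_length(ids: List[int], length: int, pad_id: int = PAD_ID) -> List[int]:
    """Truncate or right-pad a token list to exactly `length` items."""
    return ids[:length] + [pad_id] * max(0, length - len(ids))


def build_input(phoneme_ids: List[int], audio_ids: List[int]) -> List[int]:
    """Concatenate the phoneme half and the audio half into one sequence."""
    return fit_length(phoneme_ids, HALF_LEN) + fit_length(audio_ids, HALF_LEN)


example = build_input(phoneme_ids=[5] * 10, audio_ids=[7] * 5000)
print(HALF_LEN, len(example))
```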

Technical Innovations

  1. Hybrid Representation Strategy: Innovatively combines the semantic modeling capability of text with the prosodic/paralinguistic modeling capability of speech units
  2. Conditioned Generation Mechanism: Constrains dGSLM output through conditioning on spoken phoneme sequences, ensuring semantic coherence of generated dialogue
  3. Temporal Alignment Processing: Maintains temporal correspondence between phoneme sequences and audio through duration prediction and repetition mechanisms
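
The way the three components hand data to one another can be sketched as a single function. This is a structural illustration under stated assumptions, not the paper's implementation: every stage is passed in as a hypothetical callable (no real ASR, LLM, or SLM API is used), and the insertion of silence phonemes is omitted for brevity.

```python
from typing import Callable, List


def slide_pipeline(
    prompt_audio: object,
    transcribe: Callable[[object], str],               # hypothetical ASR stage
    continue_dialogue: Callable[[str], str],           # hypothetical LLM stage
    to_phonemes: Callable[[str], List[str]],           # hypothetical G2P stage
    predict_durations: Callable[[List[str]], List[int]],  # frames per phoneme
    vocalize: Callable[[List[str]], List[int]],        # hypothetical conditioned SLM
) -> List[int]:
    """Chain the stages: audio prompt -> text -> phonemes -> audio tokens."""
    prompt_text = transcribe(prompt_audio)
    continuation = continue_dialogue(prompt_text)
    written = to_phonemes(continuation)
    durations = predict_durations(written)
    # Repeat each written phoneme by its predicted duration to obtain the
    # spoken phoneme sequence that conditions the SLM.
    spoken = [p for p, d in zip(written, durations) for _ in range(d)]
    return vocalize(spoken)


# Toy stand-ins so the sketch runs end to end.
tokens = slide_pipeline(
    prompt_audio=None,
    transcribe=lambda audio: "hi",
    continue_dialogue=lambda text: "yeah",
    to_phonemes=lambda text: ["Y", "AE"],
    predict_durations=lambda phones: [2, 1],
    vocalize=lambda spoken: [len(p) for p in spoken],
)
print(tokens)
```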

Experimental Setup

Dataset

  • Fisher Dataset: 2000 hours of stereo telephone conversation audio, recorded at 8 kHz and resampled to 16 kHz
  • Each dialogue sample is segmented into 80-second intervals for training

Evaluation Metrics

Objective Evaluation

  1. Naturalness Assessment:
    • Temporal distribution statistics of turn-taking events (IPUs, overlaps, gaps, pauses)
    • Relevant statistics computed using pyannote.audio
  2. Semantic Coherence Assessment:
    • Transcribes generated spoken dialogue using Whisper-v3
    • Computes perplexity of text transcriptions using DialoGPT
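
For reference, perplexity is the exponential of the average negative log-likelihood per token. The sketch below computes it from a list of per-token natural-log probabilities; obtaining those probabilities from a language model such as DialoGPT is outside the scope of the example, and the function name is an assumption.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Return exp(-mean(log p)) for natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that assigns probability 0.25 to every token has perplexity 4.
uniform = perplexity([math.log(0.25)] * 4)
print(uniform)
```

Lower values mean the scoring model finds the transcript more predictable, which is why the metric is used here as a proxy for semantic coherence.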

Subjective Evaluation

  • N-MOS (Naturalness Mean Opinion Score): Assesses natural dialogue events and turn-taking fluency
  • M-MOS (Meaningfulness Mean Opinion Score): Assesses logical consistency and meaningfulness of dialogue
  • Score range: 1-5, with at least 5 raters per sample

Comparison Methods

  • Cascaded System: Traditional cascaded approach (ASR+LLM+TTS)
  • dGSLM: Original generative spoken dialogue language model
  • SLIDE-1: Directly uses textual dialogue from test dataset
  • SLIDE-2: Uses textual dialogue generated by LLM

Implementation Details

  • Training with 6 A100 40GB GPUs
  • Duration predictor: batch size 48, 50,000 training steps
  • Conditioned dGSLM: batch size 96, 250,000 training steps
  • Generation temperature set to 1

Experimental Results

Main Results

Turn-taking Event Statistics

Model | IPUs/min | Pauses/min | Gaps/min | Overlaps/min
Cascaded | 17.5 | 0.0 | 14.9 | 0.0
dGSLM | 30.6 | 12.0 | 9.0 | 8.7
SLIDE-1 | 25.6 | 9.4 | 5.6 | 9.5
SLIDE-2 | 31.3 | 6.3 | 7.6 | 15.8
Ground Truth | 27.3 | 9.9 | 8.9 | 8.2
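
To make the four event types concrete, the sketch below counts them from per-speaker speech segments. It is a simplified illustration, not the evaluation code used in the paper: it assumes the segments (IPUs) have already been obtained from a voice activity detector, treats a silence as a pause when the same speaker talks before and after it and as a gap otherwise, and counts an overlap for each pair of cross-speaker segments that intersect. All names are hypothetical.

```python
from typing import Dict, List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec) of one IPU


def turn_taking_counts(a: List[Segment], b: List[Segment]) -> Dict[str, int]:
    """Count IPUs, pauses, gaps, and overlaps for a two-speaker dialogue."""
    ipus = sorted([(s, e, "A") for s, e in a] + [(s, e, "B") for s, e in b])
    counts = {"ipus": len(ipus), "pauses": 0, "gaps": 0, "overlaps": 0}
    if not ipus:
        return counts
    # Overlaps: cross-speaker segment pairs that share some time span.
    for sa, ea in a:
        for sb, eb in b:
            if min(ea, eb) > max(sa, sb):
                counts["overlaps"] += 1
    # Silences: track the latest end time seen so far and who produced it.
    cur_end, cur_spk = ipus[0][1], ipus[0][2]
    for start, end, spk in ipus[1:]:
        if start > cur_end:  # nobody was speaking between cur_end and start
            counts["pauses" if spk == cur_spk else "gaps"] += 1
        if end > cur_end:
            cur_end, cur_spk = end, spk
    return counts


def per_minute(counts: Dict[str, int], duration_sec: float) -> Dict[str, float]:
    """Normalize raw counts to events per minute."""
    return {name: value * 60.0 / duration_sec for name, value in counts.items()}


speaker_a = [(0.0, 2.0), (2.5, 4.0), (6.0, 7.0)]
speaker_b = [(3.5, 5.0), (7.5, 8.0)]
counts = turn_taking_counts(speaker_a, speaker_b)
print(counts)
print(per_minute(counts, duration_sec=8.0))
```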

Semantic Coherence and Subjective Evaluation

Model | Perplexity ↓ | N-MOS ↑ | M-MOS ↑
Cascaded | - | 2.38±0.63 | 2.70±0.38
dGSLM | 1228.82 | 4.14±0.78 | 1.52±0.40
SLIDE-1 | 532.81 | 4.37±0.46 | 3.94±0.81
SLIDE-2 | 421.29 | 4.06±0.41 | 4.08±0.49
Ground Truth | 371.16 | 4.72±0.40 | 4.63±0.44

Key Findings

  1. Significant Improvement in Semantic Coherence: SLIDE-2 reduces perplexity by about 65.7% relative to dGSLM (from 1228.82 to 421.29), approaching the level of real dialogue (371.16)
  2. Preservation of Naturalness: SLIDE performs comparably to dGSLM in turn-taking event statistics, significantly outperforming the cascaded system
  3. Substantial Improvement in Meaningfulness: SLIDE-2's M-MOS rises from 1.52 (dGSLM) to 4.08, roughly 2.7 times higher (a relative increase of about 168%), leaving a relative gap of only 11.9% from real dialogue (4.63)
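
The relative changes behind these findings can be recomputed directly from the tabulated perplexity and M-MOS values. The snippet below is a small arithmetic check; the helper name is an assumption, and small differences from any quoted percentage may come from rounding.

```python
def relative_change(new: float, old: float) -> float:
    """Signed percentage change from `old` to `new`."""
    return (new - old) / old * 100.0


ppl_change = relative_change(421.29, 1228.82)   # SLIDE-2 vs dGSLM perplexity
mmos_change = relative_change(4.08, 1.52)       # SLIDE-2 vs dGSLM M-MOS
mmos_gap = (4.63 - 4.08) / 4.63 * 100.0         # gap to ground-truth M-MOS

print(f"perplexity change: {ppl_change:.1f}%")
print(f"M-MOS change: {mmos_change:.1f}%")
print(f"M-MOS gap to ground truth: {mmos_gap:.1f}%")
```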

Ablation Study

The comparison between SLIDE-1 and SLIDE-2 validates the effectiveness of LLM-generated textual dialogue, demonstrating that good semantic coherence can be maintained even when using LLM-generated text rather than ground truth transcriptions.

Main Directions in Spoken Dialogue Generation

  1. Traditional Cascaded Methods: ASR→LLM→TTS pipeline, strong semantics but poor naturalness
  2. SLM-based Methods: Such as dGSLM, strong naturalness but poor semantic coherence
  3. Hybrid Methods: SLIDE proposed in this paper belongs to this emerging direction

Advantages of This Work

Compared to existing work, SLIDE aims to balance semantic coherence and naturalness, addressing the trade-off between the two by conditioning the SLM on spoken phoneme sequences derived from LLM-generated text.

Conclusions and Discussion

Main Conclusions

SLIDE successfully combines the semantic modeling capability of LLMs with the prosodic modeling capability of SLMs, significantly improving semantic coherence while maintaining the naturalness of spoken dialogue, providing a new solution for spontaneous spoken dialogue generation.

Limitations

  1. Computational Complexity: Requires training multiple model components with high computational cost
  2. Data Dependency: Still requires large-scale spoken dialogue data for training
  3. Domain Adaptability: Trained on Fisher dataset; generalization ability to other domains remains to be verified
  4. Real-time Performance: Multi-stage processing may impact response speed for real-time dialogue generation

Future Directions

  1. Explore end-to-end joint training strategies
  2. Investigate more lightweight model architectures
  3. Extend to multilingual and cross-domain scenarios
  4. Improve efficiency for real-time dialogue generation

In-Depth Evaluation

Strengths

  1. Strong Innovation: First to propose a hybrid architecture combining LLM and SLM, addressing the long-standing trade-off between semantic coherence and naturalness
  2. Reasonable Method Design: Clear three-stage pipeline design with well-defined component functions and feasible technical approach
  3. Comprehensive Experiments: Includes both objective and subjective evaluation, comprehensive comparison methods, and ablation studies validating design effectiveness
  4. Significant Results: Achieves substantial improvement in semantic coherence (about 65.7% perplexity reduction relative to dGSLM) while maintaining naturalness

Shortcomings

  1. System Complexity: Multi-stage pipeline increases system complexity, potentially affecting practicality and robustness
  2. Computational Efficiency: Requires running multiple large models with high computational cost, presenting challenges for real-time applications
  3. Error Propagation: Pipeline architecture may suffer from error accumulation, with errors from earlier stages affecting subsequent processing
  4. Generalization Ability: Validated only on Fisher dataset; cross-domain and multilingual generalization capabilities remain unknown

Impact

  1. Academic Value: Provides new research direction for spoken dialogue generation field, balancing semantic and prosodic modeling
  2. Practical Potential: Has practical value in virtual assistants, dialogue systems, and other applications
  3. Reproducibility: Provides implementation details and open-source code, facilitating reproduction and improvement

Applicable Scenarios

  1. Dialogue Systems: Intelligent assistants requiring generation of natural and meaningful spoken responses
  2. Speech Synthesis: Dialogue-style TTS systems requiring high naturalness
  3. Education and Training: Spoken dialogue training and language learning applications
  4. Entertainment Media: Games, virtual characters, and other scenarios requiring natural dialogue

References

This paper cites 34 relevant references covering important works in speech language models, large language models, dialogue generation, speech synthesis, and other related fields, providing a solid theoretical foundation for the research.


Overall Assessment: This is a high-quality research paper that innovatively addresses key challenges in spoken dialogue generation. While facing challenges in system complexity and computational efficiency, its technical contributions and experimental results are highly convincing, providing valuable new insights for the development of this field.