SLIDE: Integrating Speech Language Model with LLM for Spontaneous Spoken Dialogue Generation

Lu, Cheng, Luo et al.

Recently, "textless" speech language models (SLMs) based on speech units have made huge progress in generating naturalistic speech, including non-verbal vocalizations. However, the generated speech samples often lack semantic coherence. In this paper, we propose SLM and LLM Integration for spontaneous spoken Dialogue gEneration (SLIDE). Specifically, we first utilize an LLM to generate the textual content of spoken dialogue. Next, we convert the textual dialogues into phoneme sequences and use a two-tower transformer-based duration predictor to predict the duration of each phoneme. Finally, an SLM conditioned on the spoken phoneme sequences is used to vocalize the textual dialogue. Experimental results on the Fisher dataset demonstrate that our system can generate naturalistic spoken dialogue while maintaining high semantic coherence.

Basic Information

Paper ID: 2501.00805
Title: SLIDE: Integrating Speech Language Model with LLM for Spontaneous Spoken Dialogue Generation
Authors: Haitian Lu, Gaofeng Cheng, Liuping Luo, Leying Zhang, Yanmin Qian, Pengyuan Zhang
Classification: eess.AS cs.CL cs.SD
Publication Date: January 1, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2501.00805

Abstract

Recent "textless" speech language models (SLMs) based on speech units have made significant progress in generating natural speech, including non-verbal vocalizations. However, generated speech samples often lack semantic coherence. This paper proposes SLIDE (SLM and LLM Integration for spontaneous spoken Dialogue gEneration). Specifically, the method first uses an LLM to generate the textual content of a spoken dialogue, then converts the textual dialogue into phoneme sequences, uses a dual-tower transformer-based duration predictor to predict the duration of each phoneme, and finally employs an SLM conditioned on the spoken phoneme sequences to vocalize the textual dialogue. Experimental results on the Fisher dataset demonstrate that the system can generate natural spoken dialogue while maintaining high semantic coherence.

Research Background and Motivation

Problem Definition

This research addresses the central tension in spontaneous spoken dialogue generation: how to keep the content semantically coherent while keeping the speech natural. Spoken dialogue has two key aspects:

Semantic Aspect: The meaningfulness of dialogue content, which is crucial for conveying accurate and relevant information
Naturalness Aspect: The fluency of turn-taking, including inter-pausal units (IPUs), overlaps, gaps, and pauses, as well as natural dialogue events such as laughter and backchannel feedback

Limitations of Existing Approaches

Traditional Cascaded Systems: While offering strong semantic coherence (benefiting from LLMs trained on hundreds of billions of words), they have limited ability to generate natural dialogue because:
- They do not account for turn-taking events within any component
- They struggle to generate natural dialogue containing laughter and feedback
- The intermediate stage of encoding speech as text loses paralinguistic information
SLM-based Methods (e.g., dGSLM): Effectively capture dialogue elements and turn-taking patterns but face semantic coherence challenges:
- Speech units are very fine-grained (typically 20 ms each), which makes it hard to model semantic content over long contexts
- Such fine-grained units greatly increase the amount of training data required

Research Motivation

This paper proposes a hybrid approach that leverages text to capture semantic context while using speech units to preserve paralinguistic information (such as non-verbal vocalizations and turn-taking patterns), aiming to combine the advantages of traditional cascaded systems and SLM-based systems.

Core Contributions

Integration of LLM into Spoken Dialogue Generation Framework: Leverages LLM to generate textual dialogue, fully utilizing the advanced text generation capabilities of LLMs
Proposed Dual-Tower Transformer-based Phoneme Duration Prediction: Uses a dual-tower transformer model to predict the duration of each phoneme in written phoneme sequences, ensuring the preservation of turn-taking fluency
Conditioned dGSLM on Spoken Phoneme Sequences: Conditions dGSLM with spoken phoneme sequences derived from textual dialogue, effectively incorporating natural dialogue events into generated speech while maintaining semantic coherence

Methodology Details

Task Definition

Input: Prompt dialogue audio
Output: Semantically coherent and natural spoken dialogue continuation
Constraints: Generated dialogue must satisfy both semantic coherence and naturalness (including turn-taking, non-verbal vocalizations, etc.)

Model Architecture

The SLIDE model comprises three main components:

1. LLM-Driven Textual Dialogue Generation

Uses a speech recognition model (Whisper-v3) to transcribe prompt dialogue audio into text
Leverages an LLM (GPT-4o) to generate dialogue continuation, guiding it to produce colloquial-style dialogue
Excludes dialogue event markers (such as laughter, sigh), focusing on verbal feedback like "yeah," "right," "okay"
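The text-side preprocessing described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes event markers appear as bracketed tags such as "[laughter]" (the marker format is not stated in this summary), and the function names and prompt wording are hypothetical. The actual Whisper and GPT-4o calls are left out.

```python
import re

# Assumption: event markers look like "[laughter]" or "[sigh]".
EVENT_MARKER = re.compile(r"\[[a-z_ ]+\]", re.IGNORECASE)


def strip_event_markers(turn: str) -> str:
    """Remove bracketed event markers and collapse leftover whitespace."""
    without_markers = EVENT_MARKER.sub(" ", turn)
    return " ".join(without_markers.split())


def build_continuation_prompt(turns):
    """Format transcribed (speaker, text) turns into a prompt for an LLM.

    The instruction wording is hypothetical; it only mirrors the stated goal
    of a colloquial continuation that keeps verbal feedback words.
    """
    instruction = (
        "Continue this phone conversation in a casual, spoken style. "
        "Keep short verbal feedback such as 'yeah', 'right', 'okay'."
    )
    lines = [f"{speaker}: {strip_event_markers(text)}" for speaker, text in turns]
    return instruction + "\n\n" + "\n".join(lines)


prompt = build_continuation_prompt([
    ("A", "so I moved there last year [laughter]"),
    ("B", "yeah [sigh] okay"),
])
```

Verbal feedback words pass through untouched because only bracketed tags are removed.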

2. Dual-Tower Transformer-based Written Phoneme Sequence Duration Prediction

Obtains training data using forced alignment models on ground truth transcriptions in the training dataset
Introduces additional silence phonemes, repeating each phoneme according to durations determined by forced alignment
Training phase: Uses teacher forcing, with a loss function combining the edge unit loss and the edge duration loss
Inference phase: Performs unconditional generation, with a replacement mechanism that keeps the generated sequence consistent with the written phoneme sequence
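The repetition step, where each phoneme is repeated according to its duration, can be illustrated with a short sketch. It assumes one token per 20 ms frame (the unit granularity cited earlier in this summary), uses "SIL" as a hypothetical silence symbol, and takes durations as given inputs; in SLIDE they would come from forced alignment during training or from the duration predictor at inference.

```python
FRAME_MS = 20  # assumed frame length: one token per 20 ms


def expand_to_spoken_sequence(phonemes, durations_ms, frame_ms=FRAME_MS):
    """Repeat each written phoneme once per frame it occupies.

    phonemes:     written phoneme sequence, including silence symbols
    durations_ms: one duration per phoneme, in milliseconds
    """
    if len(phonemes) != len(durations_ms):
        raise ValueError("need exactly one duration per phoneme")
    spoken = []
    for phoneme, duration in zip(phonemes, durations_ms):
        # Every phoneme keeps at least one frame so none disappears.
        n_frames = max(1, round(duration / frame_ms))
        spoken.extend([phoneme] * n_frames)
    return spoken


written = ["SIL", "HH", "AY", "SIL"]
durations = [60, 40, 100, 20]
spoken = expand_to_spoken_sequence(written, durations)
```

The frame-level output is what makes the phoneme stream time-aligned with the audio tokens.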

3. dGSLM Speech Dialogue Generation Conditioned on Spoken Phoneme Sequences

Training phase: Uses HuBERT encoder to encode spoken dialogue into audio tokens, with concatenated spoken phoneme sequences and audio tokens as input and training targets
Each dialogue sample is segmented into 80-second intervals, each represented by 8000 discrete tokens (the first 4000 are spoken phoneme tokens, the last 4000 are audio tokens)
Inference phase: Adjusts spoken phoneme sequences to fixed length of 4000 tokens, autoregressively generates audio tokens
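The token counts above follow from simple arithmetic: 80 seconds at 20 ms per token gives 4000 tokens per stream. The sketch below works this out and shows one possible way to fit a phoneme sequence to the fixed length. Truncation plus right-padding is an assumption (the summary only says the length is "adjusted"), and the "<pad>" symbol is hypothetical.

```python
SEGMENT_SECONDS = 80
FRAME_MS = 20  # 20 ms per token, as cited for speech units
TOKENS_PER_STREAM = SEGMENT_SECONDS * 1000 // FRAME_MS

PAD = "<pad>"  # hypothetical padding symbol


def fit_to_length(tokens, length=TOKENS_PER_STREAM, pad=PAD):
    """Truncate or right-pad a token list to exactly `length` items."""
    return tokens[:length] + [pad] * (length - len(tokens))


def build_training_sequence(spoken_phonemes, audio_tokens):
    """Concatenate the phoneme half and the audio half of one segment."""
    return fit_to_length(spoken_phonemes) + fit_to_length(audio_tokens)


sequence = build_training_sequence(["SIL"] * 3500, [17] * 4200)
```

At inference only the first half is supplied, and the model generates the audio half autoregressively.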

Technical Innovations

Hybrid Representation Strategy: Innovatively combines the semantic modeling capability of text with the prosodic/paralinguistic modeling capability of speech units
Conditioned Generation Mechanism: Constrains dGSLM output through conditioning on spoken phoneme sequences, ensuring semantic coherence of generated dialogue
Temporal Alignment Processing: Maintains temporal correspondence between phoneme sequences and audio through duration prediction and repetition mechanisms

Experimental Setup

Dataset

Fisher Dataset: 2000 hours of stereo telephone conversation audio, recorded at 8 kHz and resampled to 16 kHz
Each dialogue sample is segmented into 80-second intervals for training

Evaluation Metrics

Objective Evaluation

Naturalness Assessment:
- Temporal distribution statistics of turn-taking events (IPUs, overlaps, gaps, pauses)
- Relevant statistics computed using pyannote.audio
Semantic Coherence Assessment:
- Transcribes generated spoken dialogue using Whisper-v3
- Computes perplexity of text transcriptions using DialoGPT
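For reference, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below computes it from a list of per-token natural-log probabilities. In the evaluation described here those would come from DialoGPT scoring the Whisper transcript; this example passes them in directly, and how the paper aggregates across dialogues is not stated in this summary.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(-(1/N) * sum of natural-log token probabilities)."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that gives every token probability 0.25 is as uncertain as a
# uniform choice among four options.
uniform_four = perplexity([math.log(0.25)] * 4)
```

Lower values mean the language model finds the transcript more predictable, which is why the results table marks perplexity with a down arrow.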

Subjective Evaluation

N-MOS (Naturalness Mean Opinion Score): Assesses natural dialogue events and turn-taking fluency
M-MOS (Meaningfulness Mean Opinion Score): Assesses logical consistency and meaningfulness of dialogue
Score range: 1-5, with at least 5 raters per sample

Comparison Methods

Cascaded System: Traditional cascaded approach (ASR+LLM+TTS)
dGSLM: Original generative spoken dialogue language model
SLIDE-1: Directly uses textual dialogue from test dataset
SLIDE-2: Uses textual dialogue generated by LLM

Implementation Details

Trained on 6 A100 40GB GPUs
Duration predictor: batch size 48, 50,000 training steps
Conditioned dGSLM: batch size 96, 250,000 training steps
Generation temperature set to 1

Experimental Results

Main Results

Turn-taking Event Statistics

Model	IPUs/min	Pauses/min	Gaps/min	Overlaps/min
Cascaded	17.5	0.0	14.9	0.0
dGSLM	30.6	12.0	9.0	8.7
SLIDE-1	25.6	9.4	5.6	9.5
SLIDE-2	31.3	6.3	7.6	15.8
Ground Truth	27.3	9.9	8.9	8.2
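Per-minute event rates like those in the table can be derived from per-speaker voice activity segments. The sketch below is a simplified illustration under stated assumptions, not the paper's evaluation code: it takes (start, end) segments in seconds as input (for example from a voice activity detector), merges segments of one speaker separated by less than 0.2 s into a single IPU (the threshold is an assumption), counts a joint silence as a pause when the same speaker talks before and after it and as a gap otherwise, and counts an overlap for each stretch where both speakers talk.

```python
MIN_SILENCE = 0.2  # seconds; assumed threshold separating two IPUs


def merge_into_ipus(segments, min_silence=MIN_SILENCE):
    """Merge one speaker's voiced segments separated by short silences."""
    ipus = []
    for start, end in sorted(segments):
        if ipus and start - ipus[-1][1] < min_silence:
            ipus[-1] = (ipus[-1][0], max(ipus[-1][1], end))
        else:
            ipus.append((start, end))
    return ipus


def find_overlaps(ipus_a, ipus_b):
    """Return stretches where both speakers are talking."""
    found = []
    for start_a, end_a in ipus_a:
        for start_b, end_b in ipus_b:
            start, end = max(start_a, start_b), min(end_a, end_b)
            if start < end:
                found.append((start, end))
    return found


def find_joint_silences(ipus_a, ipus_b):
    """Return (speaker_before, speaker_after) for each stretch of silence."""
    labelled = sorted(
        [(s, e, "A") for s, e in ipus_a] + [(s, e, "B") for s, e in ipus_b]
    )
    silences = []
    if not labelled:
        return silences
    _, current_end, current_speaker = labelled[0]
    for start, end, speaker in labelled[1:]:
        if start > current_end:  # nobody was talking in between
            silences.append((current_speaker, speaker))
        if end > current_end:  # this IPU now ends last
            current_end, current_speaker = end, speaker
    return silences


def turn_taking_rates(vad_a, vad_b, total_seconds):
    """Return events per minute for IPUs, pauses, gaps, and overlaps."""
    ipus_a, ipus_b = merge_into_ipus(vad_a), merge_into_ipus(vad_b)
    silences = find_joint_silences(ipus_a, ipus_b)
    counts = {
        "ipus": len(ipus_a) + len(ipus_b),
        "pauses": sum(before == after for before, after in silences),
        "gaps": sum(before != after for before, after in silences),
        "overlaps": len(find_overlaps(ipus_a, ipus_b)),
    }
    minutes = total_seconds / 60
    return {name: count / minutes for name, count in counts.items()}


vad_a = [(0.0, 1.0), (1.1, 2.0), (3.0, 4.0), (7.5, 8.5)]
vad_b = [(3.8, 5.0), (6.0, 7.0)]
rates = turn_taking_rates(vad_a, vad_b, total_seconds=60)
```

With these toy segments the one-minute rates are 5 IPUs, 2 pauses, 1 gap, and 1 overlap. A system that never lets speakers overlap or pause mid-turn, like the cascaded baseline in the table, scores zero on both of those columns.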

Semantic Coherence and Subjective Evaluation

Model	Perplexity ↓	N-MOS ↑	M-MOS ↑
Cascaded	-	2.38±0.63	2.70±0.38
dGSLM	1228.82	4.14±0.78	1.52±0.40
SLIDE-1	532.81	4.37±0.46	3.94±0.81
SLIDE-2	421.29	4.06±0.41	4.08±0.49
Ground Truth	371.16	4.72±0.40	4.63±0.44

Key Findings

Significant Improvement in Semantic Coherence: SLIDE-2 achieves a 65.7% reduction in perplexity compared to dGSLM (from 1228.82 to 421.29), approaching the level of real dialogue (371.16)
Preservation of Naturalness: Both SLIDE variants produce pauses, gaps, and overlaps at rates much closer to ground truth than the cascaded system, which produces no pauses or overlaps at all; SLIDE-2 does overshoot on overlaps (15.8/min vs. 8.2/min in ground truth)
Substantial Improvement in Meaningfulness: SLIDE-2's M-MOS is 168.4% higher than dGSLM's (4.08 vs. 1.52, about 2.7 times), with only an 11.9% relative gap from real dialogue
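The relative figures in these findings can be recomputed directly from the results tables. The short check below uses only the table values; the helper name is illustrative. Note that the M-MOS ratio of 4.08 to 1.52 is about 2.68, which corresponds to a relative increase of roughly 168%.

```python
def relative_change_percent(new, old):
    """Signed percentage change from `old` to `new`."""
    return (new - old) / old * 100


# Perplexity: dGSLM 1228.82 -> SLIDE-2 421.29 (lower is better)
perplexity_reduction = -relative_change_percent(421.29, 1228.82)

# M-MOS: dGSLM 1.52 -> SLIDE-2 4.08 (higher is better)
mmos_gain = relative_change_percent(4.08, 1.52)
mmos_ratio = 4.08 / 1.52

# M-MOS gap between SLIDE-2 (4.08) and ground truth (4.63)
mmos_gap_to_ground_truth = -relative_change_percent(4.08, 4.63)

print(round(perplexity_reduction, 1))      # 65.7
print(round(mmos_gain, 1))                 # 168.4
print(round(mmos_ratio, 2))                # 2.68
print(round(mmos_gap_to_ground_truth, 1))  # 11.9
```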

Ablation Study

The comparison between SLIDE-1 (text taken from the test set) and SLIDE-2 (text generated by the LLM) shows that semantic coherence holds up when the text comes from the LLM rather than from ground truth transcriptions: SLIDE-2 reaches lower perplexity (421.29 vs. 532.81) and higher M-MOS (4.08 vs. 3.94), at the cost of a lower N-MOS (4.06 vs. 4.37).

Main Directions in Spoken Dialogue Generation

Traditional Cascaded Methods: ASR→LLM→TTS pipeline, strong semantics but poor naturalness
SLM-based Methods: Such as dGSLM, strong naturalness but poor semantic coherence
Hybrid Methods: SLIDE proposed in this paper belongs to this emerging direction

Advantages of This Work

Compared to existing work, SLIDE balances semantic coherence and naturalness rather than trading one for the other: the text from the LLM carries the semantics, and conditioning dGSLM on spoken phoneme sequences preserves turn-taking and dialogue events.

Conclusions and Discussion

Main Conclusions

SLIDE successfully combines the semantic modeling capability of LLMs with the prosodic modeling capability of SLMs, significantly improving semantic coherence while maintaining the naturalness of spoken dialogue, providing a new solution for spontaneous spoken dialogue generation.

Limitations

Computational Complexity: Requires training multiple model components with high computational cost
Data Dependency: Still requires large-scale spoken dialogue data for training
Domain Adaptability: Trained on Fisher dataset; generalization ability to other domains remains to be verified
Real-time Performance: Multi-stage processing may impact response speed for real-time dialogue generation

Future Directions

Explore end-to-end joint training strategies
Investigate more lightweight model architectures
Extend to multilingual and cross-domain scenarios
Improve efficiency for real-time dialogue generation

In-Depth Evaluation

Strengths

Strong Innovation: First to propose a hybrid architecture combining LLM and SLM, addressing the long-standing trade-off between semantic coherence and naturalness
Reasonable Method Design: Clear three-stage pipeline design with well-defined component functions and feasible technical approach
Comprehensive Experiments: Includes both objective and subjective evaluation, comprehensive comparison methods, and ablation studies validating design effectiveness
Significant Results: Achieves substantial improvement in semantic coherence (65.7% perplexity reduction relative to dGSLM) while maintaining naturalness

Shortcomings

System Complexity: Multi-stage pipeline increases system complexity, potentially affecting practicality and robustness
Computational Efficiency: Requires running multiple large models with high computational cost, presenting challenges for real-time applications
Error Propagation: Pipeline architecture may suffer from error accumulation, with errors from earlier stages affecting subsequent processing
Generalization Ability: Validated only on Fisher dataset; cross-domain and multilingual generalization capabilities remain unknown

Impact

Academic Value: Provides new research direction for spoken dialogue generation field, balancing semantic and prosodic modeling
Practical Potential: Has practical value in virtual assistants, dialogue systems, and other applications
Reproducibility: Provides implementation details and open-source code, facilitating reproduction and improvement

Applicable Scenarios

Dialogue Systems: Intelligent assistants requiring generation of natural and meaningful spoken responses
Speech Synthesis: Dialogue-style TTS systems requiring high naturalness
Education and Training: Spoken dialogue training and language learning applications
Entertainment Media: Games, virtual characters, and other scenarios requiring natural dialogue

References

This paper cites 34 relevant references covering important works in speech language models, large language models, dialogue generation, speech synthesis, and other related fields, providing a solid theoretical foundation for the research.

Overall Assessment: This is a high-quality research paper that innovatively addresses key challenges in spoken dialogue generation. While facing challenges in system complexity and computational efficiency, its technical contributions and experimental results are highly convincing, providing valuable new insights for the development of this field.