Continuous-Token Diffusion for Speaker-Referenced TTS in Multimodal LLMs

He, Ray, Mallidi et al.

Unified architectures in multimodal large language models (MLLM) have shown promise in handling diverse tasks within a single framework. In the text-to-speech (TTS) task, current MLLM-based approaches rely on discrete token representations, which disregard the inherently continuous nature of speech and can lead to loss of fine-grained acoustic information. In this work, we investigate TTS within the MLLM paradigm using continuous speech representations. We design a dual-head architecture and implement two complementary training strategies for a robust model. (1) A diffusion head generating continuous speech representations is added to the MLLM; it operates at the frame level and is strictly autoregressive. (2) The original language model head is retained to preserve multitask capability and to control the start and end of speech synthesis. (3) Masked training is employed to address exposure bias in autoregressive decoding. (4) To stabilize optimization, we propose a two-stage scheme where the LM is frozen in the second stage, ensuring the diffusion head learns from a fixed input distribution. Evaluations on LibriSpeech(PC) test-clean show that our approach achieves state-of-the-art autoregressive performance, with a WER of 1.95%, speaker similarity of 0.54, and UTMOS of 4.00. The two-stage training yields a 46% relative WER reduction over the one-stage training baseline. These results highlight the effectiveness of combining autoregressive modeling with continuous-token diffusion, supported by a two-stage training procedure.

Basic Information

Paper ID: 2510.12995
Title: Continuous-Token Diffusion for Speaker-Referenced TTS in Multimodal LLMs
Authors: Xinlu He¹, Swayambhu Nath Ray², Harish Mallidi², Jia-Hong Huang², Ashwin Bellur², Chander Chandak², M. Maruf², Venkatesh Ravichandran²
Affiliations: ¹Worcester Polytechnic Institute, USA ²Amazon AGI, USA
Classification: eess.AS cs.SD
Venue: NeurIPS 2025 Workshop: Structured Probabilistic Inference & Generative Modeling (SPIGM)
Paper Link: https://arxiv.org/abs/2510.12995

Abstract

Unified multimodal large language model (MLLM) architectures have demonstrated promise in handling diverse tasks within a single framework. For text-to-speech (TTS) synthesis, current MLLM-based approaches rely on discrete token representations, which overlook the inherently continuous nature of speech and may lose fine-grained acoustic information. This work investigates TTS using continuous speech representations within the MLLM paradigm. A dual-head architecture is designed and two complementary training strategies are implemented to construct a robust model. The approach achieves state-of-the-art autoregressive performance on LibriSpeech(PC) test-clean with a WER of 1.95%, speaker similarity of 0.54, and UTMOS of 4.00.

Research Background and Motivation

Problem Definition

Current MLLM-based TTS methods face the following challenges:

Discretization Loss: Existing methods convert speech into discrete tokens, ignoring the continuous nature of speech and resulting in loss of fine-grained acoustic information
Quantization Bottleneck: Discrete quantization discards fine acoustic details, limiting speech naturalness and fidelity
Lack of Unified Framework: Absence of effective methods to generate high-quality continuous speech while maintaining MLLM's multi-task capabilities

Research Significance

Technical Necessity: With the advancement of multimodal AI, there is a need to handle text and speech tasks within a unified framework
Quality Enhancement: Continuous representations better preserve intrinsic properties of speech, improving synthesis quality
Application Value: Zero-shot speaker cloning technology has significant applications in personalized speech synthesis

Limitations of Existing Methods

Multi-stage Systems: Methods like VALL-E require multiple stages, increasing complexity
Information Loss: Discrete encoding loses fine-grained acoustic information
Training Instability: Joint optimization of diffusion models and LLMs suffers from distribution shift issues

Core Contributions

Novel Architecture: Proposes frame-level continuous token diffusion head integrated into autoregressive MLLM framework, distinct from existing chunk-level multi-frame designs
Dual-Head Design: Designs a dual-head architecture maintaining unified multimodal framework, with LM head supporting variable-length speech synthesis
Training Strategy: Employs masked training to mitigate autoregressive exposure bias, improving temporal consistency and model robustness
Optimization Scheme: Proposes two-stage training strategy stabilizing the optimization process, achieving 46% relative WER reduction and SOTA autoregressive performance on LibriSpeech(PC)

Methodology Details

Task Definition

Input: Text transcription and reference audio segment
Output: High-quality speech with specified speaker characteristics
Constraint: Implementation within unified MLLM framework while maintaining multi-task capabilities

Model Architecture

Overall Design

The model employs a dual-head architecture based on OPT-125M as the LLM backbone:

Diffusion Head: Generates continuous speech embeddings
Language Model Head: Predicts speech boundaries and control tokens
Multimodal Projection: Handles representation transformation across modalities

Continuous Token Generation

The target sequence is $x = \{x_1, \ldots, x_N\}$, where $x_i \in \mathbb{R}^d$ is the speech embedding of the $i$-th frame.

Inference Process:

$z_i = C_\theta(p, \hat{x}_{<i})$ (the LLM generates the conditioning vector)
$\hat{x}_i = \mathrm{Diffusion}_\phi(z_i)$ (the diffusion head generates the speech embedding)

Training Process: Employs standard DDPM training with loss function:

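$\mathcal{L}_{\text{diff}}(\theta, \phi) = \mathbb{E}_{t, \varepsilon}\left[\lVert \varepsilon - \hat{\varepsilon} \rVert^2\right]$

where the noise prediction is $\hat{\varepsilon} = M_\phi(x_i^t, t, z_i)$, $x_i^t$ is frame $x_i$ noised to diffusion step $t$, and $\varepsilon$ is the Gaussian noise that was added.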
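
The per-frame objective can be illustrated with a minimal sketch. It assumes the standard DDPM forward process $x_i^t = \sqrt{\bar{\alpha}_t}\, x_i + \sqrt{1 - \bar{\alpha}_t}\, \varepsilon$, which the summary does not spell out, and `predict_noise` is a hypothetical stand-in for $M_\phi$ rather than the authors' network.

```python
import math
import random

def noisy_frame(x0, eps, alpha_bar_t):
    """Forward-diffuse one clean frame x0 with noise eps (standard DDPM form)."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [a * x + b * e for x, e in zip(x0, eps)]

def diffusion_loss(x0, z, t, alpha_bar_t, predict_noise, rng):
    """Squared error between the sampled noise and the predicted noise."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]   # Gaussian noise, one value per dim
    x_t = noisy_frame(x0, eps, alpha_bar_t)   # noised frame x_i^t
    eps_hat = predict_noise(x_t, t, z)        # hypothetical stand-in for M_phi
    return sum((e - eh) ** 2 for e, eh in zip(eps, eps_hat))
```

In training, this quantity would be averaged over frames, timesteps, and noise draws; a predictor that recovers the added noise exactly drives it to zero.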

EOS Control Mechanism

Introduces special tokens for boundary control:

<speech_bos>: Triggers speech generation phase
<cont_speech_gen>: Continues speech frame generation
<eos>: Terminates speech generation

Total loss function:

$\mathcal{L} = \mathcal{L}_{\text{LM}} + \mathcal{L}_{\text{diff}}$
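
How the two heads might interact at inference time can be sketched as follows. The callables `backbone`, `lm_head`, and `diffusion_head` are hypothetical stand-ins, the prompt is assumed to already end with `<speech_bos>`, and checking the LM head before emitting each frame is an assumed ordering that the summary does not confirm.

```python
CONT = "<cont_speech_gen>"
EOS = "<eos>"

def synthesize(prompt, backbone, lm_head, diffusion_head, max_frames=1000):
    """Frame-level autoregressive decoding with LM-head stop control.

    backbone(prompt, frames) -> conditioning vector z_i  (stand-in for C_theta)
    lm_head(z_i)             -> control token string
    diffusion_head(z_i)      -> one continuous frame     (stand-in for Diffusion_phi)
    """
    frames = []
    while len(frames) < max_frames:          # safety cap, an assumption
        z = backbone(prompt, frames)         # condition on all previous frames
        if lm_head(z) == EOS:                # LM head decides when speech ends
            break
        frames.append(diffusion_head(z))     # otherwise emit one more frame
    return frames
```

Because the stop decision comes from the LM head, the output length is not fixed in advance, which is what allows variable-length synthesis.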

Technical Innovations

1. Masked Autoregressive Learning

To mitigate exposure bias, employs masked training strategy:

Randomly masks historical frames with probability $p_{mask}$
Replaces masked frames with zero vectors
Trains model to handle imperfect historical information
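
The masking step listed above can be sketched as a small function. It is a minimal illustration, assuming frames are masked independently and that only the history fed to the LLM is masked while the diffusion targets stay clean; the function name is hypothetical.

```python
import random

def mask_history(frames, p_mask, rng):
    """Replace each historical frame with a zero vector with probability p_mask."""
    masked = []
    for frame in frames:
        if rng.random() < p_mask:
            masked.append([0.0] * len(frame))   # masked frame becomes all zeros
        else:
            masked.append(list(frame))          # unmasked frame is copied as-is
    return masked
```

With `p_mask=0.3`, roughly three in ten history frames are zeroed per training example, matching the best ratio in the ablation.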

2. Two-Stage Training

Stage 1: Joint training of MLLM and diffusion head
Stage 2: Freezes MLLM, trains only diffusion head

This design addresses distribution shift issues and stabilizes the training process.
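
The effect of freezing can be shown with a toy update rule. This sketch uses plain SGD for brevity (the paper reports Adam), and the group names `mllm` and `diffusion_head` are illustrative labels, not identifiers from the authors' code.

```python
def trainable_groups(stage):
    """Which parameter groups receive updates in each training stage."""
    if stage == 1:
        return {"mllm": True, "diffusion_head": True}    # joint training
    if stage == 2:
        return {"mllm": False, "diffusion_head": True}   # MLLM frozen
    raise ValueError(f"unknown stage: {stage}")

def sgd_step(params, grads, trainable, lr=0.1):
    """Apply one SGD update, leaving frozen parameter groups untouched."""
    updated = {}
    for name, weights in params.items():
        if trainable[name]:
            updated[name] = [w - lr * g for w, g in zip(weights, grads[name])]
        else:
            updated[name] = list(weights)
    return updated
```

In stage 2 the MLLM weights never move, so the conditioning vectors $z_i$ for a given input stay the same from step to step, which is the fixed input distribution the diffusion head learns from.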

Experimental Setup

Datasets

Training Data: 50k hour subset of LibriVox corpus (from Libri-Light)
Evaluation Data: LibriSpeech(PC) test-clean dataset
Evaluation Protocol: Randomly selects 40 speakers, one utterance per speaker, with 3-second reference audio

Evaluation Metrics

Intelligibility: Word Error Rate (WER) - computed using Whisper-Large transcription
Speaker Similarity: Computed using ECAPA-TDNN embedding cosine similarity
- SIM-R: Similarity with reference audio
- SIM-G: Similarity with ground truth speech
Speech Quality: UTMOS - MOS predictor trained on large-scale human ratings
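
For reference, the two closed-form metrics can be computed as in the sketch below. It assumes whitespace tokenization and a non-empty reference, and it omits the text normalization that usually precedes WER scoring; in the paper the hypotheses come from Whisper-Large and the vectors from ECAPA-TDNN, neither of which is modeled here.

```python
import math

def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

def cosine_similarity(a, b):
    """Cosine of the angle between two speaker-embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

SIM-R and SIM-G use the same cosine; they differ only in whether the second embedding comes from the reference prompt or from the ground-truth utterance.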

Baseline Methods

VALL-E: Discrete token method (400M parameters)
MegaTTS: Continuous token method (500M parameters)
Voicebox: Non-autoregressive continuous method (400M parameters)
StyleTTS2: Non-autoregressive continuous method (700M parameters)

Implementation Details

Backbone Network: OPT-125M
Speech Representation: 64-dimensional VAE embedding, 25fps
Speaker Embedding: 768-dimensional LAM embedding
Diffusion Parameters: T=1000 steps training, 100 steps inference, cosine noise schedule
Optimizer: Adam, no weight decay, FP16 mixed precision
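
The diffusion settings can be made concrete with a short sketch. It assumes the widely used cosine schedule with offset `s = 0.008` and an evenly strided choice of the 100 inference steps; the summary states only "cosine noise schedule" and the step counts, so both details are assumptions.

```python
import math

def cosine_alpha_bar(t, T=1000, s=0.008):
    """Cumulative signal level alpha_bar_t under the cosine schedule."""
    def f(u):
        return math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)

def inference_timesteps(T=1000, n_steps=100):
    """Evenly strided subset of training timesteps, noisiest first."""
    stride = T // n_steps
    return list(range(T - 1, -1, -stride))
```

`cosine_alpha_bar` falls smoothly from 1 at `t = 0` to nearly 0 at `t = T`, and `inference_timesteps()` visits every tenth training step, which is where the tenfold reduction in sampling cost comes from.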

Experimental Results

Main Results

Method	Modeling	Token Type	Parameters	WER(%)↓	SIM↑	UTMOS↑
VALL-E	AR+NAR	Discrete	400M	6.11	0.47	3.68
MegaTTS	AR+NAR	Continuous	500M	2.32	0.53	4.02
Voicebox	NAR	Continuous	400M	2.14	0.48	3.73
StyleTTS2	NAR	Continuous	700M	2.49	0.38	3.94
This Work	AR	Continuous	160M	1.95	0.54	4.00

Key Findings:

Achieves the lowest WER (1.95%) and highest SIM (0.54) in the table with only 160M parameters
Achieves 46% relative WER reduction compared to Stage 1 baseline (3.61%→1.95%)
Outperforms the larger baselines on WER and SIM; its UTMOS of 4.00 is second to MegaTTS (4.02)
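
The 46% figure follows directly from the two WER values; a small check, with a hypothetical helper name:

```python
def relative_reduction(baseline, improved):
    """Fractional improvement of `improved` over `baseline`."""
    return (baseline - improved) / baseline

# Stage 1 WER 3.61% vs. Stage 2 WER 1.95%
print(f"{relative_reduction(3.61, 1.95):.1%}")  # prints 46.0%
```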

Ablation Studies

Masking Ratio Impact

Masking Ratio(%)	WER(%)↓	SIM-R↑	UTMOS↑
0	15.06	0.45	2.00
15	12.65	0.45	1.39
30	6.17	0.46	3.21
50	8.13	0.46	2.84

Finding: Among the ratios tested, 30% masking gives the lowest WER (6.17%) and the highest UTMOS (3.21), while SIM-R stays nearly flat (0.45 to 0.46)

Diffusion Head Depth Impact

MLP Layers	Stage 2 Fine-tuning	WER(%)↓	SIM-R↑	UTMOS↑
3	✗	6.17	0.46	3.10
6	✗	5.12	0.50	3.10
12	✗	3.61	0.49	3.21
12	✓	1.95	0.54	4.00

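Finding: Deepening the diffusion head from 3 to 12 MLP layers lowers WER from 6.17% to 3.61%, and Stage 2 fine-tuning lowers it further to 1.95% while raising SIM-R to 0.54 and UTMOS to 4.00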

Stopping Criterion Comparison

Stopping Criterion	WER(%)↓	SIM-R↑	UTMOS↑
GT-Dur.	29.36	0.48	2.55
GT-EP.	3.46	0.49	3.21
EOS Token	3.61	0.49	3.21

Finding: EOS token stopping (WER 3.61%) performs comparably to GT-EP. (3.46%) without requiring oracle information

Related Work

Zero-Shot TTS

Multi-stage Systems: VALL-E, SALAD and others employ multi-stage processing through semantic or codec tokens
Single-Stage Methods: MegaTTS, NaturalSpeech directly generate high-information continuous representations
This Work's Contribution: Achieves single-stage continuous speech generation within unified MLLM framework

Autoregressive Diffusion

Existing Methods: TransFusion and others attempt to combine autoregressive and diffusion approaches but face difficulties with strict causal generation
This Work's Innovation: Implements strict frame-level autoregressive continuous representation diffusion

Conclusions and Discussion

Main Conclusions

Effectiveness Validation: Continuous token diffusion outperforms the discrete-token baseline VALL-E for TTS within MLLM framework (WER 1.95% vs 6.11%)
Efficiency Advantage: Achieves superior performance with fewer parameters (160M vs 400-700M)
Training Strategy Importance: Two-stage training and masked learning are crucial for performance improvement

Limitations

Computational Complexity: Diffusion process requires multi-step inference, incurring significant computational overhead
Monolingual Constraint: Currently validated only on English data
Speaker Generalization: Generalization capability to unseen speakers requires further verification
Real-time Performance: 100-step diffusion inference may impact real-time applications

Future Directions

Multilingual Extension: Extend to multilingual TTS tasks
Inference Acceleration: Investigate faster diffusion sampling methods
Unified Framework: Integrate additional speech tasks (ASR, speech translation, etc.)
Long-form Synthesis: Improve stability for long-sequence speech synthesis

In-Depth Evaluation

Strengths

Technical Innovation:
- First to implement frame-level continuous diffusion within MLLM framework
- Ingenious dual-head architecture design maintaining unity
- Two-stage training effectively addresses distribution shift issues
Comprehensive Experiments:
- Thorough ablation studies validating component contributions
- Multi-dimensional evaluation metrics (intelligibility, similarity, quality)
- Fair comparison with multiple strong baselines
Convincing Results:
- Significant performance improvement (46% relative WER reduction)
- Clear parameter efficiency advantage
- Achieves SOTA autoregressive performance

Weaknesses

Method Complexity:
- Requires two-stage training, increasing training complexity
- Multiple hyperparameters requiring tuning (masking ratio, diffusion steps, etc.)
Experimental Limitations:
- Validation on single dataset only
- Lack of subjective evaluation experiments
- Insufficient inference speed analysis
Theoretical Analysis:
- Relatively simple theoretical explanation for two-stage training
- Absence of convergence analysis

Impact

Academic Value: Provides new technical pathway for continuous speech generation in MLLMs
Practical Value: Achieves high-quality speech synthesis while maintaining unified framework
Reproducibility: Detailed implementation descriptions facilitate reproduction

Applicable Scenarios

Personalized Voice Assistants: Zero-shot speaker cloning capability
Multimodal Dialogue Systems: Unified text and speech processing
Content Creation: High-quality speech content generation
Assistive Technology: Speech synthesis services for visually and hearing impaired users

References

The paper cites 42 relevant references covering key works in multimodal LLMs, diffusion models, speech synthesis and related domains, providing solid theoretical foundation for this research.

Overall Assessment: This is a high-quality research work on speech synthesis within multimodal large language model frameworks. The proposed continuous token diffusion method is technically innovative, experimental results are convincing, and it provides valuable contributions to the development of unified multimodal AI systems. Despite certain limitations, its technical approach and experimental validation establish a solid foundation for subsequent research in this field.