Continuous-Token Diffusion for Speaker-Referenced TTS in Multimodal LLMs

He, Ray, Mallidi et al.
Unified architectures in multimodal large language models (MLLMs) have shown promise in handling diverse tasks within a single framework. In the text-to-speech (TTS) task, current MLLM-based approaches rely on discrete token representations, which disregard the inherently continuous nature of speech and can lead to loss of fine-grained acoustic information. In this work, we investigate TTS within the MLLM paradigm using continuous speech representations. We design a dual-head architecture and implement two complementary training strategies for a robust model. (1) A diffusion head generating continuous speech representations is added to the MLLM; it operates at the frame level and is strictly autoregressive. (2) The original language model head is retained to preserve multitask capability and to control the start and end of speech synthesis. (3) Masked training is employed to address exposure bias in autoregressive decoding. (4) To stabilize optimization, we propose a two-stage scheme where the LM is frozen in the second stage, ensuring the diffusion head learns from a fixed input distribution. Evaluations on LibriSpeech(PC) test-clean show that our approach achieves state-of-the-art autoregressive performance, with a WER of 1.95%, speaker similarity of 0.54, and UTMOS of 4.00. The two-stage training yields a 46% relative WER reduction over the one-stage training baseline. These results highlight the effectiveness of combining autoregressive modeling with continuous-token diffusion, supported by a two-stage training procedure.

Basic Information

  • Paper ID: 2510.12995
  • Title: Continuous-Token Diffusion for Speaker-Referenced TTS in Multimodal LLMs
  • Authors: Xinlu He¹, Swayambhu Nath Ray², Harish Mallidi², Jia-Hong Huang², Ashwin Bellur², Chander Chandak², M. Maruf², Venkatesh Ravichandran²
  • Affiliations: ¹Worcester Polytechnic Institute, USA ²Amazon AGI, USA
  • Classification: eess.AS cs.SD
  • Venue: NeurIPS 2025 Workshop: Structured Probabilistic Inference & Generative Modeling (SPIGM)
  • Paper Link: https://arxiv.org/abs/2510.12995

Abstract

Unified multimodal large language model (MLLM) architectures have demonstrated promise in handling diverse tasks within a single framework. For text-to-speech (TTS) synthesis, current MLLM-based approaches rely on discrete token representations, which overlook the inherently continuous nature of speech and may result in loss of fine-grained acoustic information. This work investigates TTS using continuous speech representations within the MLLM paradigm. A dual-head architecture is designed and two complementary training strategies are implemented to construct a robust model. The approach achieves state-of-the-art autoregressive performance on LibriSpeech(PC) test-clean with a WER of 1.95%, speaker similarity of 0.54, and UTMOS of 4.00.

Research Background and Motivation

Problem Definition

Current MLLM-based TTS methods face the following challenges:

  1. Discretization Loss: Existing methods convert speech into discrete tokens, ignoring the continuous nature of speech and resulting in loss of fine-grained acoustic information
  2. Quantization Bottleneck: Discrete quantization discards fine acoustic details, limiting speech naturalness and fidelity
  3. Lack of Unified Framework: Absence of effective methods to generate high-quality continuous speech while maintaining MLLM's multi-task capabilities

Research Significance

  1. Technical Necessity: With the advancement of multimodal AI, there is a need to handle text and speech tasks within a unified framework
  2. Quality Enhancement: Continuous representations better preserve intrinsic properties of speech, improving synthesis quality
  3. Application Value: Zero-shot speaker cloning technology has significant applications in personalized speech synthesis

Limitations of Existing Methods

  1. Multi-stage Systems: Methods like VALL-E require multiple stages, increasing complexity
  2. Information Loss: Discrete encoding loses fine-grained acoustic information
  3. Training Instability: Joint optimization of diffusion models and LLMs suffers from distribution shift issues

Core Contributions

  1. Novel Architecture: Proposes frame-level continuous token diffusion head integrated into autoregressive MLLM framework, distinct from existing chunk-level multi-frame designs
  2. Dual-Head Design: Designs a dual-head architecture maintaining unified multimodal framework, with LM head supporting variable-length speech synthesis
  3. Training Strategy: Employs masked training to mitigate autoregressive exposure bias, improving temporal consistency and model robustness
  4. Optimization Scheme: Proposes two-stage training strategy stabilizing the optimization process, achieving 46% relative WER reduction and SOTA autoregressive performance on LibriSpeech(PC)

Methodology Details

Task Definition

  • Input: Text transcription and reference audio segment
  • Output: High-quality speech with specified speaker characteristics
  • Constraint: Implementation within unified MLLM framework while maintaining multi-task capabilities

Model Architecture

Overall Design

The model employs a dual-head architecture based on OPT-125M as the LLM backbone:

  1. Diffusion Head: Generates continuous speech embeddings
  2. Language Model Head: Predicts speech boundaries and control tokens
  3. Multimodal Projection: Handles representation transformation across modalities

Continuous Token Generation

Given target sequence x = {x_1, ..., x_N}, where x_i ∈ ℝ^d represents the speech embedding of the i-th frame.

Inference Process:

z_i = C_θ(p, x̂_{<i})  # LLM generates conditional vector
x̂_i = Diffusion_φ(z_i)  # Diffusion head generates speech embedding

Training Process: Employs standard DDPM training with loss function:

L_diff(θ,φ) = E_t[||ε - ε̂||²]

where the noise prediction is ε̂ = M_φ(x_i^t, t, z_i)
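The loss above can be sketched for a single frame as follows. This is a minimal illustration, not the paper's implementation: the exact cosine schedule formula (with offset `s`), the helper names, the 8-dimensional conditioning vector, and the `zero_predictor` placeholder for M_φ are all assumptions made for the example.

```python
import math
import random

T = 1000  # training diffusion steps, as listed under Implementation Details


def _f(t, s=0.008):
    # Assumed cosine schedule shape; the summary only says "cosine noise schedule".
    return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2


def alpha_bar(t):
    """Cumulative signal level at step t (1.0 at t=0, close to 0 at t=T)."""
    return _f(t) / _f(0)


def add_noise(x0, t, eps):
    """Forward process: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    a = alpha_bar(t)
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]


def diffusion_loss(x0, z, predict_noise, rng):
    """One-sample estimate of ||eps - eps_hat||^2 for one frame."""
    t = rng.randint(1, T)                      # random diffusion step
    eps = [rng.gauss(0.0, 1.0) for _ in x0]    # target noise
    x_t = add_noise(x0, t, eps)                # noised frame x_i^t
    eps_hat = predict_noise(x_t, t, z)         # stands in for M_phi(x_i^t, t, z_i)
    return sum((e - eh) ** 2 for e, eh in zip(eps, eps_hat))


def zero_predictor(x_t, t, z):
    """Hypothetical placeholder for M_phi that always predicts zero noise."""
    return [0.0] * len(x_t)


rng = random.Random(0)
frame = [rng.gauss(0.0, 1.0) for _ in range(64)]  # one 64-dim speech embedding
cond = [0.0] * 8                                  # hypothetical conditioning vector z_i
loss = diffusion_loss(frame, cond, zero_predictor, rng)
```

In a real system `predict_noise` would be the trained diffusion head, and the loss would be averaged over frames, steps, and noise samples.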

EOS Control Mechanism

Introduces special tokens for boundary control:

  • <speech_bos>: Triggers speech generation phase
  • <cont_speech_gen>: Continues speech frame generation
  • <eos>: Terminates speech generation

Total loss function:

L = L_LM + L_diff
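The interplay of the two heads at inference time can be sketched as a simple loop. This is a hedged illustration only: the order in which the heads are queried, the `max_frames` safety cap, and the toy stand-in functions are assumptions for the example, not details taken from the paper.

```python
import random

EOS = "<eos>"
CONT = "<cont_speech_gen>"


def generate_speech(prompt, backbone, lm_head, diffusion_head, max_frames=500):
    """Frame-level autoregressive loop with one backbone step per frame."""
    frames = []
    while len(frames) < max_frames:          # hypothetical safety cap
        z = backbone(prompt, frames)         # z_i = C_theta(p, x_hat_{<i})
        if lm_head(z) == EOS:                # LM head ends synthesis
            break
        frames.append(diffusion_head(z))     # x_hat_i = Diffusion_phi(z_i)
    return frames


# Toy stand-ins (not the paper's networks) so the loop can be exercised.
rng = random.Random(0)


def toy_backbone(prompt, frames):
    return [float(len(frames))]              # "conditioning" = frames generated so far


def toy_lm_head(z):
    return EOS if z[0] >= 5 else CONT        # stop after five frames


def toy_diffusion_head(z):
    return [rng.gauss(0.0, 1.0) for _ in range(64)]   # one 64-dim frame


frames = generate_speech("text + reference audio <speech_bos>",
                         toy_backbone, toy_lm_head, toy_diffusion_head)
```

Because the LM head decides when to stop, the output length is not fixed in advance, which is what allows variable-length synthesis.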

Technical Innovations

1. Masked Autoregressive Learning

To mitigate exposure bias, the model is trained with a masking strategy:

  • Randomly masks historical frames with probability p_mask
  • Replaces masked frames with zero vectors
  • Trains model to handle imperfect historical information
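The masking step can be sketched as below. It assumes each historical frame is masked independently, which the summary does not state explicitly; the function name and the toy history are illustrative. The value 0.3 matches the 30% ratio from the ablation study.

```python
import random


def mask_history(frames, p_mask, rng):
    """Replace each historical frame with a zero vector with probability p_mask."""
    masked = []
    for frame in frames:
        if rng.random() < p_mask:
            masked.append([0.0] * len(frame))   # masked frame becomes all zeros
        else:
            masked.append(list(frame))          # unmasked frame is copied unchanged
    return masked


rng = random.Random(0)
history = [[1.0] * 64 for _ in range(1000)]     # toy history of 64-dim frames
masked = mask_history(history, p_mask=0.3, rng=rng)
masked_fraction = sum(1 for f in masked if not any(f)) / len(masked)
```

With a long enough history, `masked_fraction` should land close to `p_mask`; the original frames are left untouched so the unmasked targets remain available for the loss.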

2. Two-Stage Training

  • Stage 1: Joint training of MLLM and diffusion head
  • Stage 2: Freezes MLLM, trains only diffusion head

With the MLLM frozen in Stage 2, the diffusion head learns from a fixed input distribution, which addresses the distribution shift of joint optimization and stabilizes training.
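The freezing mechanism can be illustrated with a deliberately tiny toy model. Everything here is an assumption for the sake of the example: the "backbone" and "head" are single scalars, the data is synthetic, and the step counts and learning rate are arbitrary. It only shows how Stage 2 updates the head while the backbone stays fixed.

```python
def train(params, data, trainable, steps, lr=0.01):
    """Gradient descent on (head * backbone * p - x)^2, updating only `trainable`."""
    for _ in range(steps):
        for p, x in data:
            a, b = params["backbone"], params["head"]
            err = b * a * p - x
            grads = {"backbone": 2 * err * b * p, "head": 2 * err * a * p}
            for name in trainable:              # frozen parameters are skipped
                params[name] -= lr * grads[name]
    return params


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]     # toy pairs with x = 2 * p
params = {"backbone": 0.5, "head": 0.5}

# Stage 1: joint training of both modules.
train(params, data, trainable=("backbone", "head"), steps=2)
backbone_after_stage1 = params["backbone"]

# Stage 2: backbone frozen, only the head is updated.
train(params, data, trainable=("head",), steps=300)
```

In Stage 2 the head's input (here `backbone * p`) no longer changes between updates, which is the toy analogue of the fixed conditioning distribution described above.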

Experimental Setup

Datasets

  • Training Data: 50k hour subset of LibriVox corpus (from Libri-Light)
  • Evaluation Data: LibriSpeech(PC) test-clean dataset
  • Evaluation Protocol: Randomly selects 40 speakers, one utterance per speaker, with 3-second reference audio

Evaluation Metrics

  1. Intelligibility: Word Error Rate (WER) - computed using Whisper-Large transcription
  2. Speaker Similarity: Computed using ECAPA-TDNN embedding cosine similarity
    • SIM-R: Similarity with reference audio
    • SIM-G: Similarity with ground truth speech
  3. Speech Quality: UTMOS - MOS predictor trained on large-scale human ratings
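The two simplest metrics can be sketched in plain Python. This is an illustration of the definitions only: it omits the Whisper-Large transcription step, any text normalization, and the ECAPA-TDNN embedding extraction, and it assumes a non-empty reference and non-zero vectors.

```python
import math


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(u, v):
    """Cosine of the angle between two speaker embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
sim = cosine_similarity([1.0, 0.0], [1.0, 0.0])
```

Here `wer` is one deleted word out of six reference words. SIM-R and SIM-G differ only in which audio the second embedding comes from (the reference prompt or the ground-truth utterance).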

Baseline Methods

  • VALL-E: Discrete token method (400M parameters)
  • MegaTTS: Continuous token method (500M parameters)
  • Voicebox: Non-autoregressive continuous method (400M parameters)
  • StyleTTS2: Non-autoregressive continuous method (700M parameters)

Implementation Details

  • Backbone Network: OPT-125M
  • Speech Representation: 64-dimensional VAE embedding, 25fps
  • Speaker Embedding: 768-dimensional LAM embedding
  • Diffusion Parameters: T=1000 steps training, 100 steps inference, cosine noise schedule
  • Optimizer: Adam, no weight decay, FP16 mixed precision

Experimental Results

Main Results

Method    | Modeling | Token Type | Parameters | WER(%)↓ | SIM↑ | UTMOS↑
VALL-E    | AR+NAR   | Discrete   | 400M       | 6.11    | 0.47 | 3.68
MegaTTS   | AR+NAR   | Continuous | 500M       | 2.32    | 0.53 | 4.02
Voicebox  | NAR      | Continuous | 400M       | 2.14    | 0.48 | 3.73
StyleTTS2 | NAR      | Continuous | 700M       | 2.49    | 0.38 | 3.94
This Work | AR       | Continuous | 160M       | 1.95    | 0.54 | 4.00

Key Findings:

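  • Achieves the lowest WER (1.95%) and highest SIM (0.54) among the compared systems with only 160M parameters
  • Achieves 46% relative WER reduction compared to Stage 1 baseline (3.61%→1.95%)
  • Outperforms the larger baselines on WER and SIM; UTMOS (4.00) is slightly below MegaTTS (4.02)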
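The 46% figure follows directly from the two WER values quoted above; a short check (function name chosen for illustration):

```python
def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


wer_stage1 = 3.61   # one-stage baseline WER (%)
wer_stage2 = 1.95   # WER (%) after Stage 2 fine-tuning
reduction = relative_reduction(wer_stage1, wer_stage2)
print(f"{reduction:.1%}")   # prints 46.0%
```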

Ablation Studies

Masking Ratio Impact

Masking Ratio(%) | WER(%)↓ | SIM-R↑ | UTMOS↑
0                | 15.06   | 0.45   | 2.00
15               | 12.65   | 0.45   | 1.39
30               | 6.17    | 0.46   | 3.21
50               | 8.13    | 0.46   | 2.84

Finding: A 30% masking ratio gives the lowest WER (6.17%) and the highest UTMOS (3.21) among the tested ratios

Diffusion Head Depth Impact

MLP Layers | Stage 2 Fine-tuning | WER(%)↓ | SIM-R↑ | UTMOS↑
3          | No                  | 6.17    | 0.46   | 3.10
6          | No                  | 5.12    | 0.50   | 3.10
12         | No                  | 3.61    | 0.49   | 3.21
12         | Yes                 | 1.95    | 0.54   | 4.00

Finding: Deepening the diffusion head from 3 to 12 MLP layers lowers WER from 6.17% to 3.61%, and Stage 2 fine-tuning lowers it further to 1.95%

Stopping Criterion Comparison

Stopping Criterion | WER(%)↓ | SIM-R↑ | UTMOS↑
GT-Dur.   | 29.36 | 0.48 | 2.55
GT-EP.    | 3.46  | 0.49 | 3.21
EOS Token | 3.61  | 0.49 | 3.21

Finding: EOS token method achieves comparable performance without requiring oracle information

Related Work

Zero-Shot TTS

  • Multi-stage Systems: VALL-E, SALAD and others employ multi-stage processing through semantic or codec tokens
  • Single-Stage Methods: MegaTTS, NaturalSpeech directly generate high-information continuous representations
  • This Work's Contribution: Achieves single-stage continuous speech generation within unified MLLM framework

Autoregressive Diffusion

  • Existing Methods: TransFusion and others attempt to combine autoregressive and diffusion approaches but face difficulties with strict causal generation
  • This Work's Innovation: Implements strict frame-level autoregressive continuous representation diffusion

Conclusions and Discussion

Main Conclusions

  1. Effectiveness Validation: Continuous token diffusion significantly outperforms discrete methods for TTS within MLLM framework
  2. Efficiency Advantage: Achieves the lowest WER and highest SIM with fewer parameters (160M vs 400-700M)
  3. Training Strategy Importance: Two-stage training and masked learning are crucial for performance improvement

Limitations

  1. Computational Complexity: Diffusion process requires multi-step inference, incurring significant computational overhead
  2. Monolingual Constraint: Currently validated only on English data
  3. Speaker Generalization: Generalization capability to unseen speakers requires further verification
  4. Real-time Performance: 100-step diffusion inference may impact real-time applications
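A rough sense of the inference cost follows from the 25 fps frame rate and the 100 sampling steps listed under Implementation Details. The sketch below assumes one diffusion-head forward pass per sampling step per frame and ignores the backbone and any decoder cost; the function name is illustrative.

```python
FRAMES_PER_SECOND = 25   # speech representation rate
SAMPLING_STEPS = 100     # diffusion steps used at inference


def head_evaluations(audio_seconds):
    """Diffusion-head forward passes, assuming one pass per step per frame."""
    return round(audio_seconds * FRAMES_PER_SECOND) * SAMPLING_STEPS


per_second = head_evaluations(1.0)     # 2500 passes for one second of audio
ten_seconds = head_evaluations(10.0)   # 25000 passes for a ten-second utterance
```

Under these assumptions, reducing the number of sampling steps lowers the cost proportionally, which is why faster sampling appears among the future directions.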

Future Directions

  1. Multilingual Extension: Extend to multilingual TTS tasks
  2. Inference Acceleration: Investigate faster diffusion sampling methods
  3. Unified Framework: Integrate additional speech tasks (ASR, speech translation, etc.)
  4. Long-form Synthesis: Improve stability for long-sequence speech synthesis

In-Depth Evaluation

Strengths

  1. Technical Innovation:
    • First to implement frame-level continuous diffusion within MLLM framework
    • Dual-head architecture keeps text and speech generation within a single unified model
    • Two-stage training effectively addresses distribution shift issues
  2. Comprehensive Experiments:
    • Thorough ablation studies validating component contributions
    • Multi-dimensional evaluation metrics (intelligibility, similarity, quality)
    • Fair comparison with multiple strong baselines
  3. Convincing Results:
    • Significant performance improvement (46% relative WER reduction)
    • Clear parameter efficiency advantage
    • Achieves SOTA autoregressive performance

Weaknesses

  1. Method Complexity:
    • Requires two-stage training, increasing training complexity
    • Multiple hyperparameters requiring tuning (masking ratio, diffusion steps, etc.)
  2. Experimental Limitations:
    • Validation on single dataset only
    • Lack of subjective evaluation experiments
    • Insufficient inference speed analysis
  3. Theoretical Analysis:
    • Relatively simple theoretical explanation for two-stage training
    • Absence of convergence analysis

Impact

  1. Academic Value: Provides new technical pathway for continuous speech generation in MLLMs
  2. Practical Value: Achieves high-quality speech synthesis while maintaining unified framework
  3. Reproducibility: Detailed implementation descriptions facilitate reproduction

Applicable Scenarios

  1. Personalized Voice Assistants: Zero-shot speaker cloning capability
  2. Multimodal Dialogue Systems: Unified text and speech processing
  3. Content Creation: High-quality speech content generation
  4. Assistive Technology: Speech synthesis services for visually and hearing impaired users

References

The paper cites 42 relevant references covering key works in multimodal LLMs, diffusion models, speech synthesis and related domains, providing solid theoretical foundation for this research.


Overall Assessment: This is a high-quality research work on speech synthesis within multimodal large language model frameworks. The proposed continuous token diffusion method is technically innovative, experimental results are convincing, and it provides valuable contributions to the development of unified multimodal AI systems. Despite certain limitations, its technical approach and experimental validation establish a solid foundation for subsequent research in this field.