Continuous-Token Diffusion for Speaker-Referenced TTS in Multimodal LLMs
He, Ray, Mallidi et al.
Unified architectures in multimodal large language models (MLLM) have shown promise in handling diverse tasks within a single framework. In the text-to-speech (TTS) task, current MLLM-based approaches rely on discrete token representations, which disregard the inherently continuous nature of speech and can lead to loss of fine-grained acoustic information. In this work, we investigate TTS within the MLLM paradigm using continuous speech representations. We design a dual-head architecture and implement two complementary training strategies for a robust model. (1) A diffusion head that generates continuous speech representations is added to the MLLM; it operates at the frame level and is strictly autoregressive. (2) The original language model head is retained to preserve multitask capability and to control the start and end of speech synthesis. (3) Masked training is employed to address exposure bias in autoregressive decoding. (4) To stabilize optimization, we propose a two-stage scheme where the LM is frozen in the second stage, ensuring the diffusion head learns from a fixed input distribution. Evaluations on LibriSpeech(PC) test-clean show that our approach achieves state-of-the-art autoregressive performance, with a WER of 1.95%, speaker similarity of 0.54, and UTMOS of 4.00. The two-stage training yields a 46% relative WER reduction over the one-stage training baseline. These results highlight the effectiveness of combining autoregressive modeling with continuous-token diffusion, supported by a two-stage training procedure.
Unified multimodal large language model (MLLM) architectures have demonstrated promise in handling diverse tasks within a single framework. For text-to-speech (TTS) synthesis, current MLLM-based approaches rely on discrete token representations, which overlook the inherently continuous nature of speech and can lose fine-grained acoustic information. This work investigates TTS using continuous speech representations within the MLLM paradigm. It designs a dual-head architecture and applies two complementary training strategies to obtain a robust model. The approach achieves state-of-the-art autoregressive performance on LibriSpeech(PC) test-clean, with a WER of 1.95%, speaker similarity of 0.54, and UTMOS of 4.00.
Current MLLM-based TTS methods face the following challenges:
Discretization Loss: Existing methods convert speech into discrete tokens, ignoring the continuous nature of speech and resulting in loss of fine-grained acoustic information
Quantization Bottleneck: Discrete quantization discards fine acoustic details, limiting speech naturalness and fidelity
Lack of Unified Framework: Absence of effective methods to generate high-quality continuous speech while maintaining MLLM's multi-task capabilities
The main contributions are as follows:
Novel Architecture: Proposes a frame-level continuous-token diffusion head integrated into the autoregressive MLLM framework, distinct from existing chunk-level multi-frame designs
Dual-Head Design: Designs a dual-head architecture that maintains a unified multimodal framework, with the LM head supporting variable-length speech synthesis
Training Strategy: Employs masked training to mitigate autoregressive exposure bias, improving temporal consistency and model robustness
Optimization Scheme: Proposes a two-stage training strategy, with the LM frozen in the second stage, that stabilizes optimization, achieving a 46% relative WER reduction over one-stage training and SOTA autoregressive performance on LibriSpeech(PC)
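The two-stage scheme can be illustrated with a deliberately tiny toy, given here as a sketch rather than the paper's implementation. It assumes a scalar "LM" weight `w_lm` that produces a hidden state and a scalar "head" weight `w_head` that reads it; both names, the `train` helper, the toy data, and the choice to train both parts jointly in stage 1 are assumptions made for illustration. The point is only that freezing the LM in stage 2 leaves the head's input unchanged while the head keeps learning.

```python
def train(params, data, lr, steps, frozen=()):
    """Plain SGD on squared error for y_hat = w_head * (w_lm * x).

    Parameters whose names appear in `frozen` receive no updates.
    """
    for _ in range(steps):
        for x, y in data:
            h = params["w_lm"] * x          # toy "LM" hidden state fed to the head
            err = params["w_head"] * h - y  # prediction error of the toy "head"
            grads = {
                "w_lm": 2 * err * params["w_head"] * x,
                "w_head": 2 * err * h,
            }
            for name, g in grads.items():
                if name not in frozen:
                    params[name] -= lr * g
    return params


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # toy target: y = 2x
params = {"w_lm": 0.5, "w_head": 0.5}

# Stage 1 (assumed here): LM and head are trained jointly.
train(params, data, lr=0.01, steps=50)
w_lm_after_stage1 = params["w_lm"]

# Stage 2: the LM is frozen, so the head trains on a fixed input distribution.
train(params, data, lr=0.01, steps=200, frozen=("w_lm",))
```

In a real framework the same effect would usually come from disabling gradients on the LM parameters rather than filtering updates by name. The toy says nothing about the size of the WER gain, which is an empirical result reported by the paper.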
The task is defined as follows:
Input: Text transcription and reference audio segment
Output: High-quality speech with the reference speaker's characteristics
Constraint: Implementation within a unified MLLM framework while maintaining multi-task capabilities
Comparison with prior work:
Existing Methods: TransFusion and others attempt to combine autoregressive and diffusion approaches but face difficulties with strictly causal generation
This Work's Innovation: Implements strictly autoregressive, frame-level diffusion over continuous speech representations
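The control flow implied by a strictly frame-level, dual-head design might look like the sketch below. Everything named here is hypothetical: the `END_OF_SPEECH` token, the `generate_frames` function, and the toy stand-ins for the backbone and the two heads are illustrative assumptions, and the summary does not specify how the LM head actually signals the end of speech. The sketch shows only that each step yields exactly one continuous frame, which is fed back into the context before the next step.

```python
import random

END_OF_SPEECH = "<eos>"  # hypothetical control token emitted by the LM head


def generate_frames(backbone, lm_head, diffusion_head, prompt, max_frames=1000):
    """Strictly autoregressive, frame-level decoding with two heads.

    At each step the shared backbone summarizes the context, the LM head
    decides whether to stop, and otherwise the diffusion head produces
    exactly one continuous frame that is appended to the context.
    """
    context = list(prompt)
    frames = []
    for _ in range(max_frames):
        hidden = backbone(context)
        if lm_head(hidden) == END_OF_SPEECH:
            break
        frame = diffusion_head(hidden)
        frames.append(frame)
        context.append(frame)  # the next step conditions on this frame
    return frames


_rng = random.Random(0)


def toy_diffusion_head(hidden, steps=4, dim=2):
    """Toy stand-in for iterative denoising: start from Gaussian noise and
    move halfway toward a hidden-dependent target at every step."""
    x = [_rng.gauss(0.0, 1.0) for _ in range(dim)]
    target = [0.01 * hidden] * dim
    for _ in range(steps):
        x = [xi + 0.5 * (ti - xi) for xi, ti in zip(x, target)]
    return x


# Toy prompt: two text tokens plus one reference-audio frame.
prompt = ["hello", "world", [0.0, 0.0]]
stop_at = len(prompt) + 5  # the toy LM head stops after five generated frames

frames = generate_frames(
    backbone=len,  # toy backbone: the hidden state is just the context length
    lm_head=lambda h: END_OF_SPEECH if h >= stop_at else "<speech>",
    diffusion_head=toy_diffusion_head,
    prompt=prompt,
)
```

Because the stop decision comes from the LM head rather than a fixed length, the number of generated frames can vary per utterance, which matches the variable-length synthesis described in the contributions.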
The paper cites 42 references covering key works in multimodal LLMs, diffusion models, speech synthesis, and related domains, providing a solid theoretical foundation for this research.
Overall Assessment: This is a high-quality research work on speech synthesis within multimodal large language model frameworks. The proposed continuous token diffusion method is technically innovative, experimental results are convincing, and it provides valuable contributions to the development of unified multimodal AI systems. Despite certain limitations, its technical approach and experimental validation establish a solid foundation for subsequent research in this field.