Next Semantic Scale Prediction via Hierarchical Diffusion Language Models
Zhou, Wang, Zhang et al.
In this paper we introduce Hierarchical Diffusion Language Models (HDLM) -- a novel family of discrete diffusion models for language modeling. HDLM builds on a hierarchical vocabulary where low-level tokens with detailed semantics are surjectively mapped to high-level tokens with coarse-grained meanings. In the forward process, each token is independently perturbed to its higher-level ancestor with more abstract semantics according to the scheduler, while in the reverse process the model progressively predicts the next, more detailed semantics. Taken together, HDLM provides a general time-varying next semantic scale prediction process for language modeling. We derive closed-form expressions for the diffusion Evidence Lower Bound (ELBO), and show that HDLM can be implemented in a flexible manner while including the existing MDLM as a special case. We also propose practical training techniques based on the insights. Extensive text generation experiments validate the effectiveness of HDLM, which demonstrates consistently lower validation and generative perplexity than baselines.
This paper introduces Hierarchical Diffusion Language Models (HDLM), a novel family of discrete diffusion models for language modeling. HDLM is built on a hierarchical vocabulary in which low-level tokens with detailed semantics are surjectively mapped to high-level tokens with coarse-grained meanings. In the forward process, each token is independently perturbed to a higher-level ancestor with more abstract semantics according to a scheduler; in the reverse process, the model progressively predicts the next, more detailed semantic level. Together these give a general, time-varying next semantic scale prediction process for language modeling. The authors derive a closed-form expression for the diffusion evidence lower bound (ELBO) and show that HDLM can be implemented flexibly while recovering the existing MDLM as a special case.
Existing discrete diffusion language models have several fundamental limitations:
Masked Diffusion: All masked tokens share the same mask embedding, so masked positions carry no rich semantics, and tokens that have already been generated cannot be self-corrected.
Uniform Diffusion: The same token acts as noise at noisy stages but becomes meaningful during decoding, which leads to semantic inconsistency and confusion.
Although autoregressive language models are currently state-of-the-art, their next-token prediction scheme fundamentally limits the ability to revise previously generated tokens. Diffusion models have gained attention for their progressive denoising and refinement capabilities, but existing discrete diffusion methods still have significant limitations in language modeling.
The authors propose to exploit the strengths of diffusion models by introducing a semantic hierarchy, which enables arbitrary-order generation and progressive self-refinement, analogous to next-scale prediction in visual autoregressive modeling (VAR).
The main contributions are:
HDLM Framework: A general and flexible discrete diffusion language modeling framework, implemented through time-varying next semantic scale prediction.
Rigorous Theoretical Foundation: A closed-form ELBO for hierarchical discrete diffusion, derived within the continuous-time Markov chain (CTMC) framework.
Compatibility: A theoretical demonstration that MDLM is a special case of HDLM, which shows the generality of the framework.
Practical Techniques: Improved training and sampling techniques motivated by the theoretical insights.
Performance Improvements: Consistently lower validation and generative perplexity than baselines in text generation experiments.
Given noisy input, HDLM progressively predicts more detailed tokens by descending the semantic hierarchy until every position is resolved to a token in the original vocabulary. The input consists of noisy tokens at different hierarchy levels, and the output is a prediction distribution over word-level tokens.
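The forward perturbation that produces such mixed-level input can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy two-level hierarchy in WORD_TO_CLUSTER, the function name forward_perturb, and the quadratic schedule that decides how likely a token is to sit at the word, cluster, or mask level at noise level t.

```python
import random

MASK = "<mask>"

# Hypothetical two-level hierarchy: each word maps to exactly one cluster
# (a surjective word -> cluster map), and every cluster maps to MASK.
WORD_TO_CLUSTER = {
    "cat": "<animal>", "dog": "<animal>",
    "red": "<color>", "blue": "<color>",
}

def forward_perturb(tokens, t, rng):
    """Independently lift each token to an ancestor at noise level t in [0, 1].

    Assumed schedule (illustrative only): a token stays a word with
    probability (1 - t)**2, is fully masked with probability t**2, and
    otherwise sits at the intermediate cluster level.
    """
    p_word = (1.0 - t) ** 2
    p_mask = t ** 2
    noisy = []
    for tok in tokens:
        u = rng.random()
        if u < p_word:
            noisy.append(tok)                   # unchanged word-level token
        elif u < 1.0 - p_mask:
            noisy.append(WORD_TO_CLUSTER[tok])  # coarse cluster-level token
        else:
            noisy.append(MASK)                  # top of the hierarchy
    return noisy

# Example: a partially noised sequence mixing words, clusters, and masks.
print(forward_perturb(["cat", "red", "dog", "blue"], 0.5, random.Random(0)))
```

At t = 0 the sequence is returned unchanged, and at t = 1 every position is masked; in between, a denoiser would see tokens at several hierarchy levels at once and would be asked for word-level distributions at each position.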
HDLM introduces a perturbation probability ξ < 1: a word token transitions to an incorrect cluster with probability 1 - ξ, which enhances the model's self-correction capability.
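A minimal sketch of this stochastic perturbation follows. The toy hierarchy, the function name lift_to_cluster, and the choice to pick the incorrect cluster uniformly at random are assumptions for illustration; the summary above states only that the incorrect transition happens with probability 1 - ξ.

```python
import random

# Hypothetical word -> cluster map with three clusters.
WORD_TO_CLUSTER = {
    "cat": "<animal>", "dog": "<animal>",
    "red": "<color>", "blue": "<color>",
    "run": "<action>", "jump": "<action>",
}
CLUSTERS = sorted(set(WORD_TO_CLUSTER.values()))

def lift_to_cluster(word, xi, rng):
    """Lift a word to a cluster token.

    With probability xi the correct cluster is returned; with probability
    1 - xi a wrong cluster is drawn (uniformly, by assumption).
    """
    correct = WORD_TO_CLUSTER[word]
    if rng.random() < xi:
        return correct
    wrong = [c for c in CLUSTERS if c != correct]
    return rng.choice(wrong)

# Example: with xi = 0.9, roughly one lift in ten lands in a wrong cluster.
rng = random.Random(0)
samples = [lift_to_cluster("cat", 0.9, rng) for _ in range(1000)]
print(sum(s != "<animal>" for s in samples) / len(samples))
```

Because the model sometimes sees a cluster token that does not match the underlying word, it cannot treat cluster-level input as fully reliable, which is the intuition behind the self-correction benefit described above.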
The paper cites key foundational works in diffusion models, language modeling, and discrete state space modeling, including D3PM, MDLM, and GIDD, as well as classic language models such as the GPT series and BERT.