Next Semantic Scale Prediction via Hierarchical Diffusion Language Models

Zhou, Wang, Zhang et al.

In this paper we introduce Hierarchical Diffusion Language Models (HDLM) -- a novel family of discrete diffusion models for language modeling. HDLM builds on a hierarchical vocabulary where low-level tokens with detailed semantics are surjectively mapped to high-level tokens with coarse-grained meanings. In the forward process, each token is independently perturbed to its higher-level ancestor with more abstract semantics according to the scheduler, while in the reverse process the model progressively predicts the next, more detailed semantics. Taken together, HDLM provides a general time-varying next semantic scale prediction process for language modeling. We derive closed-form expressions for the diffusion Evidence Lower Bound (ELBO), and show that HDLM can be implemented in a flexible manner while including the existing MDLM as a special case. We also propose practical training techniques based on the insights. Extensive text generation experiments validate the effectiveness of HDLM, which demonstrates consistently lower validation and generative perplexity than baselines.

Basic Information

Paper ID: 2510.08632
Title: Next Semantic Scale Prediction via Hierarchical Diffusion Language Models
Authors: Cai Zhou, Chenyu Wang, Dinghuai Zhang, Shangyuan Tong, Yifei Wang, Stephen Bates, Tommi Jaakkola
Classification: cs.CL cs.LG
Conference: NeurIPS 2025 (39th Conference on Neural Information Processing Systems)
Paper Link: https://arxiv.org/abs/2510.08632

Abstract

This paper introduces Hierarchical Diffusion Language Models (HDLM), a novel family of discrete diffusion models for language modeling. HDLM is built on a hierarchical vocabulary in which low-level tokens with detailed semantics are surjectively mapped to high-level tokens with coarse-grained meanings. In the forward process, each token is independently perturbed to its higher-level ancestor with more abstract semantics according to a scheduler; in the reverse process, the model progressively predicts the next, more detailed semantic level. HDLM thus provides a general time-varying next semantic scale prediction process for language modeling. The authors derive a closed-form expression for the diffusion evidence lower bound (ELBO) and show that HDLM can be implemented flexibly while recovering the existing MDLM as a special case.

Research Background and Motivation

1. Problems to Address

Existing discrete diffusion language models have several fundamental limitations:

Masked Diffusion: All masked tokens share the same mask embedding, lacking rich semantics; unable to self-correct already-generated tokens
Uniform Diffusion: The same token serves as noise at noisy stages but becomes meaningful during decoding, resulting in semantic inconsistency and confusion

2. Problem Significance

Although autoregressive language models are currently state-of-the-art, their next-token prediction scheme fundamentally limits the ability to revise previously generated tokens. Diffusion models have gained attention for their progressive denoising and refinement capabilities, but existing discrete diffusion methods still have significant limitations in language modeling.

3. Limitations of Existing Methods

MDLM and MD4: Masked tokens lack rich semantics and cannot self-correct
Uniform Discrete Diffusion: Poor performance with semantic inconsistency
GIDD: Although unifying masked and uniform noise, noise tokens still lack rich semantics with limited self-correction capability

4. Research Motivation

The authors propose maximizing the advantages of diffusion models by introducing semantic hierarchies, enabling arbitrary-order generation and progressive self-refinement, similar to next-scale prediction in visual autoregressive models (VAR).

Core Contributions

Proposed HDLM Framework: A general and flexible discrete diffusion language modeling framework implemented through time-varying next semantic scale prediction
Established Rigorous Theoretical Foundation: Based on continuous-time Markov chain (CTMC) framework, deriving closed-form ELBO for hierarchical discrete diffusion
Proved Compatibility: Theoretically demonstrated that MDLM is a special case of HDLM, showcasing framework generality
Proposed Practical Techniques: Based on theoretical insights, proposed improved training and sampling techniques
Achieved Performance Improvements: Consistently demonstrated lower validation and generation perplexity compared to baselines in text generation experiments

Methodology Details

Task Definition

Given a noisy input, HDLM progressively predicts more detailed tokens along the semantic hierarchy until the original word-level tokens are recovered. The input is a sequence of noisy tokens that may sit at different levels of the hierarchy (word, cluster, or mask), and the output is a prediction distribution over word-level tokens.

Model Architecture

1. Hierarchical Vocabulary Design

Vocabulary Hierarchy: Hierarchical structure from clean word tokens x to cluster tokens c to mask tokens m: x → c → m
Mapping Relationship: Low-level tokens mapped to high-level tokens through surjective function c = Γx, where Γ ∈ R^{|C|×|V|}
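
The mapping c = Γx can be illustrated with a toy vocabulary. The sketch below is only an illustration: the six-word vocabulary and its cluster assignment are invented, and x is assumed to be a one-hot column vector, so that Γ is a 0/1 matrix whose column v marks the cluster of word v.

```python
# Toy hierarchical vocabulary: 6 word tokens grouped into 3 clusters.
# The assignment is made up for illustration, not taken from the paper.
word_to_cluster = [0, 0, 1, 1, 1, 2]
num_words = len(word_to_cluster)
num_clusters = max(word_to_cluster) + 1

# Gamma has shape |C| x |V|; column v is the one-hot vector of word v's cluster.
gamma = [[1 if word_to_cluster[v] == c else 0 for v in range(num_words)]
         for c in range(num_clusters)]


def one_hot(index, size):
    """Return a one-hot list of the given size."""
    return [1 if i == index else 0 for i in range(size)]


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


x = one_hot(4, num_words)   # word token 4
c = matvec(gamma, x)        # its cluster, again as a one-hot vector
print(c)                    # [0, 1, 0]

# Surjectivity: every cluster is the image of at least one word.
is_surjective = all(sum(row) >= 1 for row in gamma)
```

Because every column of Γ contains exactly one 1, each word has a unique ancestor, and surjectivity only requires that no row of Γ is all zeros.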

2. Forward Process

The marginal distribution of the forward process is:

q_t(z_t|x) = Cat(z_t; α_t x + β_{t,c} c(x) + β_{t,m} m)

where β_{t,c} + β_{t,m} = β_t := 1 - α_t
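
As a rough sketch of how this marginal can be sampled, the code below draws z_t for a single token. The schedule in `forward_probs` (α_t = 1 - t, β_{t,m} = t², β_{t,c} = t - t²) is a placeholder chosen only so that the three probabilities sum to one; it is not the paper's scheduler, and the function names are hypothetical.

```python
import random


def forward_probs(t):
    """Return (alpha_t, beta_tc, beta_tm) for an illustrative schedule on [0, 1]."""
    alpha = 1.0 - t
    beta_m = t * t          # assumed: mask mass grows quadratically in t
    beta_c = t - beta_m     # the rest of 1 - alpha_t goes to the cluster level
    return alpha, beta_c, beta_m


def sample_z_t(word, word_to_cluster, t, rng):
    """Sample z_t ~ Cat(alpha_t x + beta_tc c(x) + beta_tm m) for one token."""
    alpha, beta_c, _ = forward_probs(t)
    u = rng.random()
    if u < alpha:
        return ("word", word)                       # token stays clean
    if u < alpha + beta_c:
        return ("cluster", word_to_cluster[word])   # token moves to its cluster
    return ("mask", None)                           # token is fully masked


rng = random.Random(0)
word_to_cluster = [0, 0, 1, 1, 1, 2]
print(sample_z_t(3, word_to_cluster, 0.0, rng))   # ('word', 3)
print(sample_z_t(3, word_to_cluster, 1.0, rng))   # ('mask', None)
```

At t = 0 the token is always clean and at t = 1 it is always masked; in between, part of the probability mass sits on the cluster token, which is what distinguishes this process from plain masking.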

3. CTMC Framework

The time-inhomogeneous generator matrix is:

Q_t = [α'_t/α_t I_{|V|}    -α'_t/α_t Γ^T    0]
      [0    (α'_t+β'_{t,c})/β_{t,c} I_{|C|}    -(α'_t+β'_{t,c})/β_{t,c} Ξ^T]
      [0    0    0]
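
A small numerical check can make the block structure concrete. The sketch below assembles Q_t for a toy vocabulary under several assumptions that are not stated above: states are ordered as words, then clusters, then the single mask state; Ξ is taken to be the map sending every cluster to the mask, so Ξ^T is a column of ones; and the schedule is the illustrative α_t = 1 - t, β_{t,c} = t - t², which is not the paper's scheduler.

```python
def generator(t, word_to_cluster, num_clusters):
    """Build Q_t (list of rows) for 0 < t < 1 under the illustrative schedule."""
    V, C = len(word_to_cluster), num_clusters
    alpha, d_alpha = 1.0 - t, -1.0                # alpha_t and its time derivative
    beta_c, d_beta_c = t - t * t, 1.0 - 2.0 * t   # beta_{t,c} and its derivative
    r_word = d_alpha / alpha                      # diagonal of the word block (negative)
    r_cluster = (d_alpha + d_beta_c) / beta_c     # diagonal of the cluster block (negative)
    n = V + C + 1
    Q = [[0.0] * n for _ in range(n)]
    for v in range(V):
        Q[v][v] = r_word
        Q[v][V + word_to_cluster[v]] = -r_word    # -(alpha'/alpha) Gamma^T: word -> its cluster
    for c in range(C):
        Q[V + c][V + c] = r_cluster
        Q[V + c][V + C] = -r_cluster              # cluster -> mask (assumed all-ones Xi^T)
    return Q                                      # last row stays zero: mask is absorbing


word_to_cluster = [0, 0, 1, 1, 1, 2]
Q = generator(0.5, word_to_cluster, num_clusters=3)
row_sums = [sum(row) for row in Q]
print(row_sums)
```

Every row sums to zero and all off-diagonal entries are non-negative, as required of a CTMC generator. The only allowed jumps are word to cluster and cluster to mask, matching the hierarchy x → c → m.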

4. Reverse Process

The reverse process uses the standard posterior parameterization, with the clean token x replaced by the model prediction x_θ:

p_θ(z_s|z_t) = q_{t|s}(z_t|z_s) q_s(z_s|x_θ)/q_t(z_t|x_θ)
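
For readers who want the intermediate step, this expression is Bayes' rule applied to the forward process, with the unknown clean token swapped for the network's prediction. The derivation below is a sketch that assumes s < t and uses only the Markov property of the forward chain (given z_s, the later state z_t does not depend on x):

```latex
\begin{align}
q(z_s \mid z_t, x)
  &= \frac{q_{t|s}(z_t \mid z_s, x)\, q_s(z_s \mid x)}{q_t(z_t \mid x)}
   = \frac{q_{t|s}(z_t \mid z_s)\, q_s(z_s \mid x)}{q_t(z_t \mid x)}, \\
p_\theta(z_s \mid z_t)
  &= \frac{q_{t|s}(z_t \mid z_s)\, q_s(z_s \mid x_\theta)}{q_t(z_t \mid x_\theta)} .
\end{align}
```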

Technical Innovations

1. Semantic Hierarchy Structure

Progressive Semantics: Intermediate levels can be viewed as partially decoded tokens, providing richer semantics than single mask tokens
Flexible Decoding: Uncertainty in coarse-grained semantics allows greater decoding flexibility

2. Closed-Form ELBO Derivation

The derived training loss is a weighted combination of two cross-entropy losses:

L(x,x_θ,t) = E_{t,z_t}[δ_{z_t,c} w_{t,c} CE(x, (x_θ ⊙ (Γ^T Γx))/(x_θ^T Γ^T Γx)) + δ_{z_t,m} w_{t,m} CE(Γx, Γx_θ)]
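
Read per token, the two terms have a simple interpretation. When z_t is a cluster token, the predicted word distribution is renormalized inside the true cluster and scored against the true word; when z_t is a mask token, only the predicted cluster is scored. The sketch below illustrates this for one position. The weights `w_c` and `w_m` are placeholders for w_{t,c} and w_{t,m}, whose closed forms are not reproduced in this summary, and the function names are hypothetical.

```python
import math


def cluster_mass(probs, word, word_to_cluster):
    """Total predicted probability of the cluster containing `word`."""
    target = word_to_cluster[word]
    return sum(p for v, p in enumerate(probs) if word_to_cluster[v] == target)


def hdlm_token_loss(probs, word, level, word_to_cluster, w_c=1.0, w_m=1.0):
    """Loss for one position; `level` is the level of z_t at that position."""
    mass = cluster_mass(probs, word, word_to_cluster)
    if level == "cluster":
        # CE(x, x_theta restricted to the true cluster and renormalized)
        return w_c * -math.log(probs[word] / mass)
    if level == "mask":
        # CE(Gamma x, Gamma x_theta): score only the cluster prediction
        return w_m * -math.log(mass)
    return 0.0  # clean word tokens contribute no loss


probs = [0.1, 0.2, 0.3, 0.4]      # predicted word distribution x_theta
word_to_cluster = [0, 0, 1, 1]    # toy clustering, made up for illustration
within = hdlm_token_loss(probs, 2, "cluster", word_to_cluster)
across = hdlm_token_loss(probs, 2, "mask", word_to_cluster)
print(within + across, -math.log(probs[2]))
```

With unit weights the two terms add up to the ordinary word-level cross-entropy -log x_θ[word], so the hierarchy splits the usual prediction problem into "which cluster" and "which word within the cluster".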

3. Stochastic Perturbation Mechanism

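A perturbation parameter ξ < 1 is introduced: when a word token is noised to the cluster level, it moves to its correct cluster with probability ξ and to an incorrect cluster with probability 1-ξ. Training on these deliberately wrong intermediate tokens strengthens the model's self-correction capability.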
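
One possible way to implement this perturbation is sketched below. Choosing the incorrect cluster uniformly at random is an assumption made here for illustration; the summary does not say how the wrong cluster is selected, and the function name is hypothetical.

```python
import random


def perturbed_cluster(word, word_to_cluster, num_clusters, xi, rng):
    """Return the cluster a word is noised to, with probability 1 - xi of a wrong one."""
    correct = word_to_cluster[word]
    if num_clusters == 1 or rng.random() < xi:
        return correct
    # Assumed: wrong clusters are drawn uniformly from all other clusters.
    wrong = [c for c in range(num_clusters) if c != correct]
    return rng.choice(wrong)


rng = random.Random(0)
word_to_cluster = [0, 0, 1, 1, 1, 2]
draws = [perturbed_cluster(3, word_to_cluster, 3, xi=0.9, rng=rng) for _ in range(1000)]
print(draws.count(1) / len(draws))   # fraction of correct clusters, close to xi
```

Setting ξ = 1 recovers the unperturbed forward process, in which a cluster token always reveals the true cluster of the underlying word.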

Experimental Setup

Datasets

Primary Dataset: OpenWebText (OWT); models are trained for 131B tokens in total
Additional Dataset: LM1B (33B tokens) for supplementary validation
Context Length: 512 tokens, without sentence packing

Evaluation Metrics

Validation Perplexity (Valid. PPL): Perplexity on OWT validation set
Generation Perplexity (Gen. PPL): Evaluated on generated samples using GPT2-large as reference model
Downstream Tasks: ARC, BoolQ, PIQA, OpenBookQA, WinoGrande, etc.
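
Both perplexity metrics reduce to the same computation: the exponential of the average negative log-likelihood per token. The sketch below assumes the per-token natural-log probabilities are already available (for Gen. PPL they would come from scoring generated samples with the reference model, which is not shown here); the function name is hypothetical.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood over a list of per-token log-probs."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that assigns probability 1/4 to every token has perplexity 4.
uniform_quarter = [math.log(0.25)] * 8
print(perplexity(uniform_quarter))
```

For diffusion models the validation log-likelihood is only available through the ELBO, which is why the validation perplexities in the results table are reported as upper bounds (≤).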

Baseline Methods

Autoregressive Models: GPT-2, Llama-110M
Discrete Diffusion Models: SEDD, MDLM, GIDD+

Implementation Details

Model Architecture: DiT architecture, Small (170M parameters) and Base (425M parameters)
Optimizer: Adam (β=(0.9,0.99)), learning rate 5×10^{-4}
Training Steps: 500k steps, batch size 512
Weight Clipping: Loss weights w_{t,m}, w_{t,c} clipped to 2.0 or 10.0 for optimization stability
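
Two of these settings can be checked or illustrated with a few lines of arithmetic. The sketch below assumes every training sequence fills the full 512-token context, and it models weight clipping as a simple upper cap, which is one plausible reading of "clipped to 2.0 or 10.0"; both function names are hypothetical.

```python
def training_tokens(steps, batch_size, context_length):
    """Tokens processed during training if every sequence fills the context."""
    return steps * batch_size * context_length


def clip_weight(weight, max_weight):
    """Cap a loss weight from above (assumed form of the clipping)."""
    return min(weight, max_weight)


total = training_tokens(500_000, 512, 512)
print(total)                      # 131072000000
print(clip_weight(37.5, 10.0))    # 10.0
```

The product comes to roughly 131B tokens, which matches the training-token budget listed for every model in the results table.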

Experimental Results

Main Results

Model	Training Tokens	Valid. PPL (↓)	Gen. PPL (↓)
MDLM-small	131B	≤27.39	163.7
GIDD+-small	131B	≤25.82	170.2
HDLM-small-64	131B	≤23.36	144.2
HDLM-small-128	131B	≤23.25	148.0
HDLM-base-128	131B	≤19.22	139.9

Key Findings:

HDLM-small outperforms other discrete diffusion methods in both validation and generation perplexity
HDLM-base reaches a validation perplexity bound of 19.22, reported as matching or surpassing the autoregressive baselines

Ablation Studies

1. Impact of Cluster Count

Optimal cluster count approximately 64-128 (roughly the square root of vocabulary size)
n=1 recovers MDLM performance, validating theoretical analysis

2. Stochastic Perturbation Effect

Generation perplexity is reduced by about 52% at ξ=0.9 (from 144.2 to 69.76)
Generation perplexity is reduced by about 62% at ξ=0.8 (from 144.2 to 54.15)
Demonstrates significant improvement in self-correction capability
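
The relative reductions can be recomputed directly from the perplexity values quoted above. The helper name below is hypothetical, and the baseline of 144.2 is taken from the first item of the list.

```python
def relative_reduction(baseline, value):
    """Fractional decrease of `value` relative to `baseline`."""
    return (baseline - value) / baseline


baseline_gen_ppl = 144.2
for xi, gen_ppl in [(0.9, 69.76), (0.8, 54.15)]:
    pct = 100 * relative_reduction(baseline_gen_ppl, gen_ppl)
    print(f"xi={xi}: {pct:.1f}% lower generation perplexity")
```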

3. Forward Process Scheduling

Larger values of the scheduler parameter γ make each single denoising step harder, yet generation quality at inference improves
Best generation perplexity 135.9 achieved at γ=3

Downstream Task Performance

Across the downstream comprehension tasks, HDLM-small achieves an average accuracy of 39.62%, about 1.1 points above GIDD's 38.53%, which suggests that the gains in perplexity carry over to downstream tasks.

Related Work

1. Discrete Diffusion Model Development

D3PM: Established theoretical foundations for discrete diffusion
SEDD: Learning concrete scores as marginal distribution ratios
MDLM/MD4: Simplified training objectives for masked forward processes

2. Scaling Diffusion Language Models

LLaDA and Dream: Demonstrated scalability potential of diffusion language models
Block Diffusion: Explored new paradigm of generating text blocks autoregressively and diffusing within blocks

3. Advantages of HDLM over Prior Work

Provides a new noise procedure that is conceptually simple and practically effective
Maintains self-correction capability while avoiding uniform noise drawbacks
Establishes rigorous theoretical framework with closed-form ELBO

Conclusions and Discussion

Main Conclusions

HDLM effectively improves discrete diffusion language modeling through "next semantic scale prediction" scheme
Hierarchical semantic structure provides richer intermediate representations than traditional masking
Stochastic perturbation mechanism significantly enhances model self-correction capability
Theoretical framework demonstrates good generality and extensibility

Limitations

Clustering Quality Dependency: Currently uses predefined K-means clustering; clustering quality significantly impacts performance
Computational Complexity: Multi-level structure may increase computational overhead in training and inference
Hyperparameter Sensitivity: Requires careful tuning of hyperparameters such as weight clipping for training stability

Future Directions

Explore more sophisticated hierarchical structure learning methods (e.g., DeepSets)
Investigate implementation and optimization of multiple intermediate levels
Extend framework to larger-scale language models
Explore applications in multimodal tasks

In-Depth Evaluation

Strengths

Solid Theoretical Contribution: Provides complete CTMC theoretical framework with rigorous mathematical derivations
Strong Method Innovation: First to introduce semantic hierarchy into discrete diffusion language models
Comprehensive Experimental Design: Includes thorough ablation studies and comparative experiments
High Practical Value: Proposed techniques can be directly applied to existing diffusion model frameworks

Weaknesses

Scale Limitations: Experiments primarily conducted on small-to-medium scale models; insufficient large-scale validation
Simple Clustering Method: Current semantic clustering approach is relatively basic, potentially limiting performance ceiling
Generation Quality Assessment: Primarily relies on perplexity metrics; lacks human evaluation and diversity analysis

Impact

Academic Contribution: Provides new research direction for discrete diffusion language modeling
Practical Value: Simple and easy-to-implement method with potential for real-world application
Reproducibility: Authors provide complete code implementation and detailed experimental settings

Applicable Scenarios

Text Generation Tasks: Particularly suitable for generation scenarios requiring progressive refinement
Controlled Text Generation: Hierarchical structure facilitates implementation of control at different granularities
Text Editing and Revision: Self-correction capability makes it suitable for text modification tasks

References

The paper cites important works in diffusion models, language modeling, and discrete state space modeling, including D3PM, MDLM, GIDD and other key foundational works, as well as classic language models such as GPT series and BERT.