2025-11-14T10:58:11.492990

Next Semantic Scale Prediction via Hierarchical Diffusion Language Models

Zhou, Wang, Zhang et al.
In this paper we introduce Hierarchical Diffusion Language Models (HDLM) -- a novel family of discrete diffusion models for language modeling. HDLM builds on a hierarchical vocabulary where low-level tokens with detailed semantics are surjectively mapped to high-level tokens with coarse-grained meanings. In the forward process, each token is independently perturbed to its higher-level ancestor with more abstract semantics according to the scheduler, while in the reverse process the model progressively predicts the next, more detailed semantics. Taken together, HDLM provides a general time-varying next semantic scale prediction process for language modeling. We derive closed-form expressions for the diffusion Evidence Lower Bound (ELBO), and show that HDLM can be implemented in a flexible manner while including the existing MDLM as a special case. We also propose practical training techniques based on the insights. Extensive text generation experiments validate the effectiveness of HDLM, which demonstrates consistently lower validation and generative perplexity than baselines.

Basic Information

  • Paper ID: 2510.08632
  • Title: Next Semantic Scale Prediction via Hierarchical Diffusion Language Models
  • Authors: Cai Zhou, Chenyu Wang, Dinghuai Zhang, Shangyuan Tong, Yifei Wang, Stephen Bates, Tommi Jaakkola
  • Classification: cs.CL cs.LG
  • Conference: NeurIPS 2025 (39th Conference on Neural Information Processing Systems)
  • Paper Link: https://arxiv.org/abs/2510.08632

Abstract

This paper introduces Hierarchical Diffusion Language Models (HDLM), a novel family of discrete diffusion models for language modeling. HDLM is built on a hierarchical vocabulary in which low-level tokens with detailed semantics are surjectively mapped to high-level tokens with coarse-grained meanings. In the forward process, each token is independently perturbed to its higher-level ancestor with more abstract semantics according to a scheduler; in the reverse process, the model progressively predicts the next, more detailed semantic level. HDLM thus provides a general time-varying next semantic scale prediction process for language modeling. The authors derive a closed-form expression for the diffusion evidence lower bound (ELBO) and show that HDLM can be implemented flexibly while including the existing MDLM as a special case.

Research Background and Motivation

1. Problems to Address

Existing discrete diffusion language models have several fundamental limitations:

  • Masked Diffusion: All masked tokens share the same mask embedding, lacking rich semantics; unable to self-correct already-generated tokens
  • Uniform Diffusion: The same token serves as noise at noisy stages but becomes meaningful during decoding, resulting in semantic inconsistency and confusion

2. Problem Significance

Although autoregressive language models are currently state-of-the-art, their next-token prediction scheme fundamentally limits the ability to revise previously generated tokens. Diffusion models have gained attention for their progressive denoising and refinement capabilities, but existing discrete diffusion methods still have significant limitations in language modeling.

3. Limitations of Existing Methods

  • MDLM and MD4: Masked tokens lack rich semantics and cannot self-correct
  • Uniform Discrete Diffusion: Poor performance with semantic inconsistency
  • GIDD: Unifies masked and uniform noise, but its noise tokens still lack rich semantics and its self-correction capability remains limited

4. Research Motivation

The authors propose maximizing the advantages of diffusion models by introducing semantic hierarchies, enabling arbitrary-order generation and progressive self-refinement, similar to next-scale prediction in visual autoregressive models (VAR).

Core Contributions

  1. Proposed HDLM Framework: A general and flexible discrete diffusion language modeling framework implemented through time-varying next semantic scale prediction
  2. Established Rigorous Theoretical Foundation: Based on continuous-time Markov chain (CTMC) framework, deriving closed-form ELBO for hierarchical discrete diffusion
  3. Proved Compatibility: Theoretically demonstrated that MDLM is a special case of HDLM, showcasing framework generality
  4. Proposed Practical Techniques: Based on theoretical insights, proposed improved training and sampling techniques
  5. Achieved Performance Improvements: Consistently demonstrated lower validation and generation perplexity compared to baselines in text generation experiments

Methodology Details

Task Definition

Given noisy input, HDLM progressively predicts more detailed tokens along the semantic hierarchy until every position holds a token from the original word-level vocabulary. The input consists of noisy tokens at different hierarchy levels, and the output is a prediction distribution over word-level tokens.

Model Architecture

1. Hierarchical Vocabulary Design

  • Vocabulary Hierarchy: Hierarchical structure from clean word tokens x to cluster tokens c to mask tokens m: x → c → m
  • Mapping Relationship: Low-level tokens mapped to high-level tokens through surjective function c = Γx, where Γ ∈ R^{|C|×|V|}
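
The mapping c = Γx can be illustrated with a small NumPy sketch. It assumes Γ is a 0/1 matrix with exactly one nonzero entry per column, which is one way to realize a surjective word-to-cluster function on one-hot vectors. The toy cluster assignment and the helper name `build_gamma` are hypothetical and not taken from the paper.

```python
import numpy as np

def build_gamma(cluster_of_token: np.ndarray, num_clusters: int) -> np.ndarray:
    """Return Γ of shape (|C|, |V|), with Γ[c, v] = 1 iff word v belongs to cluster c."""
    vocab_size = cluster_of_token.shape[0]
    gamma = np.zeros((num_clusters, vocab_size))
    gamma[cluster_of_token, np.arange(vocab_size)] = 1.0
    return gamma

# Toy assignment (hypothetical): 6 word tokens grouped into 3 clusters.
cluster_of_token = np.array([0, 0, 1, 1, 2, 2])
gamma = build_gamma(cluster_of_token, num_clusters=3)

x = np.eye(6)[3]   # one-hot word token with index 3
c = gamma @ x      # one-hot cluster token of that word
print(c)
```

Every column of Γ sums to one (each word has exactly one cluster), and every row has at least one nonzero entry (each cluster is reached by some word), which is what surjectivity requires.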

2. Forward Process

The marginal distribution of the forward process is:

q_t(z_t|x) = Cat(z_t; α_t x + β_{t,c} c(x) + β_{t,m} m)

where β_{t,c} + β_{t,m} = β_t := 1 - α_t
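
As a rough illustration of this marginal, the sketch below builds the categorical distribution over a joint state space ordered as [word tokens | cluster tokens | mask] and samples z_t from it. The state ordering, the function name `forward_marginal`, and the schedule values α_t = 0.5, β_{t,c} = 0.3 are assumptions chosen for the example, not values from the paper.

```python
import numpy as np

def forward_marginal(x_idx, cluster_of_token, num_clusters, alpha_t, beta_tc):
    """Probabilities of z_t over the joint state space [words | clusters | mask]."""
    vocab_size = cluster_of_token.shape[0]
    beta_tm = 1.0 - alpha_t - beta_tc          # from β_{t,c} + β_{t,m} = 1 - α_t
    probs = np.zeros(vocab_size + num_clusters + 1)
    probs[x_idx] = alpha_t                                   # token stays a clean word
    probs[vocab_size + cluster_of_token[x_idx]] = beta_tc    # token becomes its cluster
    probs[-1] = beta_tm                                      # token becomes the mask
    return probs

rng = np.random.default_rng(0)
cluster_of_token = np.array([0, 0, 1, 1, 2, 2])   # hypothetical toy hierarchy
probs = forward_marginal(3, cluster_of_token, num_clusters=3, alpha_t=0.5, beta_tc=0.3)
z_t = int(rng.choice(probs.size, p=probs))        # index into the joint state space
```

Only three states ever receive probability mass for a given clean token: the word itself, its cluster, and the mask.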

3. CTMC Framework

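The time-inhomogeneous generator matrix, with rows and columns ordered as word tokens, cluster tokens, and the mask token, and with primes denoting time derivatives, is:

Q_t = [ (α'_t/α_t) I_{|V|}    -(α'_t/α_t) Γ^T                      0                                 ]
      [ 0                     ((α'_t+β'_{t,c})/β_{t,c}) I_{|C|}    -((α'_t+β'_{t,c})/β_{t,c}) Ξ^T    ]
      [ 0                     0                                    0                                 ]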
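
A quick numerical check of this block structure is sketched below. It assumes Ξ is a 1×|C| row of ones (every cluster maps to the single mask token), which the summary does not define explicitly, and it uses made-up schedule values. With those assumptions, each row of Q_t sums to zero and the mask row is all zeros, as expected for a generator with an absorbing mask state.

```python
import numpy as np

def generator(gamma, alpha, d_alpha, beta_c, d_beta_c):
    """Assemble Q_t over the state space [words | clusters | mask]."""
    num_clusters, vocab_size = gamma.shape
    xi = np.ones((1, num_clusters))        # assumption: all clusters map to the mask
    n = vocab_size + num_clusters + 1
    q = np.zeros((n, n))
    r_word = d_alpha / alpha                     # diagonal rate for word tokens
    r_cluster = (d_alpha + d_beta_c) / beta_c    # diagonal rate for cluster tokens
    q[:vocab_size, :vocab_size] = r_word * np.eye(vocab_size)
    q[:vocab_size, vocab_size:vocab_size + num_clusters] = -r_word * gamma.T
    q[vocab_size:-1, vocab_size:-1] = r_cluster * np.eye(num_clusters)
    q[vocab_size:-1, -1:] = -r_cluster * xi.T
    return q

# Toy hierarchy (hypothetical): 6 words in 3 clusters.
cluster_of_token = np.array([0, 0, 1, 1, 2, 2])
gamma = np.zeros((3, 6))
gamma[cluster_of_token, np.arange(6)] = 1.0

# Illustrative schedule values; alpha decreases over time, so d_alpha < 0.
q_t = generator(gamma, alpha=0.5, d_alpha=-1.0, beta_c=0.3, d_beta_c=0.2)
print(np.allclose(q_t.sum(axis=1), 0.0))
```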

4. Reverse Process

The reverse process takes the standard form, with the model prediction x_θ substituted for the clean token x:

p_θ(z_s|z_t) = q_{t|s}(z_t|z_s) q_s(z_s|x_θ)/q_t(z_t|x_θ)

Technical Innovations

1. Semantic Hierarchy Structure

  • Progressive Semantics: Intermediate levels can be viewed as partially decoded tokens, providing richer semantics than single mask tokens
  • Flexible Decoding: Uncertainty in coarse-grained semantics allows greater decoding flexibility

2. Closed-Form ELBO Derivation

The derived training loss is a weighted combination of two cross-entropy losses:

L(x,x_θ,t) = E_{t,z_t}[δ_{z_t,c} w_{t,c} CE(x, (x_θ ⊙ (Γ^T Γx))/(x_θ^T Γ^T Γx)) + δ_{z_t,m} w_{t,m} CE(Γx, Γx_θ)]
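
The two cross-entropy terms can be sketched for a single token as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: x_θ is taken to be a probability vector over word tokens, the weights w_{t,c} and w_{t,m} are passed in as plain numbers, and the function name `hdlm_token_loss` and the toy values are hypothetical. When z_t is a cluster token, the prediction is restricted to words in the true cluster and renormalized; when z_t is the mask, the loss compares clusters only.

```python
import numpy as np

def hdlm_token_loss(x_idx, x_theta, gamma, state, w_tc, w_tm, eps=1e-12):
    """Per-token loss; `state` is the level of z_t: "word", "cluster", or "mask"."""
    vocab_size = gamma.shape[1]
    x = np.eye(vocab_size)[x_idx]
    if state == "cluster":
        same_cluster = gamma.T @ gamma @ x    # indicator of words sharing x's cluster
        p = x_theta * same_cluster / (x_theta @ same_cluster + eps)
        return w_tc * -np.log(p[x_idx] + eps)
    if state == "mask":
        cluster_probs = gamma @ x_theta       # predicted mass per cluster
        true_cluster = int(np.argmax(gamma @ x))
        return w_tm * -np.log(cluster_probs[true_cluster] + eps)
    return 0.0                                # clean word tokens add no loss term

# Toy hierarchy and prediction (hypothetical values).
cluster_of_token = np.array([0, 0, 1, 1, 2, 2])
gamma = np.zeros((3, 6))
gamma[cluster_of_token, np.arange(6)] = 1.0
x_theta = np.array([0.1, 0.1, 0.2, 0.4, 0.1, 0.1])

loss_cluster = hdlm_token_loss(3, x_theta, gamma, "cluster", w_tc=1.0, w_tm=1.0)
loss_mask = hdlm_token_loss(3, x_theta, gamma, "mask", w_tc=1.0, w_tm=1.0)
```

In this toy case the cluster-level term equals -log(0.4 / 0.6), because only words 2 and 3 share the true cluster, and the mask-level term equals -log(0.6), the predicted mass of the true cluster.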

3. Stochastic Perturbation Mechanism

A perturbation probability ξ < 1 is introduced: a word token transitions to its correct cluster with probability ξ and to an incorrect cluster with probability 1-ξ. Exposing the model to wrong clusters during training strengthens its self-correction capability.
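
A minimal sketch of such a perturbation is given below. The summary does not say how the incorrect cluster is chosen, so drawing it uniformly from the other clusters is an assumption, as is the helper name `perturbed_cluster`.

```python
import numpy as np

def perturbed_cluster(true_cluster, num_clusters, xi, rng):
    """Return the true cluster with probability xi, otherwise a wrong one."""
    if rng.random() < xi:
        return true_cluster
    # Assumption: wrong clusters are sampled uniformly at random.
    wrong = [c for c in range(num_clusters) if c != true_cluster]
    return int(rng.choice(wrong))

rng = np.random.default_rng(0)
samples = [perturbed_cluster(2, num_clusters=5, xi=0.9, rng=rng) for _ in range(1000)]
fraction_correct = sum(s == 2 for s in samples) / len(samples)
print(fraction_correct)
```

With ξ = 0.9, roughly nine in ten perturbed tokens land in their true cluster, and the rest give the model examples of coarse tokens that it must learn to override.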

Experimental Setup

Datasets

  • Primary Dataset: OpenWebText (OWT); models are trained on 131B tokens
  • Additional Dataset: LM1B (33B training tokens) for supplementary validation
  • Context Length: 512 tokens, without sentence packing

Evaluation Metrics

  • Validation Perplexity (Valid. PPL): Perplexity on OWT validation set
  • Generation Perplexity (Gen. PPL): Evaluated on generated samples using GPT2-large as reference model
  • Downstream Tasks: ARC, BoolQ, PIQA, OpenBookQA, WinoGrande, etc.

Baseline Methods

  • Autoregressive Models: GPT-2, Llama-110M
  • Discrete Diffusion Models: SEDD, MDLM, GIDD+

Implementation Details

  • Model Architecture: DiT architecture, Small (170M parameters) and Base (425M parameters)
  • Optimizer: Adam (β=(0.9,0.99)), learning rate 5×10^{-4}
  • Training Steps: 500k steps, batch size 512
  • Weight Clipping: Loss weights w_{t,m}, w_{t,c} clipped to 2.0 or 10.0 for optimization stability
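
The reported 131B training tokens are consistent with these settings, as the short check below shows. It assumes every step processes a full batch of 512 sequences, each of the 512-token context length listed under Datasets.

```python
steps = 500_000
batch_size = 512
context_length = 512   # tokens per sequence, from the dataset settings

tokens_seen = steps * batch_size * context_length
print(f"{tokens_seen / 1e9:.1f}B tokens")   # 131.1B tokens
```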

Experimental Results

Main Results

Model            Training Tokens   Valid. PPL (↓)   Gen. PPL (↓)
MDLM-small       131B              ≤27.39           163.7
GIDD+-small      131B              ≤25.82           170.2
HDLM-small-64    131B              ≤23.36           144.2
HDLM-small-128   131B              ≤23.25           148.0
HDLM-base-128    131B              ≤19.22           139.9

Key Findings:

  • HDLM-small outperforms other discrete diffusion methods in both validation and generation perplexity
  • HDLM-base achieves a validation perplexity of at most 19.22, matching or surpassing autoregressive model performance

Ablation Studies

1. Impact of Cluster Count

  • Optimal cluster count approximately 64-128 (roughly the square root of vocabulary size)
  • n=1 recovers MDLM performance, validating theoretical analysis

2. Stochastic Perturbation Effect

  • Generation perplexity reduced by 51% at ξ=0.9 (from 144.2 to 69.76)
  • Generation perplexity reduced by 62% at ξ=0.8 (to 54.15)
  • Demonstrates significant improvement in self-correction capability
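
The quoted percentages follow from the relative reduction (before - after) / before, taking the unperturbed generation perplexity of 144.2 as the baseline. The helper name below is illustrative.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional decrease from `before` to `after`."""
    return (before - after) / before

baseline = 144.2                                  # Gen. PPL without perturbation
r_09 = relative_reduction(baseline, 69.76)        # ξ = 0.9
r_08 = relative_reduction(baseline, 54.15)        # ξ = 0.8
print(f"{r_09:.1%}, {r_08:.1%}")                  # 51.6%, 62.4%
```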

3. Forward Process Scheduling

  • Larger γ values make single-step denoising tasks more difficult, but actual inference performance improves
  • Best generation perplexity 135.9 achieved at γ=3

Downstream Task Performance

On multiple comprehension tasks, HDLM-small achieves average accuracy of 39.62%, outperforming GIDD's 38.53%, demonstrating strong generalization capability.

Related Work

1. Discrete Diffusion Model Development

  • D3PM: Established theoretical foundations for discrete diffusion
  • SEDD: Learning concrete scores as marginal distribution ratios
  • MDLM/MD4: Simplified training objectives for masked forward processes

2. Scaling Diffusion Language Models

  • LLaDA and Dream: Demonstrated scalability potential of diffusion language models
  • Block Diffusion: Explored new paradigm of generating text blocks autoregressively and diffusing within blocks

3. Advantages of This Work

  • Provides a new noising procedure that is conceptually simple and practically effective
  • Maintains self-correction capability while avoiding the drawbacks of uniform noise
  • Establishes a rigorous theoretical framework with a closed-form ELBO

Conclusions and Discussion

Main Conclusions

  1. HDLM effectively improves discrete diffusion language modeling through "next semantic scale prediction" scheme
  2. Hierarchical semantic structure provides richer intermediate representations than traditional masking
  3. Stochastic perturbation mechanism significantly enhances model self-correction capability
  4. Theoretical framework demonstrates good generality and extensibility

Limitations

  1. Clustering Quality Dependency: Currently uses predefined K-means clustering; clustering quality significantly impacts performance
  2. Computational Complexity: Multi-level structure may increase computational overhead in training and inference
  3. Hyperparameter Sensitivity: Requires careful tuning of hyperparameters such as weight clipping for training stability

Future Directions

  1. Explore more sophisticated hierarchical structure learning methods (e.g., DeepSets)
  2. Investigate implementation and optimization of multiple intermediate levels
  3. Extend framework to larger-scale language models
  4. Explore applications in multimodal tasks

In-Depth Evaluation

Strengths

  1. Solid Theoretical Contribution: Provides complete CTMC theoretical framework with rigorous mathematical derivations
  2. Strong Method Innovation: First to introduce semantic hierarchy into discrete diffusion language models
  3. Comprehensive Experimental Design: Includes thorough ablation studies and comparative experiments
  4. High Practical Value: Proposed techniques can be directly applied to existing diffusion model frameworks

Weaknesses

  1. Scale Limitations: Experiments primarily conducted on small-to-medium scale models; insufficient large-scale validation
  2. Simple Clustering Method: Current semantic clustering approach is relatively basic, potentially limiting performance ceiling
  3. Generation Quality Assessment: Primarily relies on perplexity metrics; lacks human evaluation and diversity analysis

Impact

  1. Academic Contribution: Provides new research direction for discrete diffusion language modeling
  2. Practical Value: Simple and easy-to-implement method with potential for real-world application
  3. Reproducibility: Authors provide complete code implementation and detailed experimental settings

Applicable Scenarios

  1. Text Generation Tasks: Particularly suitable for generation scenarios requiring progressive refinement
  2. Controlled Text Generation: Hierarchical structure facilitates implementation of control at different granularities
  3. Text Editing and Revision: Self-correction capability makes it suitable for text modification tasks

References

The paper cites important works in diffusion models, language modeling, and discrete state space modeling, including D3PM, MDLM, GIDD and other key foundational works, as well as classic language models such as GPT series and BERT.