Efficient Parallel Samplers for Recurrent-Depth Models and Their Connection to Diffusion Language Models
Geiping, Yang, Su
Language models with recurrent depth, also referred to as universal or looped when considering transformers, are defined by the capacity to increase their computation through the repetition of layers. Recent efforts in pretraining have demonstrated that these architectures can scale to modern language modeling tasks while exhibiting advantages in reasoning tasks. In this work, we examine the relationship between recurrent-depth models and diffusion language models. Building on their similarities, we develop a new diffusion forcing sampler for these models to accelerate generation. The sampler advances by decoding new tokens at every forward pass of the model, while the latent states of these tokens can be further refined in parallel through recurrence. Theoretically, generation with our sampler is strictly more expressive than the baseline autoregressive generation using the same time budget on modern hardware. Moreover, this sampler, based on principles from diffusion literature, can be directly applied to existing 3.5B recurrent-depth transformers without any tuning, leading to up to a 5x speedup. Consequently, our findings not only provide an efficient mechanism for parallelizing the extra computation in recurrent-depth models at inference, but also suggest that such models can be naturally viewed as strong continuous, though causal, diffusion language models.
This paper investigates the connection between language models with recurrent depth (also known as universal or looped transformers) and diffusion language models. Recurrent-depth models increase their computation by repeating layers, and have shown advantages on reasoning tasks. Based on the similarities between the two model classes, the authors develop a diffusion forcing sampler that accelerates generation. The sampler decodes new tokens at every forward pass, while the latent states of those tokens continue to be refined in parallel through recurrence. Theoretically, under the same time budget on modern hardware, generation with this sampler is strictly more expressive than baseline autoregressive generation. Importantly, the sampler can be applied directly to existing 3.5B-parameter recurrent-depth transformers without any tuning, achieving up to a 5× speedup.
Traditional large language models use fixed-depth architectures with a relatively small number of layers (typically in the double digits). This design is efficient to train and performs well on most tasks, but it is limited on complex tasks that require multi-step logical reasoning, such as mathematics and programming. From a complexity-theory perspective, fixed-depth transformers fall within the TC0 complexity class, which restricts their expressive capacity.
Inference Efficiency Issues: Although recurrent-depth models have greater expressive power, generation is slow because the recurrence steps for each token must execute sequentially.
Parallelization Requirements: Modern GPU architectures offer ample parallel compute, but traditional autoregressive generation cannot fully exploit it.
Theoretical Contribution: Clarifies the connection between recurrent-depth models and diffusion models, establishing a theoretical bridge through diffusion forcing and block/wave-based inference strategies
Methodological Innovation: Proposes a diffusion forcing sampler applicable to recurrent-depth models, enabling parallelization of the inference process
Experimental Validation: Validates the method's effectiveness on the 3.5B-parameter Huginn-0125 model, achieving up to a 5× speedup on benchmarks including GSM8K, MATH500, HumanEval, and MBPP while maintaining comparable accuracy
Practical Value: The sampler can be directly applied to existing recurrent-depth models without retraining or fine-tuning
Given a recurrent-depth model and input prompt x, the objective is to accelerate the text generation process while maintaining generation quality. Specifically, the goal is to generate more tokens within the same time budget, or reduce generation time for the same number of tokens.
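The time-budget argument can be illustrated with an idealized count of forward passes. The sketch below is not the paper's sampler: it assumes that one new token position enters per forward pass, that every active position advances by exactly one recurrence step per pass, and that a pass over several positions costs about the same as a pass over one. The function names and the fixed `depth` are hypothetical and chosen only for illustration.

```python
def sequential_steps(num_tokens: int, depth: int) -> int:
    """Passes needed when each token finishes all `depth` steps before the next starts."""
    return num_tokens * depth


def wavefront_schedule(num_tokens: int, depth: int) -> list[list[tuple[int, int]]]:
    """For each forward pass, list the (position, recurrence_step) pairs updated together.

    Assumption: position p enters at pass p and advances one step per pass.
    """
    passes = []
    for t in range(num_tokens + depth - 1):
        active = [(p, t - p) for p in range(num_tokens) if 0 <= t - p < depth]
        passes.append(active)
    return passes


schedule = wavefront_schedule(num_tokens=4, depth=3)
for t, active in enumerate(schedule):
    print(t, active)

# Same total work (12 position-steps), but packed into fewer sequential passes.
print(sequential_steps(4, 3), len(schedule))
```

Under these assumptions the number of sequential passes drops from `num_tokens * depth` to `num_tokens + depth - 1`. Real speedups are smaller, since wider passes are not free and the number of recurrence steps per token need not be fixed.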
The recurrence is conditioned on the embedded input e, which allows the sampler to perform "path correction" when the conditioning changes, without discarding partially computed states.
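A toy scalar example may help show why conditioning every step on e makes such correction possible. The update rule below is a hypothetical contraction with fixed point 2 * e, standing in for the model's recurrent block; it is an assumption made for illustration and not the model's actual update.

```python
def recur(state: float, e: float) -> float:
    """One toy recurrence step conditioned on e; its fixed point is 2 * e."""
    return 0.5 * state + e


def refine(state: float, e: float, steps: int) -> float:
    """Apply the toy recurrence `steps` times from the given starting state."""
    for _ in range(steps):
        state = recur(state, e)
    return state


# Refine toward the fixed point for the first conditioning value (2 * 1.0 = 2.0).
state = refine(0.0, e=1.0, steps=4)

# The conditioning changes midway. Keep the partially computed state ...
warm = refine(state, e=1.2, steps=4)
# ... versus discarding it and restarting from zero with the same step budget.
cold = refine(0.0, e=1.2, steps=4)

target = 2 * 1.2
print(abs(warm - target), abs(cold - target))
```

Because each step pulls the state toward the fixed point of the current conditioning, the state that is kept ends up closer to the new target than a restart given the same number of steps.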
KV caches can be shared across recurrence depths, significantly reducing memory usage. Experiments show that the model naturally supports this sharing: only the KV states from the most recent recurrence step need to be stored for each token position.
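The memory effect of sharing can be sketched with a toy cache that keeps one entry per token position and overwrites it at each recurrence step. The class names and the string placeholders standing in for key/value tensors are hypothetical; this is a minimal illustration of the bookkeeping, not the model's cache implementation.

```python
class SharedKVCache:
    """Toy cache: one entry per token position, overwritten at each recurrence step."""

    def __init__(self) -> None:
        self.entries: dict[int, tuple[int, str]] = {}

    def write(self, position: int, step: int, kv: str) -> None:
        # A later recurrence step replaces the entry written by the earlier step.
        self.entries[position] = (step, kv)

    def __len__(self) -> int:
        return len(self.entries)


class PerDepthKVCache:
    """Baseline for comparison: one entry per (position, recurrence step)."""

    def __init__(self) -> None:
        self.entries: dict[tuple[int, int], str] = {}

    def write(self, position: int, step: int, kv: str) -> None:
        self.entries[(position, step)] = kv

    def __len__(self) -> int:
        return len(self.entries)


num_tokens, depth = 8, 4
shared, per_depth = SharedKVCache(), PerDepthKVCache()
for position in range(num_tokens):
    for step in range(depth):
        kv = f"kv(pos={position}, step={step})"  # placeholder for real tensors
        shared.write(position, step, kv)
        per_depth.write(position, step, kv)

print(len(shared), len(per_depth))
```

In this toy setting the shared cache grows with the number of tokens only, while the per-depth baseline grows with tokens times recurrence steps.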
The paper cites extensive related work, including:
Dehghani et al. (2019): Original Universal Transformers work
Chen et al. (2024a): Diffusion Forcing method
Geiping et al. (2025): Huginn-0125 recurrent-depth model
Rombach et al. (2022): Latent space diffusion models
Leviathan et al. (2023): Speculative decoding method
Overall Assessment: This is a high-quality research paper with significant contributions in both theoretical innovation and practical value. The paper successfully establishes connections between two important model classes and proposes practical acceleration methods. While certain limitations exist, it provides valuable directions and foundations for future research.