
Efficient Parallel Samplers for Recurrent-Depth Models and Their Connection to Diffusion Language Models

Geiping, Yang, Su

Language models with recurrent depth, also referred to as universal or looped when considering transformers, are defined by the capacity to increase their computation through the repetition of layers. Recent efforts in pretraining have demonstrated that these architectures can scale to modern language modeling tasks while exhibiting advantages in reasoning tasks. In this work, we examine the relationship between recurrent-depth models and diffusion language models. Building on their similarities, we develop a new diffusion forcing sampler for these models to accelerate generation. The sampler advances by decoding new tokens at every forward pass of the model, while the latent states of these tokens can be further refined in parallel through recurrence. Theoretically, generation with our sampler is strictly more expressive than the baseline autoregressive generation using the same time budget on modern hardware. Moreover, this sampler, based on principles from diffusion literature, can be directly applied to existing 3.5B recurrent-depth transformers without any tuning, leading to up to a 5x speedup. Consequently, our findings not only provide an efficient mechanism for parallelizing the extra computation in recurrent-depth models at inference, but also suggest that such models can be naturally viewed as strong continuous, though causal, diffusion language models.

Basic Information

Paper ID: 2510.14961
Title: Efficient Parallel Samplers for Recurrent-Depth Models and Their Connection to Diffusion Language Models
Authors: Jonas Geiping, Xinyu Yang, Guinan Su
Classification: cs.LG cs.CL
Publication Date: October 16, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.14961

Abstract

This paper investigates the connection between language models with recurrent depth (also known as universal or looped transformers) and diffusion language models. Recurrent-depth models increase their computation by repeating layers and show advantages on reasoning tasks. Building on the similarities between the two model classes, the authors develop a new diffusion forcing sampler to accelerate generation. The sampler decodes new tokens at every forward pass while the latent states of those tokens are refined further, in parallel, through recurrence. Theoretically, under the same time budget on modern hardware, generation with this sampler is strictly more expressive than baseline autoregressive generation. The sampler can also be applied directly to existing 3.5B-parameter recurrent-depth transformers without any tuning, achieving up to 5× speedup.

Research Background and Motivation

Problem Definition

Traditional large language models use fixed-depth architectures with relatively few layers (typically numbering in the tens). This design trains efficiently and handles most tasks well, but it is limiting for complex tasks that require multi-step logical reasoning, such as mathematics and programming. From a complexity-theory perspective, fixed-depth transformers fall within the TC⁰ complexity class, which restricts their expressive capacity.

Research Motivation

Computational Capacity Limitations: Fixed-depth models struggle with multi-step logical chains requiring conceptual leaps
Inference Efficiency Issues: Although recurrent-depth models have greater expressive power, generation is slow as each recursion must execute sequentially
Parallelization Requirements: Modern GPU architectures provide opportunities for parallel computation, but traditional autoregressive generation cannot fully exploit them

Limitations of Existing Methods

Chain-of-Thought Approaches: Require externalizing internal reasoning as small steps, increasing sequence length
Recurrent-Depth Models: While expressively powerful, each recursive step during inference must execute serially, resulting in slow generation
Traditional Parallelization Methods: Approaches like speculative decoding are primarily designed for fixed-depth models

Core Contributions

Theoretical Contribution: Clarifies the connection between recurrent-depth models and diffusion models, establishing a theoretical bridge through diffusion forcing and block/wave-based inference strategies
Methodological Innovation: Proposes a diffusion forcing sampler applicable to recurrent-depth models, enabling parallelization of the inference process
Experimental Validation: Validates the method on the 3.5B-parameter Huginn-0125 model, achieving speedups of 4.36× to 4.81× on GSM8K, MATH500, HumanEval, and MBPP while keeping accuracy within about 2.5 percentage points of the static autoregressive baseline
Practical Value: The sampler can be directly applied to existing recurrent-depth models without retraining or fine-tuning

Methodology Details

Task Definition

Given a recurrent-depth model and input prompt x, the objective is to accelerate the text generation process while maintaining generation quality. Specifically, the goal is to generate more tokens within the same time budget, or reduce generation time for the same number of tokens.

Model Architecture

Recurrent-Depth Model Structure

The recurrent-depth model used in this work (Huginn-0125) comprises three main components:

Prelude Block P: Projects embedded input tokens to latent space
Recurrent Block R: Iterates r times in latent space, performing reasoning by repeatedly refining the latent state vector
Coda Block C: Processes latent states and produces probability distributions for the next token

Mathematical formulation:

e = P(x)
s₀ ~ N(0, σ²I)
sᵢ = R(e, sᵢ₋₁) for i ∈ {1, ..., r}
p = C(sᵣ)
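
The four equations above can be sketched as a toy forward pass. This is a minimal illustration under stated assumptions, not the Huginn-0125 implementation: the linear maps standing in for P, R, and C, the toy dimensions, and the names `prelude`, `recurrent_block`, `coda`, and `forward` are all hypothetical, and each position is processed independently (no attention).

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D = 11, 8  # toy sizes, assumed for illustration only

# Stand-in parameters; in the real model these blocks are transformer layers.
W_embed = rng.normal(size=(VOCAB, D)) / np.sqrt(D)
W_e = rng.normal(size=(D, D)) / np.sqrt(D)
W_s = rng.normal(size=(D, D)) / np.sqrt(D)
W_out = rng.normal(size=(D, VOCAB)) / np.sqrt(D)


def prelude(tokens):
    """e = P(x): map token ids to latent-space embeddings."""
    return W_embed[tokens]


def recurrent_block(e, s):
    """s_i = R(e, s_{i-1}): one recurrence step, conditioned on e."""
    return np.tanh(e @ W_e + s @ W_s)


def coda(s):
    """p = C(s_r): turn the final latent state into next-token probabilities."""
    logits = s @ W_out
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)


def forward(tokens, r, sigma=1.0):
    e = prelude(tokens)
    s = sigma * rng.normal(size=e.shape)  # s_0 ~ N(0, sigma^2 I)
    for _ in range(r):
        s = recurrent_block(e, s)
    return coda(s)


probs = forward(np.array([1, 4, 7]), r=4)
```

Because e is fed into every recurrence step, changing r only changes the number of loop iterations, not the parameters.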

Diffusion Forcing Sampler Design

The core idea applies diffusion forcing principles to recurrent-depth models, implementing "diagonal" parallelization:

Parallel Token Generation: Simultaneously process multiple token positions in each forward pass
Iterative Refinement: Progressively refine the latent states of all active tokens through further recurrence steps
Dynamic Freezing: Adaptive exit mechanism based on latent space distance
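
The "diagonal" parallelization described above can be illustrated with a simplified schedule. The sketch below assumes a fixed depth r, one new token entering per forward pass, and one recurrence step per pass; it ignores adaptive exit and token drafting, and the function name `diagonal_schedule` is hypothetical.

```python
def diagonal_schedule(num_tokens, r):
    """For each forward pass, list the (position, recurrence_step) pairs
    that are processed together along one diagonal of the wavefront."""
    passes = []
    for t in range(num_tokens + r - 1):
        # Position j enters at pass j and is at step t - j + 1 during pass t.
        active = [(j, t - j + 1) for j in range(num_tokens) if 1 <= t - j + 1 <= r]
        passes.append(active)
    return passes


schedule = diagonal_schedule(num_tokens=5, r=3)
sequential_steps = 5 * 3         # one (position, step) pair at a time
parallel_passes = len(schedule)  # pairs on the same diagonal run together
```

In this simplified setting the number of sequential passes drops from num_tokens × r to num_tokens + r − 1, at the cost of processing several positions per pass.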

Technical Innovations

1. Input Injection Mechanism

The recursive process is conditioned on embedded input e, allowing the sampler to perform "path correction" when conditions change without discarding partially computed states.

2. KV Cache Sharing

Different recurrence depths can share KV caches, which significantly reduces memory usage. Experiments show the model naturally supports KV cache sharing, so only the KV states from the latest recurrence step need to be stored for each token position.
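
As a rough illustration of storing only the latest recurrence's KV states per position, the toy cache below overwrites an entry each time a deeper step is written. The class `SharedKVCache` and its methods are hypothetical and do not reflect the model's actual cache layout.

```python
import numpy as np


class SharedKVCache:
    """Toy cache that keeps a single (key, value) pair per token position."""

    def __init__(self):
        self._entries = {}

    def write(self, position, key, value):
        # Overwrite: a deeper recurrence step replaces the shallower one.
        self._entries[position] = (key, value)

    def read(self, positions):
        keys = np.stack([self._entries[p][0] for p in positions])
        values = np.stack([self._entries[p][1] for p in positions])
        return keys, values

    def __len__(self):
        return len(self._entries)


cache = SharedKVCache()
for step in range(1, 5):      # four recurrence steps
    for pos in range(3):      # three token positions
        cache.write(pos, np.full(2, float(step)), np.full(2, -float(step)))

keys, values = cache.read([0, 1, 2])
```

The cache holds three entries after twelve writes, so its size grows with sequence length rather than with sequence length times recurrence depth.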

3. Adaptive Exit Strategy

Uses normalized distance in latent space as exit criterion:

δᵢ = ||zᵢ - z_prev,ᵢ||₂ / ||zᵢ||₂

When δᵢ < ε, the corresponding token is frozen and added to the KV cache.
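
The exit criterion can be written out as a small NumPy sketch. The function names and the `tiny` guard against division by zero are assumptions added for illustration; the threshold default follows the ε=0.03 setting reported in this summary.

```python
import numpy as np


def normalized_distance(z, z_prev, tiny=1e-12):
    """delta_i = ||z_i - z_prev_i||_2 / ||z_i||_2, computed per token position."""
    num = np.linalg.norm(z - z_prev, axis=-1)
    den = np.linalg.norm(z, axis=-1)
    return num / np.maximum(den, tiny)  # `tiny` is an assumed safety guard


def frozen_mask(z, z_prev, eps=0.03):
    """True where a position's latent state has stopped changing enough to freeze."""
    return normalized_distance(z, z_prev) < eps


z_prev = np.array([[1.0, 0.0], [1.0, 1.0]])
z = np.array([[1.0, 0.01], [2.0, 1.0]])
mask = frozen_mask(z, z_prev)
```

Here the first position moved by about 1% of its norm and is frozen, while the second moved by about 45% and stays active.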

4. Stabilization Components

Momentum Mechanism: Adds momentum to input condition e: e = η·e_prev + (1-η)·P(y_current)
Noise Injection: Adds noise at each sampling step: z' = (1-βₜ)z + βₜ·z_noise
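
Both stabilization formulas are simple convex combinations and can be sketched directly. In this sketch `e_new` stands for P(y_current), and drawing `z_noise` from a Gaussian with scale `sigma` is an assumption; the function names are hypothetical.

```python
import numpy as np


def momentum_condition(e_prev, e_new, eta=0.1):
    """e = eta * e_prev + (1 - eta) * P(y_current), with e_new = P(y_current)."""
    return eta * e_prev + (1.0 - eta) * e_new


def inject_noise(z, beta, rng, sigma=1.0):
    """z' = (1 - beta) * z + beta * z_noise, with z_noise ~ N(0, sigma^2 I) (assumed)."""
    z_noise = sigma * rng.normal(size=z.shape)
    return (1.0 - beta) * z + beta * z_noise


rng = np.random.default_rng(0)
e = momentum_condition(np.zeros(4), np.ones(4), eta=0.1)
z = np.ones(4)
z_same = inject_noise(z, beta=0.0, rng=rng)   # beta = 0 leaves z unchanged
z_noisy = inject_noise(z, beta=0.3, rng=rng)
```

With η=0.1 the new condition dominates (weight 0.9), and with βₜ=0 the noise term vanishes.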

Experimental Setup

Datasets

GSM8K: Mathematical reasoning task using CoT version with 8-shot setting
MATH500: High-difficulty mathematical problems
HumanEval: Code generation task
MBPP: Python programming problems

Evaluation Metrics

Accuracy: Task-specific accuracy indicators
Generation Speed (Tokens/Second): Tokens generated per second, measured using CUDA events

Baseline Methods

Static Autoregressive (AR): Baseline methods with different recursion steps (r=4,8,32,64)
Adaptive Computation AR: Adaptive computation sampler from original work
Speculative Decoding: Carefully tuned self-speculative decoding baseline

Implementation Details

Batch size: 1 (single sequence inference)
Temperature: 0.2, top-p: 0.95
Default parameters: r'=4, ε=0.03, βₜ=0, η=0.1
Maximum wavefront size: 128
Hardware: A100-40GB GPU
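
For reference, the settings listed above could be collected in a single configuration object. The class and field names below are hypothetical and not taken from any released code; only the default values come from this summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplerConfig:
    """Hypothetical container for the reported default settings."""
    inner_recurrence: int = 4      # r'
    exit_threshold: float = 0.03   # epsilon
    noise_coeff: float = 0.0       # beta_t
    momentum: float = 0.1          # eta
    max_wavefront: int = 128
    temperature: float = 0.2
    top_p: float = 0.95
    batch_size: int = 1

    def __post_init__(self):
        # Basic range checks; the bounds are assumptions, not from the paper.
        if self.inner_recurrence < 1:
            raise ValueError("inner_recurrence must be at least 1")
        if not 0.0 <= self.noise_coeff <= 1.0:
            raise ValueError("noise_coeff must lie in [0, 1]")
        if not 0.0 < self.top_p <= 1.0:
            raise ValueError("top_p must lie in (0, 1]")


cfg = SamplerConfig()
noisy_cfg = SamplerConfig(inner_recurrence=2, noise_coeff=0.2)
```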

Experimental Results

Main Results

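Compared with the static AR baseline at r=32, the diffusion forcing sampler achieves speedups of 4.36× to 4.81× across the four benchmarks, with accuracy changes between -2.44 and +0.40 percentage points: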

Sampler	GSM8K (Acc / tokens/s)	MATH500 (Acc / tokens/s)	HumanEval (Acc / tokens/s)	MBPP (Acc / tokens/s)
Static AR (r=32)	41.77% / 36.1	17.60% / 6.4	22.56% / 13.5	31.60% / 15.3
Diff. Sampler	42.08% / 157.3	18.00% / 30.3	20.12% / 64.9	31.00% / 70.2
Relative Improvement	+0.31 pts / 4.36×	+0.40 pts / 4.73×	-2.44 pts / 4.81×	-0.60 pts / 4.59×
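
The relative-improvement row can be reproduced from the two rows above it. The short script below uses only the numbers in the table; the dictionary and function names are chosen for illustration.

```python
# (accuracy in %, tokens per second) for each benchmark, copied from the table
static_ar = {
    "GSM8K": (41.77, 36.1),
    "MATH500": (17.60, 6.4),
    "HumanEval": (22.56, 13.5),
    "MBPP": (31.60, 15.3),
}
diff_sampler = {
    "GSM8K": (42.08, 157.3),
    "MATH500": (18.00, 30.3),
    "HumanEval": (20.12, 64.9),
    "MBPP": (31.00, 70.2),
}


def relative_improvement(baseline, candidate):
    """Return (accuracy change in percentage points, throughput ratio)."""
    acc_delta = round(candidate[0] - baseline[0], 2)
    speedup = round(candidate[1] / baseline[1], 2)
    return acc_delta, speedup


summary = {
    name: relative_improvement(static_ar[name], diff_sampler[name])
    for name in static_ar
}
for name, (acc_delta, speedup) in summary.items():
    print(f"{name}: {acc_delta:+.2f} pts, {speedup:.2f}x")
```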

Ablation Studies

Hyperparameter Sensitivity Analysis

Internal Recursion Steps r': Increasing r' improves accuracy but reduces throughput; r'=4 provides optimal balance
Exit Threshold ε: Smaller ε values improve accuracy but reduce speed; ε=0.03 is recommended
Noise Coefficient βₜ: With smaller r', moderate noise (βₜ=0.2-0.3) aids stability
Wavefront Size: 64-128 is optimal for A100 GPU

Model Variant Verification

Robustness verified across different model checkpoints:

SWA Model: Weight-averaged version
Math Fine-tuned Model: Version fine-tuned on MetaMath dataset

All variants show consistent 4-5× speedup with accuracy deviation within 0.5-1%.

Theoretical Analysis Verification

Depth vs. Width Scaling

Experiments verify theoretical analysis predictions:

Prefill Phase: Depth scaling outperforms width scaling
Decoding Phase: Diffusion forcing sampler achieves better width scaling
Expressiveness: Under the same time budget, generation with the diffusion forcing sampler is strictly more expressive than baseline autoregressive generation

Related Work

Recurrent Model Research

Historical Development: Evolution from early RNNs to universal transformers
Theoretical Foundation: Computational capacity of universal Turing machines and complexity classes
Practical Applications: Advantages in algorithmic learning and reasoning tasks

Diffusion Language Models

Continuous-Domain Diffusion: Successful applications in image generation
Discrete-Domain Diffusion: Challenges and solutions for text generation
Inference Strategies: Methods including block diffusion and diffusion forcing

Inference Acceleration Techniques

Speculative Decoding: Utilizing small models for drafting and large models for verification
Parallelization Strategies: Trade-offs between memory-bound and compute-bound operations

Conclusions and Discussion

Main Conclusions

Theoretical Contribution: Establishes theoretical connection between recurrent-depth models and diffusion models
Practical Value: Achieves up to roughly 5× inference speedup (4.36× to 4.81× on the reported benchmarks) while keeping accuracy close to the baseline
Generality: Method can be directly applied to existing models without retraining
New Perspective: Recurrent-depth models can be viewed as continuous causal diffusion language models

Limitations

Batch Processing Constraints: Current implementation only supports single-sequence inference; batch scenarios require complex inference engines
FLOP Efficiency: While increasing parallelism, actual FLOP usage increases
Hardware Dependency: Optimal parameter settings depend on specific hardware configurations
Model Requirements: Requires models satisfying specific architectural requirements (input injection, robust recursion, etc.)

Future Directions

Batch Inference Engine: Develop systems supporting large-scale batch inference
Architecture Optimization: Design recurrent-depth architectures better suited for diffusion forcing sampling
Training Objectives: Explore unrolling objectives in diffusion language modeling
Theoretical Deepening: Further investigate theoretical foundations of recurrent-depth models as diffusion models

In-Depth Evaluation

Strengths

Strong Innovation: Connects recurrent-depth models with diffusion models, providing a new theoretical perspective on both
High Practical Value: Achieves significant inference speedup directly applicable to existing models
Rigorous Theory: Provides theoretical analysis of depth vs. width scaling and convergence proofs
Comprehensive Experiments: Validates method effectiveness and robustness across multiple benchmarks and model variants

Weaknesses

Limited Applicability: Method requires models satisfying specific architectural requirements, limiting generalizability
Insufficient Batch Processing Support: Single-sequence inference limits production environment applications
Memory Overhead: Despite KV cache sharing, additional latent state storage required
Parameter Sensitivity: Multiple hyperparameters require task and hardware-specific tuning

Impact

Academic Contribution: Provides new intersection point for recurrent-depth and diffusion model research
Engineering Value: Offers new technical pathway for large model inference optimization
Inspirational Significance: May inspire further research on combining model architectures with sampling strategies
Applicable Scenarios: Single-user inference, reasoning-intensive tasks, resource-constrained environments, and research prototyping

References

The paper cites extensive related work, including:

Dehghani et al. (2019): Original Universal Transformers work
Chen et al. (2024a): Diffusion Forcing method
Geiping et al. (2025): Huginn-0125 recurrent-depth model
Rombach et al. (2022): Latent space diffusion models
Leviathan et al. (2023): Speculative decoding method

Overall Assessment: This is a high-quality research paper with significant contributions in both theoretical innovation and practical value. The paper successfully establishes connections between two important model classes and proposes practical acceleration methods. While certain limitations exist, it provides valuable directions and foundations for future research.