PipeDiT: Accelerating Diffusion Transformers in Video Generation with Task Pipelining and Model Decoupling
Wang, Wang, Shi
Video generation has been advancing rapidly, and diffusion transformer (DiT) based models have demonstrated remarkable capabilities. However, their practical deployment is often hindered by slow inference speeds and high memory consumption. In this paper, we propose a novel pipelining framework named PipeDiT to accelerate video generation, which is equipped with three main innovations. First, we design a pipelining algorithm (PipeSP) for sequence parallelism (SP) to enable the computation of latent generation and communication among multiple GPUs to be pipelined, thus reducing inference latency. Second, we propose DeDiVAE to decouple the diffusion module and the variational autoencoder (VAE) module into two GPU groups, whose executions can also be pipelined to reduce memory consumption and inference latency. Third, to better utilize the GPU resources in the VAE group, we propose an attention co-processing (Aco) method to further reduce the overall video generation latency. We integrate our PipeDiT into both OpenSoraPlan and HunyuanVideo, two state-of-the-art open-source video generation frameworks, and conduct extensive experiments on two 8-GPU systems. Experimental results show that, under many common resolution and timestep configurations, our PipeDiT achieves 1.06x to 4.02x speedups over OpenSoraPlan and HunyuanVideo.
Video generation technology has advanced rapidly, with diffusion transformer (DiT) based models demonstrating exceptional capabilities. However, practical deployment is hindered by slow inference and high memory consumption. This paper proposes the PipeDiT framework, which accelerates video generation through three innovations: (1) the PipeSP algorithm, which pipelines computation and communication in sequence parallelism; (2) the DeDiVAE method, which decouples the diffusion module and the VAE decoder onto different GPU groups; and (3) Aco (attention co-processing), which improves utilization of the GPUs in the VAE group. Experiments on OpenSoraPlan and HunyuanVideo show that PipeDiT achieves 1.06× to 4.02× speedups.
Practical requirements: Video generation services need to handle multiple concurrent queries, with inference efficiency directly impacting user experience and service costs
Hardware constraints: Experiments show that under 48GB GPU memory limits, OpenSoraPlan cannot generate videos exceeding 1024×576×97 resolution, while HunyuanVideo is limited to 256×128×33
DistriFusion and PipeFusion are designed for image generation and are unsuitable for video generation's long sequence characteristics
Video generation optimization methods:
Teacache and similar methods: Reduce computation by reusing timestep features, but may degrade generation quality
Sequence parallelism (SP) methods:
Ulysses: Implements parallelism by partitioning attention heads, but suffers from serial execution of computation and communication, with underutilized GPU resources
Ring-Attention: Supports higher parallelism but incurs large communication overhead
USP: Combines both but introduces additional communication overhead
Offloading strategies:
Reduce memory consumption through CPU-GPU data transfer, but introduce significant transfer overhead and poor efficiency
PipeSP Algorithm: Proposes a pipelined sequence parallelism method that overlaps computation and communication by partitioning along the attention head dimension and immediately triggering All-to-All communication, improving GPU utilization
DeDiVAE Module Decoupling: Allocates diffusion module and VAE decoder to different GPU groups, implementing module-level pipeline parallelism and significantly reducing peak memory consumption (up to 53.3% reduction for OpenSoraPlan)
Aco Attention Cooperative Processing: Decomposes DiT blocks at fine granularity into linear projections and attention computation, allowing idle decoding GPU groups to participate in attention computation, further improving overall efficiency
System Implementation and Verification: Implemented on OpenSoraPlan (2B parameters) and HunyuanVideo (13B parameters), with extensive experiments on 8-GPU systems demonstrating method effectiveness and scalability
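The module-level pipelining behind DeDiVAE can be illustrated with a toy latency model. This is a minimal sketch, not the paper's implementation: the function names and the per-query timings `denoise_s` and `decode_s` are hypothetical, and the model assumes a stream of queries with fixed stage times. It also ignores the change in denoising time that results from giving the diffusion module fewer GPUs.

```python
def colocated_latency(num_queries, denoise_s, decode_s):
    """Baseline: every query denoises and then decodes on the same GPUs."""
    return num_queries * (denoise_s + decode_s)


def decoupled_latency(num_queries, denoise_s, decode_s):
    """Two-stage pipeline: the decode group handles query i while the
    diffusion group is already denoising query i + 1."""
    denoise_done = 0.0  # time at which the diffusion group finishes the current query
    decode_done = 0.0   # time at which the decode group finishes the current query
    for _ in range(num_queries):
        denoise_done += denoise_s
        # Decoding starts once the latent exists AND the decode group is free.
        decode_done = max(decode_done, denoise_done) + decode_s
    return decode_done


if __name__ == "__main__":
    # Hypothetical timings: 10 s of denoising and 3 s of decoding per query.
    print(colocated_latency(4, 10.0, 3.0))
    print(decoupled_latency(4, 10.0, 3.0))
```

Under these assumptions, only the last query's decode remains exposed when denoising is the slower stage. The decode group also sits idle most of the time, which is the gap Aco is described as filling.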
In the original Ulysses, all attention head computations complete before a single All-to-All communication is executed, so GPUs remain idle while waiting for communication.
PipeSP design (Algorithm 1):
For each attention head j ∈ [0, h-1]:
1. Compute attention(Q[:,j,:,:], K[:,j,:,:], V[:,j,:,:])
2. Record CUDA event marking computation completion
3. Immediately trigger All-to-All communication after event completion
4. Collect results
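The four steps above can be sketched as a CPU-only analogy. This is an illustration under stated assumptions, not the authors' code: a single-worker thread pool stands in for an in-order CUDA communication stream, `fake_all_to_all` is a hypothetical placeholder for the real collective (it only splits the sequence into chunks), and the batch dimension is dropped so that `Q`, `K`, and `V` have shape `(heads, seq, dim)`.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def attention(q, k, v):
    """Softmax attention for one head; q, k, v have shape (seq, dim)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


def fake_all_to_all(head_out, world_size):
    """Hypothetical stand-in for All-to-All: split the sequence into chunks."""
    return np.array_split(head_out, world_size, axis=0)


def pipelined_heads(Q, K, V, world_size):
    """Compute heads one by one and hand each finished head to the
    'communication stream' immediately, instead of waiting for all heads."""
    pending = []
    # One worker keeps the simulated communications in submission order.
    with ThreadPoolExecutor(max_workers=1) as comm_stream:
        for j in range(Q.shape[0]):
            out_j = attention(Q[j], K[j], V[j])              # step 1
            # Steps 2-3: head j is done, so its communication starts now
            # while the loop moves on to head j + 1.
            pending.append(comm_stream.submit(fake_all_to_all, out_j, world_size))
        return [f.result() for f in pending]                 # step 4
```

In a real GPU implementation the hand-off would be ordered by a recorded CUDA event rather than a thread-pool submission. The sketch only shows the control flow in which the communication for head j overlaps with the computation of head j + 1.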
Post-processing alignment (resolving result misalignment):
The transformation sequence view(-1, h, n, D) → permute(0, 2, 1, 3) → view(-1, nh, D) maps the interleaved tensors to the head-contiguous layout expected by the original Ulysses.
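The realignment can be checked on a small array. The sketch below uses NumPy rather than PyTorch and rests on an interpretation that the summary does not state explicitly: `h` is taken to be the number of local heads per GPU, `n` the number of GPUs in the sequence-parallel group, and `D` the head dimension. Under that reading, per-head communication collects heads in (local head, source GPU) order, while the original layout orders them by (source GPU, local head). The helper `make_interleaved` is hypothetical and exists only to build test input.

```python
import numpy as np


def make_interleaved(expected, h, n):
    """Reorder heads from (gpu r, local head j) order to (j, r) order,
    mimicking what per-head collection would produce (assumed layout)."""
    return np.stack(
        [expected[:, r * h + j] for j in range(h) for r in range(n)], axis=1
    )


def realign(interleaved, h, n):
    """view(-1, h, n, D) -> permute(0, 2, 1, 3) -> view(-1, n*h, D)."""
    D = interleaved.shape[-1]
    x = interleaved.reshape(-1, h, n, D)   # split the head axis into (j, r)
    x = x.transpose(0, 2, 1, 3)            # swap to (r, j)
    return x.reshape(-1, n * h, D)         # merge back into one head axis


if __name__ == "__main__":
    S, h, n, D = 3, 2, 4, 5                # tokens, local heads, GPUs, head dim
    expected = np.arange(S * n * h * D).reshape(S, n * h, D)
    restored = realign(make_interleaved(expected, h, n), h, n)
    print(np.array_equal(restored, expected))
```

One practical detail when porting this to PyTorch: a `view` directly after `permute` generally fails because the tensor is no longer contiguous, so `.contiguous()` or `reshape` is needed at that step.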
Computation-communication overlap: PipeSP achieves effective communication hiding in Ulysses for the first time through fine-grained head-level pipelining
Module-level decoupling: DeDiVAE breaks traditional co-location design, achieving dual optimization of memory and computation through GPU group separation
Dynamic resource scheduling: Aco dynamically utilizes idle GPU resources based on workload, avoiding efficiency loss from traditional static allocation
Mathematical rigor: Provides formal correctness proof of PipeSP transformation, ensuring optimization does not alter computation results
Note: For HunyuanVideo, DeDiVAE's memory consumption is higher than that of offloading because the large text encoder is co-located with the VAE decoder; this demonstrates the method's flexible adaptability.
DistriFusion (Li et al. 2024b): Patch-level parallelism for image generation
Teacache (Liu et al. 2025): Timestep feature reuse method
OpenSoraPlan (PKU-YuanGroup 2025): Open-source video generation framework
HunyuanVideo (Kong et al. 2024): Large-scale video generation model
Overall Assessment: This is a high-quality systems optimization paper that addresses practical pain points in DiT inference for video generation with innovative solutions. The three technical innovations work together to form a complete optimization framework. The experimental design is comprehensive and the results are convincing. The main weaknesses are hardware dependency and limited depth in some analyses. The paper is a valuable reference for video generation service providers and systems optimization researchers. The authors are encouraged to open-source the code and verify long-term stability in production environments.