PipeDiT: Accelerating Diffusion Transformers in Video Generation with Task Pipelining and Model Decoupling

Wang, Wang, Shi

Video generation has been advancing rapidly, and diffusion transformer (DiT) based models have demonstrated remarkable capabilities. However, their practical deployment is often hindered by slow inference speeds and high memory consumption. In this paper, we propose a novel pipelining framework named PipeDiT to accelerate video generation, which is equipped with three main innovations. First, we design a pipelining algorithm (PipeSP) for sequence parallelism (SP) to enable the computation of latent generation and communication among multiple GPUs to be pipelined, thus reducing inference latency. Second, we propose DeDiVAE to decouple the diffusion module and the variational autoencoder (VAE) module into two GPU groups, whose executions can also be pipelined to reduce memory consumption and inference latency. Third, to better utilize the GPU resources in the VAE group, we propose an attention co-processing (Aco) method to further reduce the overall video generation latency. We integrate our PipeDiT into both OpenSoraPlan and HunyuanVideo, two state-of-the-art open-source video generation frameworks, and conduct extensive experiments on two 8-GPU systems. Experimental results show that, under many common resolution and timestep configurations, our PipeDiT achieves 1.06x to 4.02x speedups over OpenSoraPlan and HunyuanVideo.

Basic Information

Paper ID: 2511.12056
Title: PipeDiT: Accelerating Diffusion Transformers in Video Generation with Task Pipelining and Model Decoupling
Authors: Sijie Wang, Qiang Wang, Shaohuai Shi (Harbin Institute of Technology, Shenzhen Campus)
Classification: cs.CV, cs.AI, cs.DC
Publication Date: November 15, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2511.12056

Abstract

Video generation technology has advanced rapidly, with diffusion transformer (DiT) based models demonstrating exceptional capabilities. However, they face challenges of slow inference speed and high memory consumption in practical deployment. This paper proposes the PipeDiT framework, which accelerates video generation through three innovations: (1) PipeSP algorithm implements pipelining of computation and communication in sequence parallelism; (2) DeDiVAE method decouples the diffusion module and VAE decoder to different GPU groups; (3) Aco attention cooperative processing optimizes GPU utilization. Experiments on OpenSoraPlan and HunyuanVideo demonstrate that PipeDiT achieves 1.06× to 4.02× speedup.

Research Background and Motivation

Core Problems

Diffusion transformers (DiT) face two critical bottlenecks in video generation:

High inference latency: The inherent sequential nature of the reverse diffusion process severely limits parallelism
Large memory consumption: The VAE decoding phase consumes substantial memory due to upsampling to target resolution and frame rate

Problem Significance

Practical requirements: Video generation services need to handle multiple concurrent queries, with inference efficiency directly impacting user experience and service costs
Hardware constraints: Experiments show that under 48GB GPU memory limits, OpenSoraPlan cannot generate videos exceeding 1024×576×97 resolution, while HunyuanVideo is limited to 256×128×33

Limitations of Existing Methods

Image generation optimization methods:

DistriFusion and PipeFusion are designed for image generation and are unsuitable for video generation's long sequence characteristics

Video generation optimization methods:

Teacache and similar methods: Reduce computation by reusing timestep features, but may degrade generation quality
Sequence parallelism (SP) methods:
- Ulysses: Implements parallelism by partitioning attention heads, but suffers from serial execution of computation and communication, with underutilized GPU resources
- Ring-Attention: Supports higher parallelism but incurs large communication overhead
- USP: Combines both but introduces additional communication overhead

Offloading strategies:

Reduce memory consumption through CPU-GPU data transfer, but introduce significant transfer overhead and poor efficiency

Research Motivation

From performance analysis of OpenSoraPlan and HunyuanVideo (Figure 2):

Time bottleneck: Diffusion phase consumes far more time than other stages
Memory bottleneck: VAE decoding peak memory reaches 44GB (256×128×33 resolution)
Resource waste: Co-locating diffusion module and VAE decoder leads to serial execution and memory waste

Core Contributions

PipeSP Algorithm: Proposes a pipelined sequence parallelism method that partitions work along the attention head dimension and launches the All-to-All for each head as soon as that head's attention finishes, overlapping computation with communication and improving GPU utilization
DeDiVAE Module Decoupling: Allocates diffusion module and VAE decoder to different GPU groups, implementing module-level pipeline parallelism and significantly reducing peak memory consumption (up to 53.3% reduction for OpenSoraPlan)
Aco Attention Cooperative Processing: Decomposes DiT blocks at fine granularity into linear projections and attention computation, allowing idle decoding GPU groups to participate in attention computation, further improving overall efficiency
System Implementation and Verification: Implemented on OpenSoraPlan (2B parameters) and HunyuanVideo (13B parameters), with extensive experiments on 8-GPU systems demonstrating method effectiveness and scalability

Method Details

Task Definition

Video generation pipeline:

Input: Text prompts
Output: High-quality videos
Two-stage process:
1. Denoising stage: Diffusion model iteratively refines latent representations across multiple timesteps
2. Decoding stage: VAE decoder upsamples latent representations to full-resolution video

Model Architecture

1. PipeSP: Pipelined Sequence Parallelism

Original Ulysses problem:

All attention head computations complete before a single All-to-All communication is executed
GPUs remain idle while waiting for communication

PipeSP design (Algorithm 1):

For each attention head j ∈ [0, h-1]:
  1. Compute attention(Q[:,j,:,:], K[:,j,:,:], V[:,j,:,:])
  2. Record CUDA event marking computation completion
  3. Immediately trigger All-to-All communication after event completion
  4. Collect results
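
The loop above can be sketched in plain Python. This is only an illustration under stated assumptions: a single worker thread stands in for the communication stream, `fake_all_to_all` is a hypothetical placeholder that merely copies data, and the CUDA event of step 2 is replaced by the ordinary return of `attention`. None of these names come from the paper.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def attention(q, k, v):
    """Scaled dot-product attention for one head; q, k, v are lists of row vectors."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[d] for w, vj in zip(weights, v)) / total
                    for d in range(len(v[0]))])
    return out

def fake_all_to_all(head_out):
    # Hypothetical stand-in for the collective: it only copies the data.
    return [row[:] for row in head_out]

def ulysses_serial(Q, K, V):
    """Compute every head first, then communicate (the original ordering)."""
    outs = [attention(Q[j], K[j], V[j]) for j in range(len(Q))]
    return [fake_all_to_all(o) for o in outs]

def pipesp_like(Q, K, V):
    """Hand each head's output to the 'communication stream' as soon as it is ready."""
    pending = []
    with ThreadPoolExecutor(max_workers=1) as comm_stream:
        for j in range(len(Q)):
            out_j = attention(Q[j], K[j], V[j])                            # step 1
            pending.append(comm_stream.submit(fake_all_to_all, out_j))    # steps 2-3
        return [f.result() for f in pending]                               # step 4
```

Both functions return the same per-head results; only the point at which communication is launched differs, which is the property the pipelined variant relies on.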

Post-processing alignment (resolving result misalignment):

Through sequence transformation view(-1, h, n, D) → permute(0, 2, 1, 3) → view(-1, nh, D)
Maps interleaved tensors to the head-contiguous layout expected by original Ulysses

Mathematical correctness proof: Define the reshape mapping φ_{h,n} and the permutation operation π; the composite mapping Ψ = φ_{h,n}^{-1} ∘ π ∘ φ_{h,n} satisfies:

(ΨT_mod)[b, k_orig(i,j), d] = T_mod[b, k_mod(i,j), d]

ensuring optimized results are identical to the original implementation.
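
The realignment can be illustrated on a flat list, ignoring the batch and feature dimensions. In this sketch the index formulas `k_mod(i, j) = j*n + i` and `k_orig(i, j) = i*h + j` are assumptions chosen to match the view/permute/view sequence above; the summary does not spell them out.

```python
def realign(flat, h, n):
    """Apply view(h, n) -> transpose -> view(n*h) to a flat list of length h*n."""
    assert len(flat) == h * n
    grid = [flat[j * n:(j + 1) * n] for j in range(h)]        # view(h, n)
    return [grid[j][i] for i in range(n) for j in range(h)]   # permute, then flatten

def k_mod(i, j, n):
    # Assumed position of element (i, j) in the interleaved (pipelined) layout.
    return j * n + i

def k_orig(i, j, h):
    # Assumed position of element (i, j) in the layout the original code expects.
    return i * h + j

interleaved = list(range(6))         # h = 2, n = 3
aligned = realign(interleaved, 2, 3)
print(aligned)                       # [0, 3, 1, 4, 2, 5]
```

With these definitions, `aligned[k_orig(i, j, h)] == interleaved[k_mod(i, j, n)]` holds for every pair (i, j), which mirrors the identity stated for Ψ.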

2. DeDiVAE: Diffusion-VAE Module Decoupling

GPU grouping strategy:

Denoising group: N_denoise GPUs storing the diffusion backbone
Decoding group: N_decode = N - N_denoise GPUs storing the VAE decoder

Optimal GPU allocation: Based on first-order balance condition, equalizing execution time of both groups to maximize overlap:

N_decode ≈ ⌈(T_decode / (T_decode + T_denoise)) × N⌉

where T_denoise and T_decode are single-GPU denoising and decoding times respectively.
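
The balance condition follows from equating per-group times, T_denoise / N_denoise = T_decode / N_decode with N_denoise + N_decode = N, assuming each stage scales linearly with the number of GPUs. A minimal sketch of the resulting rule is below; the function name and the clamp that keeps at least one GPU in each group are assumptions added for illustration.

```python
import math
from fractions import Fraction

def split_gpus(t_denoise, t_decode, n_gpus):
    """Return (n_denoise, n_decode) from single-GPU stage times."""
    share = Fraction(t_decode) / (Fraction(t_decode) + Fraction(t_denoise))
    n_decode = math.ceil(share * n_gpus)
    # Assumption: both groups need at least one GPU.
    n_decode = min(max(n_decode, 1), n_gpus - 1)
    return n_gpus - n_decode, n_decode

print(split_gpus(t_denoise=90, t_decode=30, n_gpus=8))   # (6, 2)
```

Exact fractions are used so that a ratio landing on an integer is not pushed up by floating-point rounding before the ceiling is taken.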

Multi-prompt pipelining:

Decoding of the first prompt executes in parallel with denoising of the second prompt
Latent representations are passed through shared queues, implementing producer-consumer pattern
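
The producer-consumer pattern can be sketched with a thread and a queue. Here `denoise` and `decode` are hypothetical callables standing in for the two GPU groups, and the `None` sentinel is an illustrative shutdown convention rather than anything described in the paper.

```python
import queue
import threading

def run_pipeline(prompts, denoise, decode):
    """Decode prompt k while prompt k+1 is being denoised."""
    latents = queue.Queue()
    videos = []

    def decoder_worker():
        while True:
            item = latents.get()
            if item is None:          # sentinel: no more latents will arrive
                break
            videos.append(decode(item))

    worker = threading.Thread(target=decoder_worker)
    worker.start()
    for prompt in prompts:            # producer: the denoising group
        latents.put(denoise(prompt))
    latents.put(None)
    worker.join()
    return videos

result = run_pipeline(
    ["a", "b"],
    denoise=lambda p: "latent(" + p + ")",
    decode=lambda z: "video(" + z + ")",
)
print(result)   # ['video(latent(a))', 'video(latent(b))']
```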

3. Aco: Attention Cooperative Processing

Motivation: When denoising time far exceeds decoding time, decoding GPU groups remain idle most of the time

Fine-grained decomposition: Decompose DiT blocks into:

Linear projections: Q = XW_Q, K = XW_K, V = XW_V (executed by denoising group)
Attention kernel: Attn(Q,K,V) (can execute in parallel on decoding group)

Execution flow:

Prompt 1 phase (decoding queue empty):
1. Denoising group computes Q,K,V and sends to decoding group via P2P communication
2. Both groups execute attention computation in parallel
3. Results aggregated through All-to-All and P2P communication
Prompt 2 phase (decoding queue non-empty):
1. Denoising group executes attention computation independently
2. Decoding group executes VAE decoding in parallel
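
The two phases amount to a scheduling decision keyed on whether the decoding queue holds work. The sketch below only illustrates that decision; the function name and the role strings are hypothetical.

```python
import queue

def plan_step(decode_queue):
    """Return the role of each GPU group for the next attention step."""
    if decode_queue.empty():
        # Prompt 1 phase: the decoding group is idle, so it shares attention work.
        return {"denoise_group": "projections + attention share",
                "decode_group": "attention share"}
    # Prompt 2 phase: a latent is waiting, so the decoding group decodes it.
    return {"denoise_group": "projections + full attention",
            "decode_group": "VAE decoding"}
```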

Performance analysis: Theoretical speedup:

S = T_baseline / T_coop = (t_L + t_A) / (t_L + t_A × N_denoise/N)

where t_L and t_A are linear projection and attention computation times respectively.
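
As a quick numerical illustration of the formula, the sketch below evaluates S for hypothetical timings; the parameter names are chosen here for readability and the numbers are not measurements from the paper.

```python
def aco_speedup(t_linear, t_attn, n_denoise, n_total):
    """Theoretical speedup when attention is spread over all n_total GPUs."""
    baseline = t_linear + t_attn
    cooperative = t_linear + t_attn * n_denoise / n_total
    return baseline / cooperative

# Hypothetical timings: attention costs four times the projections, half the GPUs denoise.
print(aco_speedup(t_linear=1.0, t_attn=4.0, n_denoise=4, n_total=8))
```

When N_denoise equals N the ratio is exactly 1, and the gain grows as attention dominates the block time.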

Handling non-divisible attention heads:

OpenSoraPlan: Introduces head dimension padding to ensure load balancing
HunyuanVideo/Wan: Supports USP, allowing flexible switching between Ulysses and Ring-Attention degrees, avoiding padding overhead
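
One way to read the padding strategy is that the head count is rounded up to the next multiple of the GPU count. This reading is an assumption, since the summary does not give the exact rule, and the head counts below are illustrative only.

```python
def padded_head_count(num_heads, num_gpus):
    """Smallest multiple of num_gpus that is >= num_heads (integer ceiling)."""
    return -(-num_heads // num_gpus) * num_gpus

def padding_needed(num_heads, num_gpus):
    """Number of dummy heads added so every GPU receives the same share."""
    return padded_head_count(num_heads, num_gpus) - num_heads

print(padded_head_count(24, 5), padding_needed(24, 5))   # 25 1
```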

Technical Innovations

Computation-communication overlap: PipeSP achieves effective communication hiding in Ulysses for the first time through fine-grained head-level pipelining
Module-level decoupling: DeDiVAE breaks traditional co-location design, achieving dual optimization of memory and computation through GPU group separation
Dynamic resource scheduling: Aco dynamically utilizes idle GPU resources based on workload, avoiding efficiency loss from traditional static allocation
Mathematical rigor: Provides formal correctness proof of PipeSP transformation, ensuring optimization does not alter computation results

Experimental Setup

Test Platforms

System 1: 8× NVIDIA RTX A6000 (48GB)

CPU: Intel Xeon Platinum 8358 @2.60GHz
Interconnect: NVLink (112.5GB/s, 4×)

System 2: 8× NVIDIA L40 (48GB)

CPU: Intel Xeon Platinum 8358 @2.60GHz
Interconnect: PCIe 4.0 (x16)

Baseline Models

OpenSoraPlan v1.3.0: 2B parameters, using Ulysses sequence parallelism
HunyuanVideo: 13B parameters, integrating xDiT's USP

Evaluation Metrics

Single timestep latency: Measures PipeSP optimization effectiveness
End-to-end latency: Total time for generating multiple videos, measuring overall PipeDiT optimization
Peak GPU memory: Evaluates DeDiVAE memory optimization

Experimental Configuration

Resolution settings:

480×352 (65/97/129 frames)
640×352 (65/97/129 frames)
800×592 (65/97/129 frames)
1024×576 (65/97/129 frames)

Timestep counts: 10, 20, 30, 40, 50
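
The sweep described above can be enumerated as a Cartesian product. The variable names and the dictionary layout are illustrative choices, not taken from the paper's code.

```python
from itertools import product

RESOLUTIONS = [(480, 352), (640, 352), (800, 592), (1024, 576)]
FRAME_COUNTS = [65, 97, 129]
TIMESTEPS = [10, 20, 30, 40, 50]

# One entry per (resolution, frame count, timestep count) combination.
grid = [
    {"resolution": res, "frames": frames, "steps": steps}
    for res, frames, steps in product(RESOLUTIONS, FRAME_COUNTS, TIMESTEPS)
]

print(len(grid))   # 60
```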

Prompt quantities: 10 prompts (main experiments), additional configurations in supplementary materials

Comparison methods:

Baseline: Original implementation + offloading
PipeDiT (w/o Aco): PipeSP + DeDiVAE
PipeDiT (w/ Aco): Complete method

Experimental Results

Main Results

End-to-end Performance (Table 1)

OpenSoraPlan (A6000):

Maximum speedup: 480×352×97, 10 steps → 2.12× (227s → 107s)
High resolution: 1024×576×97, 50 steps → 1.18× (2162s → 1832s)
Trend: Speedup is larger at lower resolutions, fewer frames, and fewer timesteps

HunyuanVideo (A6000):

Maximum speedup: 480×352×97, 10 steps → 3.27× (540s → 165s)
Large model advantage: Larger parameter count leads to higher offloading overhead, making PipeDiT optimization more effective
High resolution: 1024×576×97, 50 steps → 1.08× (3726s → 3453s)

Platform differences:

A6000 (NVLink) achieves higher speedup compared to L40 (PCIe)
Example: HunyuanVideo 480×352×97, 10 steps: A6000 3.27× vs L40 2.95×

Complete results in supplementary materials:

Maximum speedup reaches 4.02× (HunyuanVideo, 480×352×65, 10 steps)
Covers 12 resolutions × 5 timestep configurations, totaling 60 experiments

PipeSP Effectiveness (Table 2)

Optimal configuration: 640×352×129

OpenSoraPlan (A6000): 1.15× speedup (2.10s → 1.83s)
OpenSoraPlan (L40): 1.04× speedup (2.44s → 2.34s)

Performance characteristics:

Best results at medium resolutions (balancing computation and communication time)
Very low resolution: Communication overhead offsets gains
Very high resolution: Reduced communication proportion, lower optimization gains
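
A toy two-stage pipeline model reproduces these three regimes. It is an assumption-laden sketch, not the paper's cost model: `c` is per-head compute time, `m` is per-head communication time, and `launch` is a hypothetical fixed cost paid for each extra collective.

```python
def serial_latency(h, c, m):
    """All h heads compute first, then one All-to-All moves all h outputs."""
    return h * c + h * m

def pipelined_latency(h, c, m, launch=0.0):
    """Head j's transfer overlaps head j+1's compute; each transfer pays a launch cost."""
    return c + (h - 1) * max(c, m) + m + h * launch

def speedup(h, c, m, launch=0.0):
    return serial_latency(h, c, m) / pipelined_latency(h, c, m, launch)

balanced = speedup(8, 1.0, 1.0)               # compute and communication comparable
compute_bound = speedup(8, 10.0, 1.0)         # communication is a small share
tiny = speedup(8, 0.1, 0.1, launch=0.5)       # launch cost dominates
print(balanced > compute_bound > 1.0 > tiny)  # True
```

In this model the gain peaks when compute and communication are comparable, shrinks when compute dominates, and turns into a slowdown when per-collective overhead outweighs the work being hidden.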

Memory Optimization Results (Table 4)

OpenSoraPlan:

1024×576×129: Baseline OOM → Offloading 28.3GB → DeDiVAE 28.1GB
800×592×129: Baseline 39.8GB → DeDiVAE 18.6GB (53.3% reduction)
480×352×129: Baseline 26.5GB → DeDiVAE 18.0GB (32.1% reduction)

HunyuanVideo:

All configurations baseline OOM
Offloading: 29.37-33.01GB (31.2-38.8% reduction)
DeDiVAE: 41.44-42.12GB (12.2-13.7% reduction)
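
The percentages above can be rechecked with a few lines of arithmetic. For HunyuanVideo the baseline is OOM, and the reported reductions appear consistent with measuring against the 48GB device capacity; that reference point is an inference from the numbers, not something the summary states.

```python
def reduction_pct(before_gb, after_gb):
    """Relative memory reduction in percent."""
    return 100.0 * (before_gb - after_gb) / before_gb

CAPACITY_GB = 48.0   # assumption: reference point for the OOM baseline rows

print(round(reduction_pct(39.8, 18.6), 1))          # OpenSoraPlan 800x592x129
print(round(reduction_pct(26.5, 18.0), 1))          # OpenSoraPlan 480x352x129
print(round(reduction_pct(CAPACITY_GB, 29.37), 1))  # HunyuanVideo offloading, best case
print(round(reduction_pct(CAPACITY_GB, 41.44), 1))  # HunyuanVideo DeDiVAE, best case
```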

Note: HunyuanVideo uses more memory with DeDiVAE than with offloading because its large text encoder is co-located with the VAE decoder, which shows that the grouping can be adapted to the model.

Ablation Study (Table 3)

Component contribution analysis (OpenSoraPlan, A6000, 30 steps; each cell gives end-to-end latency and speedup over the baseline):

Configuration	480×352×65	640×352×129	1024×576×129
Baseline (A)	314s (1×)	665s (1×)	1995s (1×)
+DeDiVAE (B)	217s (1.45×)	500s (1.33×)	2138s (0.93×)
+PipeSP (C)	200s (1.57×)	509s (1.31×)	1936s (1.03×)
+Aco (D)	261s (1.20×)	507s (1.31×)	1690s (1.18×)
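
The speedup column can be recomputed from the latencies as a consistency check. The dictionary below simply transcribes the table; the function name is illustrative.

```python
LATENCY_S = {
    "Baseline (A)": [314, 665, 1995],
    "+DeDiVAE (B)": [217, 500, 2138],
    "+PipeSP (C)":  [200, 509, 1936],
    "+Aco (D)":     [261, 507, 1690],
}

def speedups(table, baseline="Baseline (A)"):
    """Speedup of every row relative to the baseline row, rounded to two decimals."""
    base = table[baseline]
    return {name: [round(b / t, 2) for b, t in zip(base, row)]
            for name, row in table.items()}

for name, row in speedups(LATENCY_S).items():
    print(name, row)
```

The recomputed values match the table, including the 0.93× entry where adding DeDiVAE alone is slower than the baseline at 1024×576×129.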

Key findings:

DeDiVAE: Significant improvement at low resolutions (1.45× at 480×352×65), but a slowdown at 1024×576×129 (0.93×) because fewer GPUs remain for denoising
PipeSP: Pronounced effect on OpenSoraPlan (non-modular design allows more overlap)
Aco: Significant improvement under high workload (1.18× at 1024×576×129), compensating for DeDiVAE's limitations at high resolutions; at 480×352×65 the row with Aco reaches only 1.20×, below the 1.57× of the preceding row

Aco performance heatmap (Figure 5):

Shows latency differences between PipeDiT w/ Aco and w/o Aco
Aco provides significant improvements in high-workload configurations

Case Study

Generation result consistency verification (Figure 6):

Under identical prompts, configurations, and sampling frame indices
PipeDiT generation results are identical to original algorithm
Supports the claim that the optimization does not affect generation quality

Experimental Findings

Relationship between speedup and workload:
- Low resolution + few timesteps → highest speedup (4.02×)
- High resolution + many timesteps → still improved (1.06-1.18×)
- Reason: As computation takes a larger share of total time, removing the offloading bottleneck matters relatively less
Hardware interconnect impact:
- NVLink (A6000) vs PCIe (L40): Former achieves higher speedup
- High-bandwidth interconnect amplifies PipeSP communication hiding effect
Model scale impact:
- Large models (HunyuanVideo 13B) benefit more than small models (OpenSoraPlan 2B)
- Reason: Offloading overhead scales with model size
Future trend adaptation:
- Current trend: Fewer timesteps + more aggressive VAE compression
- Expectation: Reduced denoising time will further improve PipeDiT speedup
- MoE architectures (e.g., Wan2.2): Larger models make offloading impractical, PipeDiT advantages more pronounced

Related Work

Image Generation Optimization

DistriFusion:

Partitions input into multiple patches distributed across GPUs
Reuses intermediate feature maps from previous timestep for context
Hides communication overhead through asynchronous communication
Limitation: Designed for images, unsuitable for video's long sequences

PipeFusion:

Partitions images into patches and distributes network layers across GPUs
Addresses memory limitations during generation
Limitation: Layer-level parallelism unsuitable for video generation's sequence characteristics

Video Generation Optimization

Timestep reduction methods:

Teacache: Analyzes feature correlation between adjacent timesteps, reuses previous step output
DeepCache, Delta-DiT, FORA: Similar caching strategies that reuse features across timesteps to skip computation
Limitation: May introduce generation quality degradation

Sequence parallelism methods:

Ulysses (DeepSpeed): Partitions by attention heads, with three All-to-All operations before attention and one after; computation and communication run serially
Ring-Attention: Partitions by sequence, P2P communication, supports high parallelism but large overhead
USP (Unified SP): Combines both, flexible configuration but increases communication overhead
This paper's contribution: First to implement effective computation-communication pipelining in Ulysses

Memory Optimization

Offloading strategies:

HunyuanVideo, Wan, OpenSoraPlan all adopt this approach
Dynamically transfer model weights between CPU-GPU
Limitation: Significant transfer overhead, poor efficiency

DeDiVAE in this paper:

Module-level decoupling + GPU group separation
Avoids offloading overhead while reducing peak memory

System-level Optimization

LightSeq, FlexSP, LoongServe:

Sequence parallelism for long-context Transformers
Difference: This paper focuses on specific optimizations for video generation DiT

xDiT:

DiT inference engine integrating USP
This paper's contribution: Implements PipeDiT on its foundation, demonstrating method generality

Conclusions and Discussion

Main Conclusions

PipeSP effectiveness: Achieves computation-communication overlap through head-level pipelining, with a single-timestep speedup of up to 1.15× (2.10s → 1.83s)
DeDiVAE breakthrough: Module decoupling + GPU group separation reduces peak memory by up to 53.3%, enabling high-resolution generation
Aco complementarity: Dynamic resource utilization compensates for DeDiVAE limitations under high load, achieving overall 1.06-4.02× speedup
Generality verification: Effective on both 2B (OpenSoraPlan) and 13B (HunyuanVideo) parameter models
Quality assurance: Optimization does not alter generation algorithm, output results identical to original implementation

Limitations

Hardware dependency:
- NVLink platforms outperform PCIe, sensitive to interconnect bandwidth
- Requires multi-GPU systems (experiments use 8-GPU)
Workload adaptability:
- Very high resolution + many timesteps show reduced speedup (computation-dominated)
- Aco may introduce overhead under low workload
Attention head constraints:
- Models not supporting USP require padding for non-divisible heads
- May cause some GPUs to execute redundant computation
Module co-location flexibility:
- HunyuanVideo requires co-locating text encoder with VAE
- Large encoders may offset some memory optimization gains
Multi-prompt dependency:
- DeDiVAE pipelining requires multiple concurrent queries for full overlap
- Single-prompt scenarios may have idle GPUs

Future Directions

Dynamic GPU allocation:
- Adaptively adjust N_denoise and N_decode based on real-time workload
- Consider optimal configurations for different resolutions and timesteps
Extension to more parallelism dimensions:
- Combine tensor parallelism and data parallelism
- Support larger-scale models (100B+ parameters)
Heterogeneous hardware support:
- Adapt to mixed systems with different GPU types
- Optimize communication strategies for PCIe interconnects
MoE architecture optimization:
- Specialized optimization for MoE models like Wan2.2
- Handle load imbalance from expert routing
End-to-end optimization:
- Integrate text encoder optimization
- Explore more aggressive VAE compression methods
Automatic tuning framework:
- Automatically search optimal hyperparameters based on hardware and model characteristics
- Simplify user deployment process

In-depth Evaluation

Strengths

Strong innovation:
- PipeSP first implements effective computation-communication pipelining in Ulysses
- DeDiVAE breaks traditional co-location paradigm, proposing novel module-level decoupling
- Aco dynamic resource scheduling reflects deep system design thinking
Theoretical rigor:
- Provides formal mathematical proof of PipeSP transformation (supplementary materials)
- Optimal GPU allocation based on theoretical derivation from first-order balance condition
- Aco performance analysis provides clear speedup formula
Comprehensive experiments:
- Two models (2B and 13B parameters) × two platforms (A6000 and L40)
- 12 resolutions × 5 timesteps = 60 configurations (complete results)
- Detailed ablation studies analyzing component contributions
- Generation result consistency verification ensures quality preservation
High practical value:
- Implemented on mainstream open-source frameworks, easy to reproduce and deploy
- Significantly reduces memory consumption, enabling high-resolution generation
- 1.06-4.02× speedup directly translates to reduced service costs
Clear writing:
- Complete logical structure, clear hierarchy from problem analysis to method design
- Rich figures (flowcharts, performance graphs, heatmaps) enhance readability
- Supplementary materials provide complete experimental data and theoretical proofs

Weaknesses

Method limitations:
- High hardware requirements: Needs multi-GPU systems and high-bandwidth interconnects
- Load dependency: Reduced pipeline efficiency in single-prompt scenarios
- Scalability: Ulysses limited by attention head count, switching to Ring-Attention increases complexity
Experimental design flaws:
- Lack of user studies: No evaluation of subjective generation quality perception
- Single metric focus: Primarily addresses latency and memory, neglects energy consumption, throughput
- Insufficient hardware coverage: Only tests 48GB GPUs, lacks verification on larger or smaller memory configurations
Insufficient analysis depth:
- Communication overhead details: Lacks detailed analysis of P2P vs All-to-All specific overhead
- Load balancing: Doesn't discuss impact of non-uniform attention head distribution
- Failure cases: Doesn't present scenarios where method is inapplicable
Incomplete comparisons:
- Missing recent methods: No comparison with latest optimization methods from 2024-2025
- Single baseline: Only compares with offloading, lacks other memory optimization strategies (quantization, pruning)
Reproducibility issues:
- Code not open-sourced: No code link provided at publication time
- Implementation details: Some details (e.g., event synchronization mechanism) insufficiently described

Impact

Contributions to the field:

Theoretical contribution: Proposes novel module-level decoupling system optimization paradigm
Practical contribution: Provides deployable acceleration solution for video generation services
Inspirational value: Fine-grained pipelining concepts generalizable to other multi-stage generation tasks

Potential impact:

Short-term: OpenSoraPlan and HunyuanVideo communities can directly adopt
Medium-term: Influences commercial video generation service architecture design
Long-term: Promotes DiT inference optimization as independent research direction

Citation prospects:

System optimization field: Important reference for multi-GPU inference optimization
Video generation field: Baseline acceleration method
Estimated 50-100 citations within 1-2 years

Applicable Scenarios

Best applicable scenarios:

Multi-user video generation services:
- High concurrent queries, high pipeline efficiency
- Latency-sensitive, speedup directly improves user experience
High-resolution video generation:
- Memory-constrained scenarios, DeDiVAE advantages pronounced
- Replaces inefficient offloading strategies
NVLink multi-GPU systems:
- High-bandwidth interconnect amplifies PipeSP effect
- A100/H100 and other data center GPUs
Large model inference:
- 13B+ parameter models, significant offloading overhead
- MoE architecture models

Inapplicable scenarios:

Single-GPU inference: Method depends on multi-GPU parallelism
Extremely low-resolution generation: Short computation time, small optimization gains
Single-prompt batch processing: Pipeline cannot fully overlap
PCIe interconnect + low workload: Communication overhead may offset gains

Deployment recommendations:

Evaluate workload: Concurrent query count, resolution distribution
Hardware configuration: Prioritize NVLink platforms
Parameter tuning: Adjust N_denoise/N_decode ratio based on model size
Monitor metrics: Latency, memory, GPU utilization

References

Key citations:

Ulysses (Jacobs et al. 2023): DeepSpeed-Ulysses sequence parallelism foundation
Ring-Attention (Li et al. 2021): Sequence dimension partitioning parallelism strategy
USP (Fang & Zhao 2024): Unified sequence parallelism framework
DistriFusion (Li et al. 2024b): Patch-level parallelism for image generation
Teacache (Liu et al. 2025): Timestep feature reuse method
OpenSoraPlan (PKU-YuanGroup 2025): Open-source video generation framework
HunyuanVideo (Kong et al. 2024): Large-scale video generation model

Overall Assessment: This is a high-quality systems optimization paper addressing practical pain points in video generation DiT inference with innovative solutions. Three technical innovations work synergistically to form a complete optimization framework. Experimental design is comprehensive with convincing results. Main weaknesses are hardware dependency and limited depth in some analyses. Valuable reference for video generation service providers and systems optimization researchers. Authors are recommended to open-source code and verify long-term stability in production environments.