PipeDiT: Accelerating Diffusion Transformers in Video Generation with Task Pipelining and Model Decoupling

Wang, Wang, Shi
Video generation has been advancing rapidly, and diffusion transformer (DiT) based models have demonstrated remarkable capabilities. However, their practical deployment is often hindered by slow inference speeds and high memory consumption. In this paper, we propose a novel pipelining framework named PipeDiT to accelerate video generation, which is equipped with three main innovations. First, we design a pipelining algorithm (PipeSP) for sequence parallelism (SP) to enable the computation of latent generation and communication among multiple GPUs to be pipelined, thus reducing inference latency. Second, we propose DeDiVAE to decouple the diffusion module and the variational autoencoder (VAE) module into two GPU groups, whose executions can also be pipelined to reduce memory consumption and inference latency. Third, to better utilize the GPU resources in the VAE group, we propose an attention co-processing (Aco) method to further reduce the overall video generation latency. We integrate our PipeDiT into both OpenSoraPlan and HunyuanVideo, two state-of-the-art open-source video generation frameworks, and conduct extensive experiments on two 8-GPU systems. Experimental results show that, under many common resolution and timestep configurations, our PipeDiT achieves 1.06x to 4.02x speedups over OpenSoraPlan and HunyuanVideo.

Basic Information

  • Paper ID: 2511.12056
  • Title: PipeDiT: Accelerating Diffusion Transformers in Video Generation with Task Pipelining and Model Decoupling
  • Authors: Sijie Wang, Qiang Wang, Shaohuai Shi (Harbin Institute of Technology, Shenzhen Campus)
  • Classification: cs.CV, cs.AI, cs.DC
  • Publication Date: November 15, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2511.12056

Abstract

Video generation technology has advanced rapidly, with diffusion transformer (DiT) based models demonstrating exceptional capabilities. In practical deployment, however, these models suffer from slow inference and high memory consumption. This paper proposes the PipeDiT framework, which accelerates video generation through three innovations: (1) the PipeSP algorithm pipelines computation and communication in sequence parallelism; (2) the DeDiVAE method places the diffusion module and the VAE decoder on separate GPU groups; (3) Aco (attention cooperative processing) lets the otherwise idle VAE GPU group take part in attention computation. Experiments on OpenSoraPlan and HunyuanVideo show that PipeDiT achieves 1.06× to 4.02× speedups over the original implementations.

Research Background and Motivation

Core Problems

Diffusion transformers (DiT) face two critical bottlenecks in video generation:

  1. High inference latency: The inherent sequential nature of the reverse diffusion process severely limits parallelism
  2. Large memory consumption: The VAE decoding phase consumes substantial memory due to upsampling to target resolution and frame rate

Problem Significance

  • Practical requirements: Video generation services need to handle multiple concurrent queries, with inference efficiency directly impacting user experience and service costs
  • Hardware constraints: Experiments show that under 48GB GPU memory limits, OpenSoraPlan cannot generate videos exceeding 1024×576×97 resolution, while HunyuanVideo is limited to 256×128×33

Limitations of Existing Methods

Image generation optimization methods:

  • DistriFusion and PipeFusion are designed for image generation and are unsuitable for video generation's long sequence characteristics

Video generation optimization methods:

  • Teacache and similar methods: Reduce computation by reusing timestep features, but may degrade generation quality
  • Sequence parallelism (SP) methods:
    • Ulysses: Implements parallelism by partitioning attention heads, but suffers from serial execution of computation and communication, with underutilized GPU resources
    • Ring-Attention: Supports higher parallelism but incurs large communication overhead
    • USP: Combines both but introduces additional communication overhead

Offloading strategies:

  • Reduce memory consumption through CPU-GPU data transfer, but introduce significant transfer overhead and poor efficiency

Research Motivation

From performance analysis of OpenSoraPlan and HunyuanVideo (Figure 2):

  • Time bottleneck: Diffusion phase consumes far more time than other stages
  • Memory bottleneck: VAE decoding peak memory reaches 44GB (256×128×33 resolution)
  • Resource waste: Co-locating diffusion module and VAE decoder leads to serial execution and memory waste

Core Contributions

  1. PipeSP Algorithm: Proposes a pipelined sequence parallelism method that overlaps computation and communication by partitioning along the attention head dimension and immediately triggering All-to-All communication, improving GPU utilization
  2. DeDiVAE Module Decoupling: Allocates diffusion module and VAE decoder to different GPU groups, implementing module-level pipeline parallelism and significantly reducing peak memory consumption (up to 53.3% reduction for OpenSoraPlan)
  3. Aco Attention Cooperative Processing: Decomposes DiT blocks at fine granularity into linear projections and attention computation, allowing idle decoding GPU groups to participate in attention computation, further improving overall efficiency
  4. System Implementation and Verification: Implemented on OpenSoraPlan (2B parameters) and HunyuanVideo (13B parameters), with extensive experiments on 8-GPU systems demonstrating method effectiveness and scalability

Method Details

Task Definition

Video generation pipeline:

  • Input: Text prompts
  • Output: High-quality videos
  • Two-stage process:
    1. Denoising stage: Diffusion model iteratively refines latent representations across multiple timesteps
    2. Decoding stage: VAE decoder upsamples latent representations to full-resolution video

Model Architecture

1. PipeSP: Pipelined Sequence Parallelism

Original Ulysses problem:

  • All attention head computations complete before a single All-to-All communication is executed
  • GPUs remain idle while waiting for communication

PipeSP design (Algorithm 1):

For each attention head j ∈ [0, h-1]:
  1. Compute attention(Q[:,j,:,:], K[:,j,:,:], V[:,j,:,:])
  2. Record CUDA event marking computation completion
  3. Immediately trigger All-to-All communication after event completion
  4. Collect results
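
The benefit of the per-head loop above can be illustrated with a small timing model. This is a sketch under stated assumptions, not the paper's implementation: every head is assumed to take the same hypothetical compute time and the same hypothetical transfer time, a single communication stream handles one transfer at a time, and the baseline's one large All-to-All is assumed to cost the sum of the per-head transfers.

```python
def serial_latency(num_heads: int, t_compute: float, t_comm: float) -> float:
    """Baseline: compute every head, then run one All-to-All for all heads."""
    return num_heads * t_compute + num_heads * t_comm


def pipelined_latency(num_heads: int, t_compute: float, t_comm: float) -> float:
    """Head-level pipelining: the transfer for head j starts once its compute
    has finished and the (assumed single) communication stream is free."""
    compute_done = 0.0  # time at which the compute stream finishes the current head
    comm_done = 0.0     # time at which the communication stream becomes free
    for _ in range(num_heads):
        compute_done += t_compute
        comm_start = max(compute_done, comm_done)
        comm_done = comm_start + t_comm
    return comm_done


if __name__ == "__main__":
    heads, tc, tm = 8, 1.0, 0.5  # hypothetical values, arbitrary time units
    print("serial:   ", serial_latency(heads, tc, tm))
    print("pipelined:", pipelined_latency(heads, tc, tm))
```

In this model, when a transfer is shorter than a head's compute time, only the last transfer remains exposed; when transfers are longer, communication becomes the critical path and the gain shrinks.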

Post-processing alignment (resolving result misalignment):

  • Applies the tensor transformation view(-1, h, n, D) → permute(0, 2, 1, 3) → view(-1, nh, D) to the collected results
  • Maps interleaved tensors to the head-contiguous layout expected by original Ulysses

Mathematical correctness proof: Define the reshape mapping φ_{h,n} and the permutation operation π. The composite mapping Ψ = φ^{-1}_{h,n} ∘ π ∘ φ_{h,n} satisfies:

(ΨT_mod)[b, k_orig(i,j), d] = T_mod[b, k_mod(i,j), d]

ensuring optimized results are identical to the original implementation.
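
The realignment can be checked on a toy example. The sketch below applies the view/permute/view sequence to a nested Python list indexed as t_mod[b][k][d]. It reads the index maps k_mod(i, j) = j·n + i and k_orig(i, j) = i·h + j off that sequence; the source does not spell them out, so both the maps and the labels used here ("r" for the size-n axis, "h" for the size-h axis) are assumptions for illustration.

```python
def realign(t_mod, h, n):
    """Emulate view(-1, h, n, D) -> permute(0, 2, 1, 3) -> view(-1, n*h, D)
    on a nested list indexed as t_mod[b][k][d]."""
    out = []
    for rows in t_mod:
        if len(rows) != h * n:
            raise ValueError("middle axis must have length h * n")
        # Input position j*n + i moves to output position i*h + j.
        out.append([rows[j * n + i] for i in range(n) for j in range(h)])
    return out


if __name__ == "__main__":
    h, n = 2, 3
    # Interleaved input: the size-h index j varies slowest, the size-n index i fastest.
    t_mod = [[[f"r{i}h{j}"] for j in range(h) for i in range(n)]]
    print(realign(t_mod, h, n)[0])
```

After the call, entries that share the same size-n index sit next to each other, which is the contiguous layout the alignment step is meant to restore.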

2. DeDiVAE: Diffusion-VAE Module Decoupling

GPU grouping strategy:

  • Denoising group: N_denoise GPUs storing the diffusion backbone
  • Decoding group: N_decode = N - N_denoise GPUs storing the VAE decoder

Optimal GPU allocation: Based on first-order balance condition, equalizing execution time of both groups to maximize overlap:

N_decode ≈ ⌈(T_decode / (T_decode + T_denoise)) × N⌉

where T_denoise and T_decode are single-GPU denoising and decoding times respectively.
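
As a minimal sketch, the allocation rule can be written as a small helper. The function name, the example timings, and the clamp that keeps at least one GPU in each group are assumptions added for illustration; only the ceiling formula comes from the text above.

```python
import math
from typing import Tuple


def split_gpus(total_gpus: int, t_denoise: float, t_decode: float) -> Tuple[int, int]:
    """Return (n_denoise, n_decode) using
    n_decode ~= ceil(t_decode / (t_decode + t_denoise) * total_gpus)."""
    if total_gpus < 2:
        raise ValueError("need at least two GPUs to form two groups")
    n_decode = math.ceil(t_decode / (t_decode + t_denoise) * total_gpus)
    # Assumption (not from the source): keep at least one GPU in each group.
    n_decode = min(max(n_decode, 1), total_gpus - 1)
    return total_gpus - n_decode, n_decode


if __name__ == "__main__":
    # Hypothetical single-GPU timings in seconds.
    print(split_gpus(8, t_denoise=30.0, t_decode=10.0))
```

With these hypothetical timings, decoding accounts for a quarter of the single-GPU work, so a quarter of the eight GPUs goes to the decoding group.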

Multi-prompt pipelining:

  • Decoding of the first prompt executes in parallel with denoising of the second prompt
  • Latent representations are passed through shared queues, implementing producer-consumer pattern
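
The producer-consumer pattern can be sketched with Python's standard library. Here a thread stands in for the decoding GPU group and the main thread for the denoising group; the function names and the string-based stand-ins for denoising and decoding are hypothetical, and a real system would move tensors between devices rather than strings between threads.

```python
import queue
import threading


def run_pipeline(prompts, denoise, decode):
    """Overlap decoding of prompt k with denoising of prompt k+1 via a shared queue."""
    latents = queue.Queue()
    videos = []

    def decoder_worker():
        # Consumer: stands in for the decoding group.
        while True:
            item = latents.get()
            if item is None:  # sentinel: no more latents will arrive
                break
            videos.append(decode(item))

    worker = threading.Thread(target=decoder_worker)
    worker.start()
    for prompt in prompts:  # producer: stands in for the denoising group
        latents.put(denoise(prompt))
    latents.put(None)
    worker.join()
    return videos


if __name__ == "__main__":
    out = run_pipeline(["a cat", "a dog"],
                       denoise=lambda p: p + " -> latent",
                       decode=lambda z: z + " -> video")
    print(out)
```

Because there is a single consumer reading from a FIFO queue, videos come out in prompt order.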

3. Aco: Attention Cooperative Processing

Motivation: When denoising time far exceeds decoding time, decoding GPU groups remain idle most of the time

Fine-grained decomposition: Decompose DiT blocks into:

  • Linear projections: Q = XW_Q, K = XW_K, V = XW_V (executed by denoising group)
  • Attention kernel: Attn(Q,K,V) (can execute in parallel on decoding group)

Execution flow:

  • Prompt 1 phase (decoding queue empty):
    1. Denoising group computes Q,K,V and sends to decoding group via P2P communication
    2. Both groups execute attention computation in parallel
    3. Results aggregated through All-to-All and P2P communication
  • Prompt 2 phase (decoding queue non-empty):
    1. Denoising group executes attention computation independently
    2. Decoding group executes VAE decoding in parallel

Performance analysis: Theoretical speedup:

S = T_baseline / T_coop = (t_L + t_A) / (t_L + t_A × N_denoise/N)

where t_L and t_A are linear projection and attention computation times respectively.
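
The speedup formula is easy to evaluate for a given configuration. The helper below is a direct transcription; the function name and the example timings are hypothetical.

```python
def aco_speedup(t_linear: float, t_attn: float, n_denoise: int, n_total: int) -> float:
    """S = (t_L + t_A) / (t_L + t_A * N_denoise / N)."""
    if not 0 < n_denoise <= n_total:
        raise ValueError("expected 0 < n_denoise <= n_total")
    t_baseline = t_linear + t_attn
    t_coop = t_linear + t_attn * n_denoise / n_total
    return t_baseline / t_coop


if __name__ == "__main__":
    # Hypothetical timings: attention takes three times as long as the projections.
    print(aco_speedup(t_linear=1.0, t_attn=3.0, n_denoise=6, n_total=8))
```

Two limits follow from the formula itself: with no decoding GPUs to borrow (N_denoise = N) the speedup is 1, and as t_L becomes negligible relative to t_A the speedup approaches N / N_denoise.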

Handling non-divisible attention heads:

  • OpenSoraPlan: Introduces head dimension padding to ensure load balancing
  • HunyuanVideo/Wan: Supports USP, allowing flexible switching between Ulysses and Ring-Attention degrees, avoiding padding overhead
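
For the padding case, a generic round-up rule gives the padded head count and the number of dummy heads. The source does not specify how OpenSoraPlan implements its padding, so this is a sketch of the usual approach, and the head and GPU counts in the example are hypothetical.

```python
from typing import Tuple


def padded_head_count(num_heads: int, num_gpus: int) -> Tuple[int, int]:
    """Round the head count up to the next multiple of num_gpus.

    Returns (padded_heads, extra_heads), where extra_heads is the number of
    dummy heads whose results would be discarded.
    """
    if num_heads <= 0 or num_gpus <= 0:
        raise ValueError("num_heads and num_gpus must be positive")
    heads_per_gpu = -(-num_heads // num_gpus)  # ceiling division
    padded = heads_per_gpu * num_gpus
    return padded, padded - num_heads


if __name__ == "__main__":
    print(padded_head_count(24, 5))  # hypothetical: 24 heads over 5 GPUs
```

The extra heads are the redundant computation that padding introduces, which is the overhead that switching between Ulysses and Ring-Attention degrees avoids.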

Technical Innovations

  1. Computation-communication overlap: PipeSP achieves effective communication hiding in Ulysses for the first time through fine-grained head-level pipelining
  2. Module-level decoupling: DeDiVAE breaks traditional co-location design, achieving dual optimization of memory and computation through GPU group separation
  3. Dynamic resource scheduling: Aco dynamically utilizes idle GPU resources based on workload, avoiding efficiency loss from traditional static allocation
  4. Mathematical rigor: Provides formal correctness proof of PipeSP transformation, ensuring optimization does not alter computation results

Experimental Setup

Test Platforms

System 1: 8× NVIDIA RTX A6000 (48GB)

  • CPU: Intel Xeon Platinum 8358 @2.60GHz
  • Interconnect: NVLink (112.5GB/s, 4×)

System 2: 8× NVIDIA L40 (48GB)

  • CPU: Intel Xeon Platinum 8358 @2.60GHz
  • Interconnect: PCIe 4.0 (x16)

Baseline Models

  • OpenSoraPlan v1.3.0: 2B parameters, using Ulysses sequence parallelism
  • HunyuanVideo: 13B parameters, integrating xDiT's USP

Evaluation Metrics

  1. Single timestep latency: Measures PipeSP optimization effectiveness
  2. End-to-end latency: Total time for generating multiple videos, measuring overall PipeDiT optimization
  3. Peak GPU memory: Evaluates DeDiVAE memory optimization

Experimental Configuration

Resolution settings:

  • 480×352 (65/97/129 frames)
  • 640×352 (65/97/129 frames)
  • 800×592 (65/97/129 frames)
  • 1024×576 (65/97/129 frames)

Timestep counts: 10, 20, 30, 40, 50

Prompt quantities: 10 prompts (main experiments), additional configurations in supplementary materials

Comparison methods:

  • Baseline: Original implementation + offloading
  • PipeDiT (w/o Aco): PipeSP + DeDiVAE
  • PipeDiT (w/ Aco): Complete method

Experimental Results

Main Results

End-to-end Performance (Table 1)

OpenSoraPlan (A6000):

  • Maximum speedup: 480×352×97, 10 steps → 2.12× (227s → 107s)
  • High resolution: 1024×576×97, 50 steps → 1.18× (2162s → 1832s)
  • Trend: Speedup is larger at lower resolutions, fewer frames, and fewer timesteps

HunyuanVideo (A6000):

  • Maximum speedup: 480×352×97, 10 steps → 3.27× (540s → 165s)
  • Large model advantage: Larger parameter count leads to higher offloading overhead, making PipeDiT optimization more effective
  • High resolution: 1024×576×97, 50 steps → 1.08× (3726s → 3453s)

Platform differences:

  • A6000 (NVLink) achieves higher speedup compared to L40 (PCIe)
  • Example: HunyuanVideo 480×352×97, 10 steps: A6000 3.27× vs L40 2.95×

Complete results in supplementary materials:

  • Maximum speedup reaches 4.02× (HunyuanVideo, 480×352×65, 10 steps)
  • Covers 12 resolutions × 5 timestep configurations, totaling 60 experiments

PipeSP Effectiveness (Table 2)

Optimal configuration: 640×352×129

  • OpenSoraPlan (A6000): 1.15× speedup (2.10s → 1.83s)
  • OpenSoraPlan (L40): 1.04× speedup (2.44s → 2.34s)

Performance characteristics:

  • Best results at medium resolutions (balancing computation and communication time)
  • Very low resolution: Communication overhead offsets gains
  • Very high resolution: Reduced communication proportion, lower optimization gains

Memory Optimization Results (Table 4)

OpenSoraPlan:

  • 1024×576×129: Baseline OOM → Offloading 28.3GB → DeDiVAE 28.1GB
  • 800×592×129: Baseline 39.8GB → DeDiVAE 18.6GB (53.3% reduction)
  • 480×352×129: Baseline 26.5GB → DeDiVAE 18.0GB (32.1% reduction)

HunyuanVideo:

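  • Baseline: OOM in all configurations
  • Offloading: 29.37-33.01GB (31.2-38.8% reduction)
  • DeDiVAE: 41.44-42.12GB (12.2-13.7% reduction)
  • Because the baseline runs out of memory, no baseline figure exists; the percentages are consistent with reductions measured against the 48GB device capacity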

Note: For HunyuanVideo, DeDiVAE uses more memory than offloading because the large text encoder is co-located with the VAE decoder, which illustrates that the method can adapt module placement to the model.

Ablation Study (Table 3)

Component contribution analysis (OpenSoraPlan A6000, 30 steps):

Configuration    480×352×65      640×352×129     1024×576×129
Baseline (A)     314s (1×)       665s (1×)       1995s (1×)
+DeDiVAE (B)     217s (1.45×)    500s (1.33×)    2138s (0.93×)
+PipeSP (C)      200s (1.57×)    509s (1.31×)    1936s (1.03×)
+Aco (D)         261s (1.20×)    507s (1.31×)    1690s (1.18×)
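
The parenthesized values in the table are latency ratios relative to Baseline (A). As a quick check, the 480×352×65 column can be recomputed from the reported latencies; the variable and function names below are illustrative.

```python
def speedup(baseline_s: float, optimized_s: float) -> float:
    """Latency ratio baseline / optimized, rounded to two decimals."""
    return round(baseline_s / optimized_s, 2)


# Latencies in seconds for 480×352×65, taken from the ablation table.
baseline = 314
ablation_480x352x65 = {"+DeDiVAE (B)": 217, "+PipeSP (C)": 200, "+Aco (D)": 261}

ratios = {name: speedup(baseline, t) for name, t in ablation_480x352x65.items()}

if __name__ == "__main__":
    for name, ratio in ratios.items():
        print(f"{name}: {ratio:.2f}x")
```

A ratio below 1, as in the +DeDiVAE entry at 1024×576×129, means the configuration is slower than the baseline.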

Key findings:

  1. DeDiVAE: Significant improvement at low resolutions, reduced effectiveness at high resolutions due to fewer denoising GPUs
  2. PipeSP: Pronounced effect on OpenSoraPlan (non-modular design allows more overlap)
  3. Aco: Significant improvement under high workload, compensating for DeDiVAE's limitations at high resolutions

Aco performance heatmap (Figure 5):

  • Shows latency differences between PipeDiT w/ Aco and w/o Aco
  • Aco provides significant improvements in high-workload configurations

Case Study

Generation result consistency verification (Figure 6):

  • Under identical prompts, configurations, and sampled frame indices, PipeDiT's outputs are identical to those of the original algorithm
  • This supports the claim that the optimization does not affect generation quality

Experimental Findings

  1. Relationship between speedup and workload:
    • Low resolution + few timesteps → highest speedup (4.02×)
    • High resolution + many timesteps → still improved (1.06-1.18×)
    • Reason: As computation takes a larger share of the total time, removing the offloading bottleneck matters relatively less
  2. Hardware interconnect impact:
    • NVLink (A6000) vs PCIe (L40): Former achieves higher speedup
    • High-bandwidth interconnect amplifies PipeSP communication hiding effect
  3. Model scale impact:
    • Large models (HunyuanVideo 13B) benefit more than small models (OpenSoraPlan 2B)
    • Reason: Offloading overhead scales with model size
  4. Future trend adaptation:
    • Current trend: Fewer timesteps + more aggressive VAE compression
    • Expectation: Reduced denoising time will further improve PipeDiT speedup
    • MoE architectures (e.g., Wan2.2): Larger models make offloading impractical, PipeDiT advantages more pronounced

Related Work

Image Generation Optimization

DistriFusion:

  • Partitions input into multiple patches distributed across GPUs
  • Reuses intermediate feature maps from previous timestep for context
  • Hides communication overhead through asynchronous communication
  • Limitation: Designed for images, unsuitable for video's long sequences

PipeFusion:

  • Partitions images into patches and distributes network layers across GPUs
  • Addresses memory limitations during generation
  • Limitation: Layer-level parallelism unsuitable for video generation's sequence characteristics

Video Generation Optimization

Timestep reduction methods:

  • Teacache: Analyzes feature correlation between adjacent timesteps, reuses previous step output
  • DeepCache, Delta-DiT, FORA: Similar strategies reducing timestep count
  • Limitation: May introduce generation quality degradation

Sequence parallelism methods:

  • Ulysses (DeepSpeed): Partitions by attention heads, with three All-to-All operations before attention (for Q, K, V) and one after, but computation and communication run serially
  • Ring-Attention: Partitions by sequence, P2P communication, supports high parallelism but large overhead
  • USP (Unified SP): Combines both, flexible configuration but increases communication overhead
  • This paper's contribution: First to implement effective computation-communication pipelining in Ulysses

Memory Optimization

Offloading strategies:

  • HunyuanVideo, Wan, OpenSoraPlan all adopt this approach
  • Dynamically transfer model weights between CPU-GPU
  • Limitation: Significant transfer overhead, poor efficiency

DeDiVAE in this paper:

  • Module-level decoupling + GPU group separation
  • Avoids offloading overhead while reducing peak memory

System-level Optimization

LightSeq, FlexSP, LoongServe:

  • Sequence parallelism for long-context Transformers
  • Difference: This paper focuses on specific optimizations for video generation DiT

xDiT:

  • DiT inference engine integrating USP
  • This paper's contribution: Implements PipeDiT on its foundation, demonstrating method generality

Conclusions and Discussion

Main Conclusions

  1. PipeSP effectiveness: Achieves computation-communication overlap through head-level pipelining, speeding up a single timestep by up to 1.15× (2.10s → 1.83s)
  2. DeDiVAE breakthrough: Module decoupling + GPU group separation reduces peak memory by up to 53.3%, enabling high-resolution generation
  3. Aco complementarity: Dynamic resource utilization compensates for DeDiVAE limitations under high load, achieving overall 1.06-4.02× speedup
  4. Generality verification: Effective on both 2B (OpenSoraPlan) and 13B (HunyuanVideo) parameter models
  5. Quality assurance: Optimization does not alter generation algorithm, output results identical to original implementation

Limitations

  1. Hardware dependency:
    • NVLink platforms outperform PCIe, sensitive to interconnect bandwidth
    • Requires multi-GPU systems (experiments use 8-GPU)
  2. Workload adaptability:
    • Very high resolution + many timesteps show reduced speedup (computation-dominated)
    • Aco may introduce overhead under low workload
  3. Attention head constraints:
    • Models not supporting USP require padding for non-divisible heads
    • May cause some GPUs to execute redundant computation
  4. Module co-location flexibility:
    • HunyuanVideo requires co-locating text encoder with VAE
    • Large encoders may offset some memory optimization gains
  5. Multi-prompt dependency:
    • DeDiVAE pipelining requires multiple concurrent queries for full overlap
    • Single-prompt scenarios may have idle GPUs

Future Directions

  1. Dynamic GPU allocation:
    • Adaptively adjust N_denoise and N_decode based on real-time workload
    • Consider optimal configurations for different resolutions and timesteps
  2. Extension to more parallelism dimensions:
    • Combine tensor parallelism and data parallelism
    • Support larger-scale models (100B+ parameters)
  3. Heterogeneous hardware support:
    • Adapt to mixed systems with different GPU types
    • Optimize communication strategies for PCIe interconnects
  4. MoE architecture optimization:
    • Specialized optimization for MoE models like Wan2.2
    • Handle load imbalance from expert routing
  5. End-to-end optimization:
    • Integrate text encoder optimization
    • Explore more aggressive VAE compression methods
  6. Automatic tuning framework:
    • Automatically search optimal hyperparameters based on hardware and model characteristics
    • Simplify user deployment process

In-depth Evaluation

Strengths

  1. Strong innovation:
    • PipeSP first implements effective computation-communication pipelining in Ulysses
    • DeDiVAE breaks traditional co-location paradigm, proposing novel module-level decoupling
    • Aco dynamic resource scheduling reflects deep system design thinking
  2. Theoretical rigor:
    • Provides formal mathematical proof of PipeSP transformation (supplementary materials)
    • Optimal GPU allocation based on theoretical derivation from first-order balance condition
    • Aco performance analysis provides clear speedup formula
  3. Comprehensive experiments:
    • Two models (2B and 13B parameters) × two platforms (A6000 and L40)
    • 12 resolutions × 5 timesteps = 60 configurations (complete results)
    • Detailed ablation studies analyzing component contributions
    • Generation result consistency verification ensures quality preservation
  4. High practical value:
    • Implemented on mainstream open-source frameworks, easy to reproduce and deploy
    • Significantly reduces memory consumption, enabling high-resolution generation
    • 1.06-4.02× speedup directly translates to reduced service costs
  5. Clear writing:
    • Complete logical structure, clear hierarchy from problem analysis to method design
    • Rich figures (flowcharts, performance graphs, heatmaps) enhance readability
    • Supplementary materials provide complete experimental data and theoretical proofs

Weaknesses

  1. Method limitations:
    • High hardware requirements: Needs multi-GPU systems and high-bandwidth interconnects
    • Load dependency: Reduced pipeline efficiency in single-prompt scenarios
    • Scalability: Ulysses limited by attention head count, switching to Ring-Attention increases complexity
  2. Experimental design flaws:
    • Lack of user studies: No evaluation of subjective generation quality perception
    • Single metric focus: Primarily addresses latency and memory, neglects energy consumption, throughput
    • Insufficient hardware coverage: Only tests 48GB GPUs, lacks verification on larger or smaller memory configurations
  3. Insufficient analysis depth:
    • Communication overhead details: Lacks detailed analysis of P2P vs All-to-All specific overhead
    • Load balancing: Doesn't discuss impact of non-uniform attention head distribution
    • Failure cases: Doesn't present scenarios where method is inapplicable
  4. Incomplete comparisons:
    • Missing recent methods: No comparison with latest optimization methods from 2024-2025
    • Single baseline: Only compares with offloading, lacks other memory optimization strategies (quantization, pruning)
  5. Reproducibility issues:
    • Code not open-sourced: No code link provided at publication time
    • Implementation details: Some details (e.g., event synchronization mechanism) insufficiently described

Impact

Contributions to the field:

  • Theoretical contribution: Proposes novel module-level decoupling system optimization paradigm
  • Practical contribution: Provides deployable acceleration solution for video generation services
  • Inspirational value: Fine-grained pipelining concepts generalizable to other multi-stage generation tasks

Potential impact:

  • Short-term: OpenSoraPlan and HunyuanVideo communities can directly adopt
  • Medium-term: Influences commercial video generation service architecture design
  • Long-term: Promotes DiT inference optimization as independent research direction

Citation prospects:

  • System optimization field: Important reference for multi-GPU inference optimization
  • Video generation field: Baseline acceleration method
  • Estimated 50-100 citations within 1-2 years

Applicable Scenarios

Best applicable scenarios:

  1. Multi-user video generation services:
    • High concurrent queries, high pipeline efficiency
    • Latency-sensitive, speedup directly improves user experience
  2. High-resolution video generation:
    • Memory-constrained scenarios, DeDiVAE advantages pronounced
    • Replaces inefficient offloading strategies
  3. NVLink multi-GPU systems:
    • High-bandwidth interconnect amplifies PipeSP effect
    • A100/H100 and other data center GPUs
  4. Large model inference:
    • 13B+ parameter models, significant offloading overhead
    • MoE architecture models

Inapplicable scenarios:

  1. Single-GPU inference: Method depends on multi-GPU parallelism
  2. Extremely low-resolution generation: Short computation time, small optimization gains
  3. Single-prompt batch processing: Pipeline cannot fully overlap
  4. PCIe interconnect + low workload: Communication overhead may offset gains

Deployment recommendations:

  • Evaluate workload: Concurrent query count, resolution distribution
  • Hardware configuration: Prioritize NVLink platforms
  • Parameter tuning: Adjust N_denoise/N_decode ratio based on model size
  • Monitor metrics: Latency, memory, GPU utilization

References

Key citations:

  1. Ulysses (Jacobs et al. 2023): DeepSpeed-Ulysses sequence parallelism foundation
  2. Ring-Attention (Li et al. 2021): Sequence dimension partitioning parallelism strategy
  3. USP (Fang & Zhao 2024): Unified sequence parallelism framework
  4. DistriFusion (Li et al. 2024b): Patch-level parallelism for image generation
  5. Teacache (Liu et al. 2025): Timestep feature reuse method
  6. OpenSoraPlan (PKU-YuanGroup 2025): Open-source video generation framework
  7. HunyuanVideo (Kong et al. 2024): Large-scale video generation model

Overall Assessment: This is a high-quality systems optimization paper addressing practical pain points in video generation DiT inference with innovative solutions. Three technical innovations work synergistically to form a complete optimization framework. Experimental design is comprehensive with convincing results. Main weaknesses are hardware dependency and limited depth in some analyses. Valuable reference for video generation service providers and systems optimization researchers. Authors are recommended to open-source code and verify long-term stability in production environments.