Efficiently Executing High-throughput Lightweight LLM Inference Applications on Heterogeneous Opportunistic GPU Clusters with Pervasive Context Management
Phung, Thain
The rise of generative AI introduces a new class of HPC workloads that integrates lightweight LLMs with traditional high-throughput applications to accelerate scientific discovery. However, the current design of HPC clusters is inadequate to support this new class: applications either incur long wait times in static batch queues or repeatedly pay expensive LLM startup costs upon resource preemption. To circumvent both the long queues and the high startup costs, we propose to "decouple" the LLM initialization context from the actual LLM inferences and retain the context in GPUs until it is no longer needed, a technique we term "Pervasive Context Management". We transform a fact verification application to enable this technique, allowing it to reduce its execution time by 72.1% (from 3 hours to 48 minutes) using the same number of GPUs, and to scale opportunistically on 32.8% of all GPUs in the cluster, further reducing the execution time to 13 minutes.
The rise of generative AI has introduced a new class of HPC workloads that integrate lightweight LLMs with traditional high-throughput applications to accelerate scientific discovery. However, current HPC cluster designs inadequately support these workloads, either incurring long waiting times in static batch queues or repeatedly bearing expensive LLM initialization costs upon resource preemption. To circumvent long queues and high startup costs, this paper proposes "decoupling" LLM initialization context from actual LLM inference and retaining the context in GPUs until no longer needed, a technique termed "Pervasive Context Management." Through adaptation of a fact verification application, this technique reduces execution time by 72.1% (from 3 hours to 48 minutes) and enables opportunistic scaling across 32.8% of cluster GPUs, further reducing execution time to 13 minutes.
With the rapid development of large language model (LLM) technology, a new class of HPC workloads is emerging that integrates inference with lightweight LLMs (typically models with billions of parameters) into traditional high-throughput applications. Such applications show considerable potential in domains such as protein folding and distributed AI-driven scientific computing.
Limitations of Static Allocation Models: Traditional static GPU allocation requires an exclusive, fixed-size batch of GPUs, which leads to long queue wait times and low cluster utilization.
Startup Costs of Opportunistic Allocation: Opportunistic allocation can use GPUs as they become available, but LLM startup (loading billions of parameters from a distributed file system to local disk, then host memory, and finally GPU memory) is I/O intensive and may take several minutes.
Cost of Resource Preemption: When a task is preempted, the entire startup process must be repeated on new resources, so startup costs often exceed the actual computation time.
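To make the startup cost concrete, the following is a minimal sketch, not taken from the paper, that estimates startup time as the sum of per-stage transfer times and then computes the share of GPU time lost to startup when a task is preempted after a given amount of useful work. The stage names mirror the loading path described above; the model size, bandwidths, and preemption interval are illustrative placeholders, not measurements.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stage:
    """One hop of the model-loading path (hypothetical bandwidth values)."""
    name: str
    bandwidth_gb_per_s: float


def startup_seconds(model_gb: float, stages: List[Stage]) -> float:
    """Time to move the model through every stage, one after another."""
    return sum(model_gb / s.bandwidth_gb_per_s for s in stages)


def startup_fraction(startup_s: float, useful_s: float) -> float:
    """Share of a task's GPU time spent on startup rather than inference."""
    return startup_s / (startup_s + useful_s)


# Illustrative assumptions only: a 6 GB model and made-up bandwidths.
EXAMPLE_STAGES = [
    Stage("distributed file system -> local disk", 0.125),
    Stage("local disk -> host memory", 1.0),
    Stage("host memory -> GPU memory", 8.0),
]

startup = startup_seconds(6.0, EXAMPLE_STAGES)  # 48 + 6 + 0.75 = 54.75 s
lost = startup_fraction(startup, useful_s=30.0)  # preempted after 30 s of work
print(f"startup: {startup:.2f} s, share lost to startup: {lost:.0%}")
```

Under these assumed numbers the slowest hop dominates, and a task that is preempted soon after it starts spends more time loading the model than using it.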
Proposed Pervasive Context Management Technique: Elevates the LLM initialization context to a first-class persistent entity in the cluster, enabling reuse across multiple tasks.
Implemented High-throughput Fact Verification Application Based on Parsl-TaskVine Framework: Demonstrates the use of lightweight LLMs in a distributed data-intensive framework.
Designed Rapid Application Transformation Method: Enables applications to become context-aware through simple code refactoring.
Verified Significant Performance Improvements: Reduces execution time by 72.1% (from 3 hours to 48 minutes) with the same number of GPUs, and enables opportunistic scaling to 32.8% of cluster GPUs, which further reduces execution time to 13 minutes.
The core idea of Pervasive Context Management is to decouple expensive LLM context initialization from actual inference execution, making context a first-class entity that can be persisted and reused across cluster nodes.
The full-context version shows only a 13.6% variation in execution time across batch sizes, whereas the partial-context version's execution time surges to 141,100 seconds at a batch size of 1, demonstrating extreme sensitivity to batch size.
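One way to see why batch size matters so much is a simple cost model. The sketch below is an assumption, not the paper's analysis: it supposes that a version without a fully persistent context pays some initialization cost once per task, while a version with a persistent context pays it once per worker. The function names and all numeric values are hypothetical, and the totals are summed GPU-seconds rather than wall-clock time.

```python
import math


def per_task_init_time(n_items: int, batch_size: int,
                       init_s: float, infer_s: float) -> float:
    """Total time when every task (one batch) repeats the initialization."""
    n_tasks = math.ceil(n_items / batch_size)
    return n_tasks * init_s + n_items * infer_s


def persistent_context_time(n_items: int, n_workers: int,
                            init_s: float, infer_s: float) -> float:
    """Total time when each worker initializes once and then only infers."""
    return n_workers * init_s + n_items * infer_s


# Illustrative values only: 1000 items, 10 s init, 0.5 s per inference.
for batch in (1, 10, 100):
    total = per_task_init_time(1000, batch, init_s=10.0, infer_s=0.5)
    print(f"batch={batch:>3}: per-task init total = {total:.0f} s")
print(f"persistent context, 1 worker = "
      f"{persistent_context_time(1000, 1, init_s=10.0, infer_s=0.5):.0f} s")
```

In this model the per-task total grows with the number of tasks, so it is largest at a batch size of 1, while the persistent-context total does not depend on batch size at all.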
Spot instances in cloud computing offer a similar opportunistic computing model but typically give a 30-120 second preemption warning, whereas preemption in HPC environments is often instantaneous, which renders traditional state-saving mechanisms ineffective.
Workflow tooling has evolved from traditional resource managers to modern Python-native workflow systems; the Parsl-TaskVine integration in this paper represents a new direction that supports sharing computational context across tasks.
The paper cites 61 references covering important work in LLM technology, HPC scheduling, workflow systems, and other domains, providing a solid foundation for the research.
Overall Assessment: This is a high-quality research paper addressing emerging AI workloads in HPC environments. The authors accurately identify practical problems, propose innovative solutions, and comprehensively validate method effectiveness through experiments. While certain limitations exist in applicable scope and theoretical analysis, the work provides valuable contributions to related research and practice.