2025-11-24T10:40:17.913420

Efficiently Executing High-throughput Lightweight LLM Inference Applications on Heterogeneous Opportunistic GPU Clusters with Pervasive Context Management

Phung, Thain

The rise of Generative AI introduces a new class of HPC workloads that integrates lightweight LLMs with traditional high-throughput applications to accelerate scientific discovery. However, the current design of HPC clusters is inadequate to support this new class: applications either incur long wait times on static batch queues or repeatedly pay expensive LLM startup costs upon resource preemption. To circumvent both the long queues and high startup costs, we propose to "decouple" the LLM initialization context from the actual LLM inferences, and retain the context in GPUs until it is no longer needed, a technique we term "Pervasive Context Management". We transform a fact verification application to enable this technique, allowing it to reduce its execution time by 72.1% (from 3 hours to 48 minutes) using the same number of GPUs, and to scale opportunistically on 32.8% of all GPUs in the cluster, further reducing the execution time to 13 minutes.

academic

Basic Information

Paper ID: 2510.14024
Title: Efficiently Executing High-throughput Lightweight LLM Inference Applications on Heterogeneous Opportunistic GPU Clusters with Pervasive Context Management
Authors: Thanh Son Phung, Douglas Thain (University of Notre Dame)
Category: cs.DC (Distributed, Parallel, and Cluster Computing)
Published: 2025 (arXiv preprint)
Paper link: https://arxiv.org/abs/2510.14024

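Abstract

The rise of generative AI has introduced a new class of HPC workloads that integrate lightweight LLMs with traditional high-throughput applications to accelerate scientific discovery. Current HPC cluster designs cannot adequately support these workloads: they either incur long wait times in static batch queues or repeatedly pay expensive LLM startup costs when resources are preempted. To avoid both long queues and high startup costs, the paper proposes "decoupling" the LLM initialization context from the actual LLM inference and retaining the context on the GPUs until it is no longer needed, a technique called "Pervasive Context Management". After a fact verification application was transformed to use this technique, its execution time dropped by 72.1% (from 3 hours to 48 minutes), and it scaled opportunistically onto 32.8% of the cluster's GPUs, further reducing execution time to 13 minutes.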

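Research Background and Motivation

Problem Definition

With the rapid development of large language model (LLM) technology, a new class of HPC workloads is emerging that integrates lightweight LLM inference (typically models with billions of parameters) into traditional high-throughput applications. Such applications show great potential in areas such as protein folding and distributed AI-driven scientific computing.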

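Core Challenges

Limitations of the static allocation model: Traditional static GPU allocation requires exclusive access to a fixed-size batch of GPUs, which leads to long queue wait times and underutilized cluster resources.
Startup cost under opportunistic allocation: Opportunistic allocation can exploit dynamically available GPUs, but LLM startup (loading a multi-billion-parameter model from the distributed file system to local disk, then to host memory, and finally to GPU memory) is I/O-intensive and can take several minutes.
Cost of resource preemption: When a task is preempted, the entire expensive startup process must be repeated on new resources, so the startup cost often exceeds the actual computation time.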
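
The startup-cost argument above can be made concrete with a back-of-the-envelope model. The sketch below is illustrative only: the function name `total_runtime` and every timing value are hypothetical and are not measurements from the paper.

```python
def total_runtime(num_tasks: int, startup_s: float, infer_s: float,
                  reuse_context: bool) -> float:
    """Serial runtime on one GPU, in seconds.

    Without reuse, every task pays the model startup cost.
    With reuse, the startup cost is paid once and amortized over all tasks.
    """
    if num_tasks <= 0:
        return 0.0
    startups = 1 if reuse_context else num_tasks
    return startups * startup_s + num_tasks * infer_s


# Hypothetical numbers: 120 s to load the model, 30 s of inference per task.
no_reuse = total_runtime(100, startup_s=120.0, infer_s=30.0, reuse_context=False)
with_reuse = total_runtime(100, startup_s=120.0, infer_s=30.0, reuse_context=True)
startup_share = (100 * 120.0) / no_reuse  # fraction of time spent on startup
print(f"without reuse: {no_reuse:.0f} s, with reuse: {with_reuse:.0f} s")
```

Under these made-up values, startup dominates the runtime when nothing is reused, which is the situation the preemption challenge describes.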

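Limitations of Existing Approaches

Autoscaling frameworks: These are designed around a proactive principle and are ill-suited to opportunistic HPC environments, where the application can only react passively to resource availability.
Traditional fault-tolerance techniques: Mechanisms such as checkpointing protect only computational progress and do not address the cost of model loading.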

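Core Contributions

Proposed the Pervasive Context Management technique: it promotes the LLM initialization context to a first-class persistent entity in the cluster, enabling reuse across multiple tasks.
Implemented a high-throughput fact verification application on the Parsl-TaskVine framework: it demonstrates the use of lightweight LLMs within a distributed data-intensive framework.
Designed a quick application transformation method: a simple code refactoring makes the application context-aware.
Validated significant performance gains: execution time drops by 72.1% with the same number of GPUs, and the application scales opportunistically to 32.8% of the cluster's GPUs.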

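Method Details

Task Definition

This work targets high-throughput lightweight LLM inference applications, specifically scenarios that execute large numbers of independent inference tasks on heterogeneous, opportunistic GPU clusters. The input is a large set of inference requests, the output is the inference results, and the constraints are dynamically available GPU resources and unpredictable preemption.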
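
The task model above (many independent batches, any of which may be preempted and rerun) can be sketched with the standard library alone. This is a minimal, deterministic illustration and not the paper's scheduler: `chunk`, `run_with_requeue`, and the `preempted` set of (batch id, attempt) pairs are all hypothetical names introduced here.

```python
from collections import deque
from typing import Callable, Dict, List, Set, Tuple


def chunk(items: List[str], size: int) -> List[List[str]]:
    """Split a list of requests into independent fixed-size batches."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_with_requeue(
    batches: List[List[str]],
    run_batch: Callable[[List[str]], List[str]],
    preempted: Set[Tuple[int, int]],
) -> Dict[int, List[str]]:
    """Run every batch to completion, re-queueing preempted attempts.

    `preempted` lists the (batch_id, attempt) pairs to treat as preempted;
    it is a deterministic stand-in for unpredictable resource loss.
    """
    queue = deque((batch_id, 0) for batch_id in range(len(batches)))
    results: Dict[int, List[str]] = {}
    while queue:
        batch_id, attempt = queue.popleft()
        if (batch_id, attempt) in preempted:
            # The attempt is lost; the batch goes back to the end of the queue.
            queue.append((batch_id, attempt + 1))
            continue
        results[batch_id] = run_batch(batches[batch_id])
    return results


claims = [f"claim-{i}" for i in range(5)]
batches = chunk(claims, 2)
# The first attempt of batch 1 is preempted, so it completes on its second attempt.
results = run_with_requeue(batches, lambda b: [c.upper() for c in b], {(1, 0)})
```

Because the batches are independent, a preempted batch can simply be rerun elsewhere; what makes the rerun expensive in practice is the model startup it triggers.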

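Core Architecture: Pervasive Context Management

1. Overall Design Philosophy

The core idea of Pervasive Context Management is to decouple expensive LLM context initialization from the actual inference execution, making the context a first-class entity that can be persisted and reused across cluster nodes.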
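
The decoupling idea can be illustrated without any cluster software. The sketch below assumes a hypothetical worker-local `ContextRegistry` and a stand-in `fake_load_model`; neither name comes from the paper, Parsl, or TaskVine, and a real system would hold the loaded model in GPU memory rather than a Python closure.

```python
from typing import Any, Callable, Dict, Hashable, List


class ContextRegistry:
    """Worker-local cache of expensive-to-build contexts (e.g. a loaded model)."""

    def __init__(self) -> None:
        self._contexts: Dict[Hashable, Any] = {}
        self.loads = 0  # how many times a loader actually ran

    def get(self, key: Hashable, loader: Callable[[], Any]) -> Any:
        """Return the cached context, building it only on first use."""
        if key not in self._contexts:
            self._contexts[key] = loader()
            self.loads += 1
        return self._contexts[key]

    def release(self, key: Hashable) -> None:
        """Drop a context once it is no longer needed."""
        self._contexts.pop(key, None)


def fake_load_model() -> Callable[[str], str]:
    """Stand-in for a slow model load; returns a trivial 'model'."""
    return lambda claim: f"verdict({claim})"


registry = ContextRegistry()


def infer(claims: List[str]) -> List[str]:
    # Inference tasks only look up the context; they never rebuild it.
    model = registry.get("model", fake_load_model)
    return [model(claim) for claim in claims]


first = infer(["a", "b"])
second = infer(["c"])
print(registry.loads)  # the loader ran once for both tasks
```

Separating `get` from the inference function is the design choice that matters here: the context outlives any single task, so its cost is paid once per worker instead of once per task.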

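2. Technical Implementation Framework

Built on the integrated Parsl-TaskVine framework:

Parsl: a Python-native parallel library that lets users express computation as ordinary Python functions.
TaskVine: a low-level, data-intensive workflow execution engine that handles inter-task relationships and scheduling optimizations.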

3. Context Management Mechanism

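from parsl import python_app
from transformers import AutoModel

# Traditional approach (context-agnostic): every task reloads the model
@python_app
def infer(model_path, claims):
    # 'cuda' is the PyTorch device string for NVIDIA GPUs; 'gpu' is not valid
    model = AutoModel.from_pretrained(model_path).to('cuda')
    # Tokenization and decoding are elided, as in the original listing
    verdicts = [model.generate(claim) for claim in claims]
    return verdicts

# Improved approach (context-aware): the model is loaded once and retained for later tasks
def load_model(model_path):
    # Builds the context; the returned dict names the variables to retain
    model = AutoModel.from_pretrained(model_path).to('cuda')
    return {'model': model}

@python_app
def infer_model(claims, parsl_spec):
    # load_variable_from_serverless is the helper named in the source listing;
    # its import is not shown there. It fetches the already-loaded model.
    model = load_variable_from_serverless('model')
    verdicts = [model.generate(claim) for claim in claims]
    return verdicts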