Cortex: Workflow-Aware Resource Pooling and Scheduling for Agentic Serving

Pagonas, Chung, Kaffes et al.

We introduce Cortex, a prototype workflow-aware serving platform designed for agentic workloads. The core principle of Cortex is stage isolation: it provisions dedicated resource pools for each distinct stage of an agentic workflow. This simple yet powerful strategy mitigates inter-stage interference in compute and memory, leading to better KV cache utilization, higher throughput, and more predictable performance. By customizing resource allocation and scheduling within each distinct stage of agentic workflows, Cortex lays the groundwork for more advanced, agent-native serving paradigms, including malleable resource management, speculative execution of workflow branches, and a shared, multi-tiered cache for "agentic state."

Basic Information

Paper ID: 2510.14126
Title: Cortex: Workflow-Aware Resource Pooling and Scheduling for Agentic Serving
Authors: Nikos Pagonas (Columbia University), Yeounoh Chung (Google), Kostis Kaffes (Columbia University), Arvind Krishnamurthy (Google & University of Washington)
Classification: cs.DC (Distributed, Parallel, and Cluster Computing)
Publication Date: October 15, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.14126

Abstract

This paper introduces Cortex, a prototype workflow-aware serving platform designed for agentic workloads. The core principle of Cortex is stage isolation: it provisions a dedicated resource pool for each distinct stage of an agentic workflow. This simple yet powerful strategy mitigates inter-stage interference in both compute and memory, leading to better KV cache utilization, higher throughput, and more predictable performance. By customizing resource allocation and scheduling within each distinct stage of an agentic workflow, Cortex lays the groundwork for more advanced agent-native serving paradigms, including malleable resource management, speculative execution of workflow branches, and a shared, multi-tiered cache for "agentic state."

Research Background and Motivation

Problem Definition

Agentic workflows combine large language model (LLM) inference with iterative tool use: the model observes intermediate results, reasons, invokes another tool, and repeats until the task is solved or the budget is exhausted. This closed-loop pattern is increasingly important in production-level applications, such as natural language to SQL (NL2SQL) agents.

Limitations of Existing Approaches

Current LLM serving platforms suffer from the following issues:

Workflow Insensitivity: Popular LLM serving frameworks (such as vLLM) treat each stage as an independent LLM call, employing first-come-first-served (FCFS) scheduling
Lack of Structure Awareness: Existing agentic serving platforms (such as Autellix) employ complex priority strategies but lack understanding of internal workflow structure
Wasted Caching Opportunities: Repeated work is not reused. Five iterative refinement attempts trigger five identical prompt constructions and five identical warm-cache SQL executions
Scheduling Blindness: LLM calls are scheduled without knowledge of remaining workflow, ignoring downstream costs

Research Motivation

The authors observe that a single shared "general-purpose" LLM engine pool is unsuitable for agentic workflows containing heterogeneous stages. Each stage (SQL generation, execution, error correction) exhibits different latency profiles, memory requirements, and caching opportunities.

Core Contributions

Proposed the Cortex Architecture: The first workflow-aware serving platform based on stage isolation, providing dedicated engine pools for each workflow stage
Achieved Significant KV Cache Optimization: Substantially reduced KV cache memory usage through stage isolation, improving GPU memory utilization
Eliminated Cross-Stage Interference: Restored stable stage-local latency models, enhancing performance predictability
Designed an Agent-Native Serving Framework: Established the foundation for malleable workflows, speculative execution, and agentic state management

Methodology Details

Task Definition

Using the NL2SQL workflow as an example, the input is a natural language query (e.g., "What were the sales in Europe last quarter?"), and the output is a successfully executed SQL query result. The workflow comprises:

Retrieving the target schema
Autoregressive generation of candidate queries
Query execution
Result set validation
If the query fails, correction and retry
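
The five steps above can be sketched as a small retry loop. This is an illustration only, not the paper's implementation: `toy_generator` is a hypothetical stand-in for the LLM generation and correction stages, the database is an in-memory SQLite instance, and "validation" is reduced to checking that the query returned rows.

```python
import sqlite3
from typing import Callable, Optional

# Hypothetical stand-in for the LLM stages: (question, schema, last_error) -> SQL text.
Generator = Callable[[str, str, Optional[str]], str]


def fetch_schema(conn: sqlite3.Connection) -> str:
    """Step 1: retrieve the target schema as CREATE TABLE statements."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND sql IS NOT NULL"
    ).fetchall()
    return "\n".join(row[0] for row in rows)


def run_nl2sql(conn, question, generate, max_attempts=3):
    """Generate, execute, validate, and retry until a query succeeds."""
    schema = fetch_schema(conn)
    last_error = None
    for attempt in range(1, max_attempts + 1):
        # Steps 2 and 5: generation, or correction when last_error is set.
        sql = generate(question, schema, last_error)
        try:
            rows = conn.execute(sql).fetchall()  # Step 3: execution
        except sqlite3.Error as exc:
            last_error = str(exc)
            continue
        if rows:  # Step 4: a deliberately simple validity check (assumption)
            return sql, rows, attempt
        last_error = "query returned no rows"
    raise RuntimeError(f"no valid query after {max_attempts} attempts: {last_error}")


def toy_generator(question, schema, last_error):
    """Toy generator: misspells a column first, then fixes it once it sees an error."""
    column = "regionn" if last_error is None else "region"
    return f"SELECT SUM(amount) FROM sales WHERE {column} = 'Europe'"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("Europe", 10.0), ("Europe", 5.0), ("Asia", 7.0)],
)
sql, rows, attempts = run_nl2sql(conn, "Sales in Europe?", toy_generator)
print(attempts, rows)
```

Each pass through the loop touches a different kind of resource (LLM decoding for generation and correction, a database executor for execution), which is the heterogeneity that stage isolation targets.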

Core Architecture Design

Stage Isolation Principle

Cortex provides dedicated engine pools for each workflow stage. An engine pool is a group of homogeneous workers (e.g., GPUs for LLM decoding or CPU executors for SQL execution), managed by a stage-local scheduler with its own queue, cache, and scaling policies.
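
One way to picture an engine pool is as a record that bundles workers with stage-local state. The sketch below is a minimal illustration under that reading; the class names `EnginePool` and `Router`, the field names, and the FIFO queue are assumptions made for this example, not the paper's data structures.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class EnginePool:
    """One pool per workflow stage: homogeneous workers plus stage-local state."""

    stage: str
    worker_kind: str  # e.g. "gpu" for LLM decoding, "cpu" for SQL execution
    workers: int
    min_workers: int = 1  # hypothetical scaling bounds
    max_workers: int = 8
    queue: deque = field(default_factory=deque)  # stage-local queue
    cache: dict = field(default_factory=dict)  # stage-local cache

    def submit(self, request_id: str, payload: str) -> None:
        self.queue.append((request_id, payload))

    def next_batch(self, max_batch: int) -> list:
        """Pop up to max_batch queued items; they all belong to this stage."""
        batch = []
        while self.queue and len(batch) < max_batch:
            batch.append(self.queue.popleft())
        return batch


class Router:
    """Sends each subcall to the pool that owns its stage."""

    def __init__(self, pools):
        self.pools = {pool.stage: pool for pool in pools}

    def dispatch(self, stage: str, request_id: str, payload: str) -> None:
        self.pools[stage].submit(request_id, payload)


router = Router(
    [
        EnginePool("sql_generation", "gpu", workers=2),
        EnginePool("sql_correction", "gpu", workers=1),
        EnginePool("sql_execution", "cpu", workers=4),
    ]
)
router.dispatch("sql_generation", "req-1", "What were the sales in Europe last quarter?")
router.dispatch("sql_execution", "req-0", "SELECT 1")
```

Because every batch drawn from a pool contains calls from a single stage, the queue, cache, and scaling bounds can be tuned per stage without affecting the others.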

System Components

Orchestrator:
- Workflow-aware, tracking each request's position in the graph
- Predicting the next set of eligible operators
- Attaching priority keys based on SLO slack, stage selectivity, and expected service time
Engine Allocation Layer:
- Routing subcalls to concrete pool instances that maximize locality
- Balancing load across replicas
- Reordering requests based on priority
- Performing admission control when a stage becomes a bottleneck
Resource Borrowing Mechanism: When load and memory pressure are sufficiently low, the orchestrator can opportunistically allow compatible stages to borrow idle engines to reduce fragmentation and improve utilization.
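
The summary names the inputs to the priority key (SLO slack, stage selectivity, expected service time) but not how they are combined. The sketch below shows one plausible combination and should be read as an assumption throughout: "selectivity" is interpreted as the probability that a request continues to downstream stages, the queue orders by least slack first, admission control is a simple depth limit, and the borrowing thresholds are made-up values.

```python
import heapq
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class SubCall:
    request_id: str
    stage: str
    deadline_s: float  # absolute SLO deadline
    expected_service_s: float  # expected service time at this stage
    continue_prob: float  # assumed reading of "stage selectivity"
    downstream_s: float  # expected cost of the remaining workflow


def priority_key(call: SubCall, now_s: float) -> tuple:
    """Smaller key = more urgent: least slack first, then shortest job first."""
    expected_remaining = call.expected_service_s + call.continue_prob * call.downstream_s
    slack = call.deadline_s - now_s - expected_remaining
    return (slack, call.expected_service_s)


class StageQueue:
    """Stage-local priority queue with a depth-based admission check."""

    def __init__(self, max_depth: int):
        self.max_depth = max_depth
        self._heap = []
        self._tie = itertools.count()  # tie-breaker so SubCall is never compared

    def admit(self, call: SubCall, now_s: float) -> bool:
        if len(self._heap) >= self.max_depth:
            return False  # stage is a bottleneck: reject or defer upstream
        heapq.heappush(self._heap, (priority_key(call, now_s), next(self._tie), call))
        return True

    def pop(self) -> SubCall:
        return heapq.heappop(self._heap)[2]


def may_borrow(donor_utilization: float, donor_mem_pressure: float,
               max_utilization: float = 0.3, max_mem_pressure: float = 0.5) -> bool:
    """Allow borrowing only when the donor pool is lightly loaded (hypothetical thresholds)."""
    return donor_utilization <= max_utilization and donor_mem_pressure <= max_mem_pressure
```

Folding the expected downstream cost into the slack is what distinguishes this from per-call scheduling: two calls with the same deadline are ordered differently when one still has more of the workflow ahead of it.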

Technical Innovations

KV Cache Optimization

Through stage isolation, each engine maintains only its stage-specific context, whereas shared engines must maintain hot caches for both stages on each replica, effectively duplicating KV cache memory usage. Recovered GPU memory increases effective batch size, directly translating to higher throughput and tighter tail latency.
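
A back-of-the-envelope calculation makes the duplication argument concrete. The per-token formula is the standard KV cache size for a transformer (keys plus values, per layer, per KV head); the model shape, prefix lengths, and replica counts are hypothetical values chosen for illustration and do not come from the paper.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int = 2) -> int:
    """Keys and values (factor 2) for every layer and KV head, fp16 by default."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def shared_pool_tokens(prefix_tokens: dict, replicas: int) -> int:
    """Every shared replica keeps the hot prefix of every stage."""
    return replicas * sum(prefix_tokens.values())


def isolated_pool_tokens(prefix_tokens: dict, replicas_per_stage: dict) -> int:
    """Each replica keeps only the hot prefix of its own stage."""
    return sum(replicas_per_stage[stage] * n for stage, n in prefix_tokens.items())


# Hypothetical numbers: a 32-layer model with 8 KV heads of dimension 128,
# and stage-specific hot prefixes of 4000 and 3000 tokens.
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
prefixes = {"sql_generation": 4000, "sql_correction": 3000}

shared = shared_pool_tokens(prefixes, replicas=4)
isolated = isolated_pool_tokens(prefixes, {"sql_generation": 2, "sql_correction": 2})

print(f"shared:   {shared * per_token / 2**30:.2f} GiB of hot KV cache")
print(f"isolated: {isolated * per_token / 2**30:.2f} GiB of hot KV cache")
```

With the same four GPUs, the isolated layout in this toy setup holds half as many hot prefix tokens; the difference is memory that can instead hold additional in-flight sequences.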

Performance Predictability

Stage isolation eliminates cross-stage interference that undermines predictability. When heterogeneous calls share engines, batching couples their runtimes, delaying token emission and making LLM call latency dependent on its batch companions.
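
A toy cost model illustrates how batching couples runtimes. It assumes, purely for illustration, that one engine iteration costs a fixed overhead plus a per-token cost over all tokens processed in that iteration; the constants are invented and real engines behave in more complex ways.

```python
def step_time_ms(tokens_in_step, base_ms: float = 5.0, per_token_ms: float = 0.02) -> float:
    """Toy model: iteration time grows with the total tokens processed in it."""
    return base_ms + per_token_ms * sum(tokens_in_step)


# Eight decoding requests each emit one token per iteration.
decode_only = [1] * 8

# Same batch, but a 2000-token prompt from another stage is prefilled alongside.
decode_plus_prefill = decode_only + [2000]

isolated_ms = step_time_ms(decode_only)
shared_ms = step_time_ms(decode_plus_prefill)
print(f"per-token latency, isolated pool: {isolated_ms:.2f} ms")
print(f"per-token latency, shared pool:   {shared_ms:.2f} ms")
```

In this model the decoding requests did nothing differently, yet their token emission slows whenever a long prompt from another stage joins the batch, so a latency model fitted to one stage stops being stable once stages share an engine.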

Independent Scaling

Stage isolation also enables independent scaling and configuration: a lightweight monitor scales only the pools that threaten the SLO, so stages that run once per request can be provisioned lightly while pools on the critical path receive more resources.
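
A monitor of this kind can be very small. The sketch below assumes each pool reports a p95 latency and has its own SLO; the `PoolStats` fields, the 90% headroom threshold, and the add-one-replica policy are illustrative assumptions, not details from the paper.

```python
from dataclasses import dataclass


@dataclass
class PoolStats:
    stage: str
    replicas: int
    p95_latency_ms: float
    slo_ms: float
    max_replicas: int = 8  # hypothetical upper bound


def scaling_decisions(stats, headroom: float = 0.9) -> dict:
    """Return {stage: new_replica_count} only for pools close to or over their SLO."""
    decisions = {}
    for pool in stats:
        threatened = pool.p95_latency_ms > headroom * pool.slo_ms
        if threatened and pool.replicas < pool.max_replicas:
            decisions[pool.stage] = pool.replicas + 1
    return decisions


snapshot = [
    PoolStats("sql_generation", replicas=2, p95_latency_ms=950.0, slo_ms=1000.0),
    PoolStats("sql_correction", replicas=1, p95_latency_ms=300.0, slo_ms=1000.0),
    PoolStats("sql_execution", replicas=4, p95_latency_ms=40.0, slo_ms=200.0),
]
print(scaling_decisions(snapshot))
```

Only the generation pool is scaled in this snapshot; the correction and execution pools keep their current size, which is the behavior a single shared pool cannot express.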

Experimental Setup

Experimental Scenarios

The paper uses the NL2SQL workflow as the primary experimental scenario, comprising two LLM stages and one non-LLM stage:

SQL Generator
SQL Error Corrector
SQL Executor (non-LLM stage)

Evaluation Metrics

KV cache memory usage
Total memory footprint
System throughput
Tail latency

Baseline Comparisons

Shared Engine Pool: All stages share the same set of LLM engines
Cortex Stage Isolation: Each stage uses dedicated engine pools

Experimental Results

Primary Results

KV Cache Optimization Effects

The results show that when each LLM stage of the NL2SQL workflow runs in its own Cortex pool, the total KV cache footprint is lower than with a shared pool, because each engine maintains only its stage-specific context.

Performance Improvements

Memory Efficiency: Through stage isolation, KV cache duplication is avoided, freeing valuable GPU memory
Throughput Enhancement: Recovered GPU memory directly translates to higher effective batch sizes
Latency Improvement: Tighter tail latency and more predictable performance

System Advantage Verification

Experiments validate Cortex's three primary advantages:

Improved KV Cache Utilization: Significant reduction in memory footprint
Eliminated Cross-Stage Interference: Restored stable stage-local latency models
Independent Scaling Capability: Supporting fine-grained resource management

Related Work

LLM Serving Frameworks

vLLM: Efficient large language model serving with PagedAttention for memory management
SGLang: Efficient execution of structured language model programs

Agentic Serving Platforms

Autellix: Efficient serving engine for LLM agents, employing complex priority strategies
HEXGEN-TEXT2SQL: NL2SQL workflow request scheduling based on remaining deadline slack and estimated execution time

Technical Differentiation

Existing platforms lack awareness of internal workflow structure; Cortex addresses this gap through stage isolation.

Conclusions and Discussion

Main Conclusions

Cortex significantly improves serving performance for agentic workloads through a simple yet effective stage isolation strategy. This approach not only enhances resource utilization efficiency but also establishes the foundation for more advanced agent-native serving paradigms.

Future Directions

Malleable Workflows and Resources

Computational Adaptivity: Replace heavyweight models with lightweight variants when latency approaches SLO boundaries
Resource Elasticity: Employ more powerful engines to boost stragglers in fan-out patterns
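
The computational adaptivity idea can be sketched as a model choice driven by the remaining latency budget. The model names and latency estimates below are hypothetical, and the policy (pick the most capable model that still fits, otherwise fall back to the lightest) is one simple possibility rather than the authors' design.

```python
def choose_model(elapsed_ms: float, slo_ms: float, est_latency_ms: dict) -> str:
    """Pick the most capable model whose estimated latency fits the remaining budget.

    est_latency_ms must be ordered from most capable to least capable;
    Python dicts preserve insertion order.
    """
    remaining_ms = slo_ms - elapsed_ms
    for model, estimate in est_latency_ms.items():
        if estimate <= remaining_ms:
            return model
    # Nothing fits: fall back to the lightest variant to limit the overrun.
    return list(est_latency_ms)[-1]


# Hypothetical model variants and latency estimates.
variants = {"large": 1200.0, "medium": 500.0, "small": 150.0}

print(choose_model(elapsed_ms=200.0, slo_ms=2000.0, est_latency_ms=variants))
print(choose_model(elapsed_ms=1600.0, slo_ms=2000.0, est_latency_ms=variants))
```

Early in a request the heavyweight model fits the budget; after several retries have consumed most of the SLO, the same call is routed to a lighter variant.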

Speculative Execution

Speculate on the most probable branches in workflows
Warm up relevant engines or pre-execute next steps
Generate and evaluate multiple candidate queries in parallel
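
Evaluating several candidate queries in parallel can be sketched with a thread pool. This is a generic illustration: `execute` is any caller-supplied function that raises on failure, the toy executor below stands in for a real database, and returning the first success in candidate order is an assumed policy.

```python
from concurrent.futures import ThreadPoolExecutor


def first_success(candidates, execute, max_workers: int = 4):
    """Run all candidates concurrently; return (candidate, result) for the
    first one, in candidate order, that executes without raising."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(execute, candidate) for candidate in candidates]
        for candidate, future in zip(candidates, futures):
            try:
                return candidate, future.result()
            except Exception:
                continue  # this branch failed; try the next speculated one
    return None


def toy_execute(sql: str):
    """Toy executor standing in for a database: only one query shape is valid."""
    if "region" not in sql or "regionn" in sql:
        raise ValueError(f"invalid query: {sql}")
    return [(15.0,)]


candidates = [
    "SELECT SUM(amount) FROM sales WHERE regionn = 'Europe'",
    "SELECT SUM(amount) FROM sales WHERE region = 'Europe'",
    "SELECT SUM(amount) FROM sales WHERE country = 'Europe'",
]
winner = first_success(candidates, toy_execute)
print(winner)
```

The trade-off is extra work for lower latency: failed branches still consume executor time, so speculation pays off mainly when idle capacity exists in the relevant pool.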

Agentic State Management

Multi-tier "agentic state" treating intermediate data as first-class citizens
Workflow-scoped shared layers as publish/subscribe structures
Transform repeated tool and LLM calls into zero-cost hits
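
The last point, turning repeated calls into cache hits, can be illustrated with a small memoizing store keyed by workflow, call type, and arguments. The class name and key scheme are assumptions for this example, and the sketch models only a single in-memory tier; the multi-tier and publish/subscribe aspects described above are not represented.

```python
import hashlib
import json


class AgenticStateCache:
    """Single-tier, in-memory memoization of tool and LLM call results."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(workflow_id: str, kind: str, args: dict) -> str:
        # Canonical JSON so that argument order does not change the key.
        blob = json.dumps([workflow_id, kind, args], sort_keys=True)
        return hashlib.sha256(blob.encode("utf-8")).hexdigest()

    def get_or_compute(self, workflow_id: str, kind: str, args: dict, compute):
        key = self._key(workflow_id, kind, args)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = compute()
        self._store[key] = value
        return value


cache = AgenticStateCache()
calls = []


def run_query():
    calls.append("executed")  # records how often the expensive call really runs
    return [(15.0,)]


args = {"sql": "SELECT SUM(amount) FROM sales WHERE region = 'Europe'"}
for _ in range(5):
    result = cache.get_or_compute("wf-1", "sql_execution", args, run_query)
print(cache.hits, cache.misses, len(calls))
```

Five identical executions collapse into one real call and four hits. A production design would also need invalidation when the underlying data changes, which this sketch omits.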

Limitations

Prototype Stage: Currently remains a proof-of-concept, requiring more comprehensive implementation and evaluation
Scenario Constraints: Primarily exemplified by NL2SQL, requiring validation across more agentic workflows
Complexity Management: How to design interfaces that let workflows declare their malleability remains an open challenge

In-Depth Evaluation

Strengths

Strong Innovation: First to propose workflow-aware agentic serving architecture
Accurate Problem Identification: Precisely identifies key issues in existing LLM serving platforms
Simple and Effective Solution: Stage isolation strategy is straightforward yet impactful
Forward-Looking: Provides clear development trajectory for future agent-native serving

Weaknesses

Limited Experimental Validation: Primarily based on a single NL2SQL scenario, lacking large-scale diverse experiments
Insufficient Quantitative Results: Charts display trends but lack specific performance improvement figures
Insufficient Implementation Details: Descriptions of scheduling algorithms and resource allocation strategies lack specificity
Incomplete Comparative Experiments: Primarily compared against simple shared pool schemes, lacking comparison with other advanced methods

Impact

Academic Value: Provides new research directions for the agentic serving field
Practical Value: Addresses important problems in actual production environments
Inspirational Value: Provides valuable insights for subsequent related research

Applicable Scenarios

Multi-Stage Agentic Workflows: Particularly suitable for agentic applications with clear stage divisions
Resource-Constrained Environments: Most beneficial where resources such as GPU memory are scarce
High-Performance Requirement Scenarios: Production environments with strict latency and throughput requirements

References

The paper cites the following key literature:

vLLM: PagedAttention memory management mechanism
SGLang: Structured language model program execution
Autellix: LLM agentic serving engine
HEXGEN-TEXT2SQL: Agentic workflow scheduling
Related NL2SQL and cloud services literature

Overall Assessment: This is an innovative and forward-looking paper that identifies important problems in agentic serving and proposes an effective solution. Although the system is still a prototype, the paper charts a clear direction for the field and has significant academic and practical value.