Cortex: Workflow-Aware Resource Pooling and Scheduling for Agentic Serving
- Paper ID: 2510.14126
- Title: Cortex: Workflow-Aware Resource Pooling and Scheduling for Agentic Serving
- Authors: Nikos Pagonas (Columbia University), Yeounoh Chung (Google), Kostis Kaffes (Columbia University), Arvind Krishnamurthy (Google & University of Washington)
- Classification: cs.DC (Distributed, Parallel, and Cluster Computing)
- Publication Date: October 15, 2025 (arXiv preprint)
- Paper Link: https://arxiv.org/abs/2510.14126
This paper introduces Cortex, a prototype workflow-aware serving platform designed for agentic workloads. The core principle of Cortex is stage isolation: it provisions a dedicated resource pool for each distinct stage of an agentic workflow. This simple yet powerful strategy mitigates inter-stage interference in both compute and memory, leading to better KV cache utilization, higher throughput, and more predictable performance. By customizing resource allocation and scheduling within each stage, Cortex lays the groundwork for more advanced agent-native serving paradigms, including malleable resource management, speculative execution of workflow branches, and a shared, multi-tiered cache for "agentic state."
Agentic workflows combine large language model (LLM) inference with iterative tool use: the model observes intermediate results, reasons, invokes another tool, and repeats until the task is solved or the budget is exhausted. This closed-loop pattern is increasingly important in production-level applications, such as natural language to SQL (NL2SQL) agents.
Current LLM serving platforms suffer from the following issues:
- Workflow Insensitivity: Popular LLM serving frameworks (such as vLLM) treat each stage as an independent LLM call, employing first-come-first-served (FCFS) scheduling
- Lack of Structure Awareness: Existing agentic serving platforms (such as Autellix) employ complex priority strategies but lack understanding of internal workflow structure
- Wasted Caching Opportunities: Five iterative attempts to improve a pattern produce five identical prompt constructions and five identical warm-cache SQL executions, and none of that repeated work is reused
- Scheduling Blindness: LLM calls are scheduled without knowledge of remaining workflow, ignoring downstream costs
The authors observe that a single shared "general-purpose" LLM engine pool is unsuitable for agentic workflows containing heterogeneous stages. Each stage (SQL generation, execution, error correction) exhibits different latency profiles, memory requirements, and caching opportunities.
Main contributions:
- Proposed the Cortex Architecture: The first workflow-aware serving platform based on stage isolation, providing dedicated engine pools for each workflow stage
- Achieved Significant KV Cache Optimization: Substantially reduced KV cache memory usage through stage isolation, improving GPU memory utilization
- Eliminated Cross-Stage Interference: Restored stable stage-local latency models, enhancing performance predictability
- Designed an Agent-Native Serving Framework: Established the foundation for malleable workflows, speculative execution, and agentic state management
Using the NL2SQL workflow as an example, the input is a natural language query (e.g., "What were the sales in Europe last quarter?"), and the output is a successfully executed SQL query result. The workflow comprises:
- Retrieving the target schema
- Autoregressive generation of candidate queries
- Query execution
- Result set validation
- If the query fails, correction and retry
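The five steps above can be sketched as a small retry loop. This is a minimal illustration, not the paper's implementation: `run_nl2sql`, `fake_llm`, the `sales` table, and the "non-empty result" validation rule are all hypothetical stand-ins, and SQLite replaces whatever database a real deployment would use.

```python
import sqlite3
from typing import Callable, Optional


def run_nl2sql(question: str,
               conn: sqlite3.Connection,
               generate_sql: Callable[[str, str, Optional[str]], str],
               max_attempts: int = 3):
    """Generate, execute, validate, and retry until success or budget exhaustion."""
    # Step 1: retrieve the target schema (here: table DDL stored by SQLite).
    rows = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table'").fetchall()
    schema = "\n".join(r[0] for r in rows)

    error = None
    for attempt in range(1, max_attempts + 1):
        # Step 2 on the first pass, step 5 (correction) on later passes.
        sql = generate_sql(question, schema, error)
        try:
            # Step 3: execute the candidate query.
            result = conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            error = str(exc)
            continue
        # Step 4: validate the result set (assumed rule: it must be non-empty).
        if result:
            return sql, result, attempt
        error = "query returned no rows"
    raise RuntimeError(f"no valid query after {max_attempts} attempts: {error}")


def fake_llm(question: str, schema: str, error: Optional[str]) -> str:
    # Hypothetical stand-in for the LLM: the first answer misspells the table,
    # the correction call (error is set) fixes it.
    table = "sale" if error is None else "sales"
    return f"SELECT SUM(amount) FROM {table} WHERE region = 'Europe'"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("Europe", 100.0), ("Europe", 50.0), ("Asia", 70.0)])

sql, result, attempts = run_nl2sql(
    "What were the sales in Europe last quarter?", conn, fake_llm)
```

Each pass through the loop touches a different kind of resource (LLM decoding, then SQL execution), which is the heterogeneity that motivates per-stage pools.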
Cortex provides dedicated engine pools for each workflow stage. An engine pool is a group of homogeneous workers (e.g., GPUs for LLM decoding or CPU executors for SQL execution), managed by a stage-local scheduler with its own queue, cache, and scaling policies.
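One way to picture an engine pool is as a small record that owns its own queue, cache, and scaling bounds. The sketch below is an assumption-laden illustration: the class name `EnginePool`, its fields, and the "one request per worker" batching rule are hypothetical and not taken from the paper.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass
class EnginePool:
    """A group of homogeneous workers serving exactly one workflow stage."""
    stage: str            # e.g. "sql_generation"
    worker_kind: str      # e.g. "gpu" for LLM decoding, "cpu" for SQL execution
    num_workers: int      # current pool size
    max_workers: int      # upper bound for the pool's scaling policy
    queue: list = field(default_factory=list)   # stage-local priority queue
    cache: dict = field(default_factory=dict)   # stage-local cache
    seq: itertools.count = field(default_factory=itertools.count)

    def submit(self, priority: float, request: str) -> None:
        # The sequence number breaks ties so equal priorities stay FIFO.
        heapq.heappush(self.queue, (priority, next(self.seq), request))

    def next_batch(self) -> list:
        """Pop up to one request per worker, lowest priority key first."""
        batch = []
        while self.queue and len(batch) < self.num_workers:
            batch.append(heapq.heappop(self.queue)[2])
        return batch


pool = EnginePool(stage="sql_generation", worker_kind="gpu",
                  num_workers=2, max_workers=4)
pool.submit(5.0, "req-a")
pool.submit(1.0, "req-b")
pool.submit(3.0, "req-c")
batch = pool.next_batch()
```

Because the queue and cache live inside the pool, changing the scheduling policy of one stage does not affect any other stage.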
- Orchestrator:
  - Workflow-aware, tracking each request's position in the graph
  - Predicting the next set of eligible operators
  - Attaching priority keys based on SLO slack, stage selectivity, and expected service time
- Engine Allocation Layer:
  - Routing subcalls to concrete pool instances that maximize locality
  - Balancing load across replicas
  - Reordering requests based on priority
  - Performing admission control when a stage becomes a bottleneck
- Resource Borrowing Mechanism: When load and memory pressure are sufficiently low, the orchestrator can opportunistically allow compatible stages to borrow idle engines to reduce fragmentation and improve utilization.
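The components above can be illustrated with a short sketch. The summary does not give the actual formulas or thresholds, so everything here is an assumption: `priority_key` uses one possible reading of "selectivity" (the probability that a request continues to downstream stages), and `route`, `route_with_borrowing`, and their limits are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


def priority_key(now_s, deadline_s, expected_service_s, selectivity, downstream_s):
    """Remaining SLO slack in seconds; a smaller key means a more urgent request.

    Assumption: downstream work is weighted by the stage's selectivity.
    """
    remaining = expected_service_s + selectivity * downstream_s
    return deadline_s - now_s - remaining


@dataclass
class Replica:
    name: str
    queued: int = 0
    mem_used_frac: float = 0.0
    cached_prefixes: set = field(default_factory=set)


def route(replicas, prefix, max_queue=8) -> Optional[Replica]:
    """Prefer a replica that already caches the prefix, then the shortest queue.

    Returning None models admission control: every replica is saturated.
    """
    candidates = [r for r in replicas if r.queued < max_queue]
    if not candidates:
        return None
    best = min(candidates,
               key=lambda r: (prefix not in r.cached_prefixes, r.queued))
    best.queued += 1
    best.cached_prefixes.add(prefix)
    return best


def route_with_borrowing(home, lender, prefix, max_load=1, max_mem=0.5):
    """Fall back to idle engines of a compatible stage when the home pool is full."""
    chosen = route(home, prefix)
    if chosen is None:
        idle = [r for r in lender
                if r.queued <= max_load and r.mem_used_frac <= max_mem]
        chosen = route(idle, prefix)
    return chosen


gen = [Replica("gen-0", queued=3, cached_prefixes={"schema_A"}), Replica("gen-1")]
fix = [Replica("fix-0", mem_used_frac=0.2)]

first = route(gen, "schema_A")    # locality wins despite the longer queue
second = route(gen, "schema_B")   # no cached copy anywhere: least-loaded replica
for r in gen:                     # simulate a saturated generator pool
    r.queued = 8
borrowed = route_with_borrowing(gen, fix, "schema_A")
slack = priority_key(now_s=0.0, deadline_s=10.0, expected_service_s=2.0,
                     selectivity=0.5, downstream_s=4.0)
```

Sorting on the tuple `(cache miss, queue length)` makes locality the primary criterion and load the tie-breaker; a real allocator would likely trade the two off rather than rank them strictly.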
Through stage isolation, each engine maintains only its stage-specific context, whereas shared engines must maintain hot caches for both stages on each replica, effectively duplicating KV cache memory usage. Recovered GPU memory increases effective batch size, directly translating to higher throughput and tighter tail latency.
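The duplication argument can be made concrete with back-of-the-envelope arithmetic. The numbers below are purely illustrative assumptions (a 32-layer model with 8 KV heads of dimension 128 in fp16, and made-up hot-prefix lengths); none of them come from the paper.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    # One key vector and one value vector per layer and per KV head.
    return 2 * layers * kv_heads * head_dim * dtype_bytes


per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)

# Assumed hot context (system prompt, schema, few-shot examples) per stage.
prefix_tokens = {"generator": 4000, "corrector": 3000}
replicas = 4

# Shared pool: every replica keeps the hot context of BOTH stages.
shared_bytes = replicas * sum(prefix_tokens.values()) * per_token

# Stage isolation: the same 4 replicas are split 2 + 2, and each replica
# keeps only the hot context of its own stage.
isolated_bytes = (2 * prefix_tokens["generator"]
                  + 2 * prefix_tokens["corrector"]) * per_token

saved_fraction = 1 - isolated_bytes / shared_bytes
```

Under these assumptions the hot-prefix footprint halves; the freed memory is what allows larger effective batch sizes. The actual saving depends on how replicas are split across stages and on how much context the stages have in common.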
Stage isolation eliminates cross-stage interference that undermines predictability. When heterogeneous calls share engines, batching couples their runtimes, delaying token emission and making LLM call latency dependent on its batch companions.
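A toy latency model shows how batching couples runtimes. It assumes that each decode iteration costs a fixed overhead plus a term proportional to the total context length in the batch, and it ignores context growth during decoding; the function names and constants are hypothetical and chosen only for illustration.

```python
def step_time_ms(batch_context_lens, base_ms=5.0, ms_per_1k_ctx=1.0):
    # Assumed cost model: fixed overhead plus attention work that grows with
    # the total number of context tokens in the batch.
    return base_ms + ms_per_1k_ctx * sum(batch_context_lens) / 1000


def decode_latency_ms(own_ctx, companion_ctxs, new_tokens):
    # Every generated token waits for one full batch iteration.
    return new_tokens * step_time_ms([own_ctx, *companion_ctxs])


# The same request (1k context, 100 output tokens) in two different batches.
homogeneous = decode_latency_ms(1000, [1000, 1000, 1000], new_tokens=100)
mixed = decode_latency_ms(1000, [8000, 8000, 8000], new_tokens=100)
```

The request itself is identical in both cases, yet its latency changes with its batch companions. In a stage-local pool the companions come from the same stage, so a per-stage latency model stays usable.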
Stage isolation also enables independent scaling and configuration: a lightweight monitor scales only the pools that threaten the SLO, so stages that run once can be provisioned lightly while pools on the critical path receive more capacity.
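A per-pool scaling monitor might look like the following sketch. The `PoolStats` fields, the 90% headroom threshold, and the "add one replica" rule are assumptions made for illustration; the summary does not describe the actual policy.

```python
from dataclasses import dataclass


@dataclass
class PoolStats:
    replicas: int
    max_replicas: int
    p95_latency_ms: float
    slo_ms: float


def scale_decisions(stats: dict, headroom: float = 0.9) -> dict:
    """Return {stage: new_replica_count} only for pools that threaten their SLO."""
    decisions = {}
    for stage, s in stats.items():
        threatened = s.p95_latency_ms > headroom * s.slo_ms
        if threatened and s.replicas < s.max_replicas:
            decisions[stage] = s.replicas + 1
    return decisions


stats = {
    "sql_generation": PoolStats(replicas=2, max_replicas=4,
                                p95_latency_ms=950.0, slo_ms=1000.0),
    "sql_correction": PoolStats(replicas=1, max_replicas=4,
                                p95_latency_ms=200.0, slo_ms=1000.0),
}
decisions = scale_decisions(stats)
```

Only the generator pool grows; the corrector pool, which is comfortably within its SLO, keeps its single replica.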
The paper employs the NL2SQL workflow as the primary experimental scenario, comprising two LLM stages and one non-LLM stage:
- SQL Generator
- SQL Error Corrector
- SQL Executor (non-LLM stage)
Evaluation metrics:
- KV cache memory usage
- Total memory footprint
- System throughput
- Tail latency
Compared configurations:
- Shared Engine Pool: All stages share the same set of LLM engines
- Cortex Stage Isolation: Each stage uses dedicated engine pools
Experimental results show that running the LLM stages of the NL2SQL workflow in dedicated Cortex pools noticeably lowers the total KV cache footprint compared with a shared pool, because each engine maintains only its stage-specific context.
- Memory Efficiency: Through stage isolation, KV cache duplication is avoided, freeing valuable GPU memory
- Throughput Enhancement: Recovered GPU memory directly translates to higher effective batch sizes
- Latency Improvement: Tighter tail latency and more predictable performance
Experiments validate Cortex's three primary advantages:
- Improved KV Cache Utilization: Significant reduction in memory footprint
- Eliminated Cross-Stage Interference: Restored stable stage-local latency models
- Independent Scaling Capability: Supporting fine-grained resource management
Related systems:
- vLLM: Efficient large language model serving with PagedAttention for memory management
- SGLang: Efficient execution of structured language model programs
- Autellix: Efficient serving engine for LLM agents, employing complex priority strategies
- HEXGEN-TEXT2SQL: NL2SQL workflow request scheduling based on remaining deadline slack and estimated execution time
Existing platforms lack awareness of internal workflow structure; Cortex addresses this gap through stage isolation.
Cortex significantly improves serving performance for agentic workloads through a simple yet effective stage isolation strategy. This approach not only enhances resource utilization efficiency but also establishes the foundation for more advanced agent-native serving paradigms.
Malleable resource management:
- Computational Adaptivity: Replace heavyweight models with lightweight variants when latency approaches SLO boundaries
- Resource Elasticity: Employ more powerful engines to boost stragglers in fan-out patterns
Speculative execution:
- Speculate on the most probable branches in workflows
- Warm up relevant engines or pre-execute next steps
- Generate and evaluate multiple candidate queries in parallel
Shared agentic state:
- Multi-tier "agentic state" treating intermediate data as first-class citizens
- Workflow-scoped shared layers as publish/subscribe structures
- Transform repeated tool and LLM calls into zero-cost hits
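The idea of turning repeated tool and LLM calls into cache hits can be sketched with a workflow-scoped memoization layer. This is a single-tier, in-memory illustration only: the class `WorkflowCache` and its interface are hypothetical, and the multi-tier and publish/subscribe aspects mentioned above are not modeled.

```python
import hashlib
import json


class WorkflowCache:
    """Memoizes stage outputs by (stage, payload) for the life of one workflow."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, stage: str, payload: dict, compute):
        # Canonical JSON makes the key independent of dict insertion order.
        raw = json.dumps([stage, payload], sort_keys=True).encode()
        key = hashlib.sha256(raw).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = compute(payload)
        self._store[key] = value
        return value


executions = []


def execute_sql(payload: dict) -> str:
    # Hypothetical stand-in for a real (and expensive) SQL execution.
    executions.append(payload["sql"])
    return f"rows for: {payload['sql']}"


cache = WorkflowCache()
query = {"sql": "SELECT SUM(amount) FROM sales WHERE region = 'Europe'"}
results = [cache.get_or_compute("sql_execution", query, execute_sql)
           for _ in range(5)]
```

Five identical executions collapse into one real call and four lookups. A real shared layer would also need an invalidation rule for results that depend on mutable data.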
Limitations:
- Prototype Stage: Currently remains a proof-of-concept, requiring more comprehensive implementation and evaluation
- Scenario Constraints: Primarily exemplified by NL2SQL, requiring validation across more agentic workflows
- Complexity Management: How to design interfaces that allow workflows to declare their malleability remains an open challenge
Strengths:
- Strong Innovation: First to propose workflow-aware agentic serving architecture
- Accurate Problem Identification: Precisely identifies key issues in existing LLM serving platforms
- Simple and Effective Solution: Stage isolation strategy is straightforward yet impactful
- Forward-Looking: Provides clear development trajectory for future agent-native serving
Weaknesses:
- Limited Experimental Validation: Primarily based on a single NL2SQL scenario, lacking large-scale diverse experiments
- Insufficient Quantitative Results: Charts display trends but lack specific performance improvement figures
- Insufficient Implementation Details: Descriptions of scheduling algorithms and resource allocation strategies lack specificity
- Incomplete Comparative Experiments: Primarily compared against simple shared pool schemes, lacking comparison with other advanced methods
Impact:
- Academic Value: Provides new research directions for the agentic serving field
- Practical Value: Addresses important problems in actual production environments
- Inspirational Value: Provides valuable insights for subsequent related research
Applicable scenarios:
- Multi-Stage Agentic Workflows: Particularly suitable for agentic applications with clear stage divisions
- Resource-Constrained Environments: Demonstrates significant effects in environments with limited resources such as GPU memory
- High-Performance Requirement Scenarios: Production environments with strict latency and throughput requirements
The paper cites the following key literature:
- vLLM: PagedAttention memory management mechanism
- SGLang: Structured language model program execution
- Autellix: LLM agentic serving engine
- HEXGEN-TEXT2SQL: Agentic workflow scheduling
- Related NL2SQL and cloud services literature
Overall Assessment: This is an innovative and forward-looking paper that identifies important problems in the agentic serving field and provides effective solutions. Although currently in the prototype stage, it charts a clear direction for field development and possesses significant academic and practical value.