We introduce Cortex, a prototype workflow-aware serving platform designed for agentic workloads. The core principle of Cortex is stage isolation: it provisions dedicated resource pools for each distinct stage of an agentic workflow. This simple yet powerful strategy mitigates inter-stage interference in compute and memory, leading to better KV cache utilization, higher throughput, and more predictable performance. By customizing resource allocation and scheduling within each distinct stage of agentic workflows, Cortex lays the groundwork for more advanced, agent-native serving paradigms, including malleable resource management, speculative execution of workflow branches, and a shared, multi-tiered cache for "agentic state."
- Paper ID: 2510.14126
- Title: Cortex: Workflow-Aware Resource Pooling and Scheduling for Agentic Serving
- Authors: Nikos Pagonas (Columbia University), Yeounoh Chung (Google), Kostis Kaffes (Columbia University), Arvind Krishnamurthy (Google & University of Washington)
- Category: cs.DC (Distributed, Parallel, and Cluster Computing)
- Published: October 15, 2025 (arXiv preprint)
- Paper link: https://arxiv.org/abs/2510.14126
Agentic workflows combine large language model (LLM) inference with iterative tool use: the model observes intermediate results, reasons, invokes another tool, and repeats until the task is solved or the budget is exhausted. This closed-loop pattern is increasingly important in production-grade applications such as natural-language-to-SQL (NL2SQL) agents.
Current LLM serving platforms have the following problems:
- Workflow-agnostic serving: general-purpose LLM serving frameworks (such as vLLM) treat each stage as an independent LLM call and use first-come, first-served (FCFS) scheduling.
- Lack of structural awareness: existing agentic serving platforms (such as Autellix) use sophisticated priority strategies but do not understand the internal workflow structure.
- Wasted caching opportunities: five refinement attempts on the same pattern produce five identical prompt constructions and five identical hot-cache SQL executions.
- Scheduling blindness: LLM calls are scheduled without knowledge of the remaining workflow, so downstream cost is ignored.
The authors observe that a single shared, general-purpose LLM engine pool is a poor fit for agentic workflows with heterogeneous stages. Each stage (SQL generation, execution, error correction) has its own latency profile, memory requirements, and caching opportunities.
The paper's main contributions are:
- Cortex architecture: the first workflow-aware serving platform built on stage isolation, providing a dedicated engine pool for each workflow stage.
- Substantial KV cache optimization: stage isolation significantly reduces KV cache memory usage and improves GPU memory utilization.
- Elimination of inter-stage interference: stable stage-local latency models are restored, improving performance predictability.
- Design for agent-native serving: a foundation for malleable workflows, speculative execution, and agentic state management.
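Taking the NL2SQL workflow as an example, the input is a natural-language query (e.g., "What were sales in Europe last quarter?") and the output is the result of a successfully executed SQL query. The workflow consists of:
- Retrieving the target schema
- Autoregressively generating a candidate query
- Executing the query
- Validating the result set
- Correcting and retrying if the query fails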
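The five steps above can be sketched as a small closed loop. This is a minimal illustration rather than the paper's implementation: `run_nl2sql`, `fake_generator`, and `fake_fixer` are hypothetical names, the two LLM stages are replaced by stub functions, SQLite stands in for the target database, and "validation" is assumed to mean a non-empty result set.

```python
import sqlite3
from typing import Callable, List, Optional, Tuple


def run_nl2sql(
    question: str,
    conn: sqlite3.Connection,
    generate_sql: Callable[[str, str], str],
    fix_sql: Callable[[str, str, str], str],
    max_attempts: int = 3,
) -> Optional[Tuple[str, List[tuple]]]:
    # Stage 1: retrieve the target schema (here, the CREATE TABLE statements).
    schema = "\n".join(
        row[0]
        for row in conn.execute("SELECT sql FROM sqlite_master WHERE type = 'table'")
    )
    # Stage 2: generate a candidate query (an LLM call in a real system).
    sql = generate_sql(question, schema)
    for _ in range(max_attempts):
        try:
            # Stage 3: execute the query.
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error as err:
            # Stage 5: execution failed, so request a correction and retry.
            sql = fix_sql(sql, str(err), schema)
            continue
        # Stage 4: validate the result set (assumed check: it must be non-empty).
        if rows:
            return sql, rows
        sql = fix_sql(sql, "empty result set", schema)
    return None  # Budget exhausted without a valid result.


# Hypothetical stand-ins for the two LLM stages.
def fake_generator(question: str, schema: str) -> str:
    return "SELECT SUM(amout) FROM sales WHERE region = 'Europe'"  # deliberate typo


def fake_fixer(sql: str, error: str, schema: str) -> str:
    return sql.replace("amout", "amount")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("Europe", 10.0), ("Europe", 20.0), ("Asia", 5.0)],
)
result = run_nl2sql(
    "What were sales in Europe last quarter?", conn, fake_generator, fake_fixer
)
```

In this run the first candidate fails on the misspelled column, the fixer stage corrects it, and the second execution succeeds. The loop shows why generation, execution, and correction behave as separate stages with different resource needs.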
Cortex provisions a dedicated engine pool for each workflow stage. An engine pool is a group of homogeneous workers (e.g., GPUs for LLM decoding or CPU executors for SQL execution) managed by a stage-local scheduler with its own queue, cache, and scaling policy.
- Orchestrator:
  - Is workflow-aware and tracks where each request is in the graph
  - Predicts the next set of eligible operators
  - Attaches a priority key based on SLO slack, stage selectivity, and expected service time
- Engine Allocation Layer:
  - Routes each sub-call to the concrete pool instance that maximizes locality
  - Balances load across replicas
  - Reorders requests by priority
  - Applies admission control when a stage becomes a bottleneck
- Resource borrowing: when load and memory pressure are sufficiently low, the orchestrator can opportunistically let compatible stages borrow idle engines to shorten queues and improve utilization.
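A rough sketch of how an orchestrator might attach priority keys and hand requests to stage-local queues follows. The summary does not say how SLO slack, stage selectivity, and expected service time are combined, so this example assumes a simple least-slack-first key and leaves selectivity out. `StagePool`, `Orchestrator`, and `priority_key` are hypothetical names.

```python
import heapq
import itertools
from dataclasses import dataclass, field


def priority_key(deadline_s: float, now_s: float, expected_service_s: float) -> float:
    # Assumed formula: remaining slack before the SLO deadline is missed.
    # A smaller value means a more urgent request.
    return deadline_s - now_s - expected_service_s


@dataclass
class StagePool:
    """One dedicated pool per workflow stage, with its own priority queue."""
    name: str
    _heap: list = field(default_factory=list)
    _tie: itertools.count = field(default_factory=itertools.count)

    def submit(self, request_id: str, key: float) -> None:
        # The counter breaks ties so equal keys are served in arrival order.
        heapq.heappush(self._heap, (key, next(self._tie), request_id))

    def next_request(self):
        return heapq.heappop(self._heap)[2] if self._heap else None


class Orchestrator:
    def __init__(self, stages):
        self.pools = {stage: StagePool(stage) for stage in stages}
        self.position = {}  # request id -> current stage in the workflow graph

    def dispatch(self, request_id, stage, deadline_s, now_s, expected_service_s):
        self.position[request_id] = stage
        key = priority_key(deadline_s, now_s, expected_service_s)
        self.pools[stage].submit(request_id, key)


orch = Orchestrator(["generate", "execute", "fix"])
orch.dispatch("a", "generate", deadline_s=10.0, now_s=0.0, expected_service_s=2.0)
orch.dispatch("b", "generate", deadline_s=5.0, now_s=0.0, expected_service_s=1.0)
orch.dispatch("c", "fix", deadline_s=3.0, now_s=0.0, expected_service_s=1.0)
first = orch.pools["generate"].next_request()
second = orch.pools["generate"].next_request()
```

Request `b` has less slack than `a`, so the generation pool serves it first. Request `c` waits only in the error-correction pool and never competes with generation traffic.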
With stage isolation, each engine holds only the context specific to its stage. A shared engine, by contrast, must keep the contexts of both stages hot in cache on every replica, effectively duplicating KV cache memory usage. The GPU memory saved raises the effective batch size, which translates directly into higher throughput and tighter tail latency.
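A back-of-the-envelope calculation illustrates the duplication argument. Every number below is an assumption chosen for illustration (model shape, context lengths, replica counts); none comes from the paper. The per-token size uses the standard accounting of one key and one value vector per layer and KV head.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    # One key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128, fp16.
per_token = kv_bytes_per_token(32, 8, 128, 2)

# Hypothetical hot context (in tokens) that each LLM stage keeps cached.
context_tokens = {"sql_generator": 4000, "sql_fixer": 3000}

# Shared pool: each of 4 replicas keeps BOTH stage contexts hot.
shared_replicas = 4
shared_bytes = shared_replicas * sum(context_tokens.values()) * per_token

# Isolated pools: the same 4 engines, split 2 + 2, each holding one context.
isolated_replicas = {"sql_generator": 2, "sql_fixer": 2}
isolated_bytes = sum(
    isolated_replicas[stage] * context_tokens[stage] * per_token
    for stage in context_tokens
)

print(f"per token: {per_token} B")
print(f"shared:    {shared_bytes / 1e9:.2f} GB")
print(f"isolated:  {isolated_bytes / 1e9:.2f} GB")
```

Under these assumptions the isolated layout holds half as much hot context as the shared one. The freed memory is what can be spent on larger batches.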
Stage isolation eliminates the inter-stage interference that undermines predictability. When heterogeneous calls share an engine, batching couples their execution times, delays token emission, and makes the latency of an LLM call depend on its batch mates.
Stage isolation also enables independent scaling and configuration: a fast monitor scales only the pools that threaten the SLO, and one-shot execution stages can run with lightweight configurations while critical-path pools receive more weight.
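As a hedged sketch of such per-pool scaling, the monitor below inspects recent latencies for each pool and proposes one extra replica only for pools whose p95 latency exceeds a fraction of their SLO. The p95 rule, the `headroom` parameter, and the names `PoolStats` and `scale_decisions` are assumptions made for this example, not details from the paper.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PoolStats:
    replicas: int
    latencies_s: List[float]  # recent per-request latencies for this pool
    slo_s: float              # latency target for this stage


def p95(samples: List[float]) -> float:
    # Nearest-rank 95th percentile.
    ordered = sorted(samples)
    idx = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return ordered[idx]


def scale_decisions(pools: Dict[str, PoolStats], headroom: float = 0.9) -> Dict[str, int]:
    """Return new replica counts only for pools that threaten their SLO."""
    decisions = {}
    for name, stats in pools.items():
        if stats.latencies_s and p95(stats.latencies_s) > headroom * stats.slo_s:
            decisions[name] = stats.replicas + 1
    return decisions


pools = {
    # Half of the generation requests are slow, so this pool is at risk.
    "generate": PoolStats(replicas=2, latencies_s=[0.5] * 10 + [2.5] * 10, slo_s=2.0),
    # The SQL execution pool is comfortably within its target.
    "execute": PoolStats(replicas=1, latencies_s=[0.05] * 20, slo_s=1.0),
}
decisions = scale_decisions(pools)
```

Only the generation pool is scaled here. With a single shared pool the same signal would force every engine to scale together, regardless of which stage caused the pressure.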
The paper uses the NL2SQL workflow as its primary experimental scenario. The workflow has two LLM stages and one non-LLM stage:
- SQL generator
- SQL error corrector
- SQL executor (non-LLM stage)
The evaluation reports the following metrics:
- KV cache memory usage
- Total memory footprint
- System throughput
- Tail latency
Two configurations are compared:
- Shared engine pool: all stages share the same set of LLM engines
- Cortex stage isolation: each stage uses a dedicated engine pool
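The experimental results show that running the LLM stages of the NL2SQL workflow on Cortex significantly reduces total KV cache occupancy. When each stage runs in its own Cortex pool, the total KV footprint is clearly lower, because each engine holds only the context specific to its stage.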
- Memory efficiency: stage isolation avoids duplicating the KV cache and frees scarce GPU memory
- Throughput: the GPU memory saved translates directly into a larger effective batch size
- Latency: tighter tail latency and more predictable performance
The experiments validate three main benefits of Cortex:
- Better KV cache utilization: a significant reduction in memory occupancy
- Elimination of inter-stage interference: stable stage-local latency models are restored
- Independent scaling: support for fine-grained resource management
Related work includes:
- vLLM: efficient large language model serving, with memory management based on PagedAttention
- SGLang: efficient execution of structured language model programs
- Autellix: an efficient serving engine for LLM agents that uses sophisticated priority strategies
- HEXGEN-TEXT2SQL: scheduling of NL2SQL workflow requests based on remaining time slack and estimated execution time
Existing platforms lack awareness of the internal workflow structure. Cortex fills this gap with stage isolation.
Through a simple and effective stage isolation strategy, Cortex significantly improves serving performance for agentic workloads. The approach improves resource efficiency and also lays the groundwork for more advanced agent-native serving paradigms.
Malleable resource management:
- Compute adaptivity: replace a heavyweight model with a lightweight variant when latency approaches the SLO boundary
- Resource elasticity: use more powerful engines to help stragglers in fan-out patterns
Speculative execution:
- Speculating on the most likely branch in the workflow
- Pre-warming the relevant engines or executing the next step ahead of time
- Generating and evaluating multiple candidate queries in parallel
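Shared, multi-tiered cache for agentic state:
- A multi-tiered "agentic state" that treats intermediate data as a first-class citizen
- A workflow-scoped shared layer, structured as publish/subscribe
- Repeated tool and LLM calls turned into zero-cost hits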
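The agentic-state idea in the bullets above can be illustrated with a minimal single-tier, in-memory sketch. The paper describes this only as a future direction, so the `AgenticState` class, its key scheme, and its publish/subscribe hooks are all hypothetical. The multi-tier aspect is omitted.

```python
import hashlib
import json
from collections import defaultdict


class AgenticState:
    """Workflow-scoped store that memoizes tool and LLM call results."""

    def __init__(self):
        self._store = {}
        self._subscribers = defaultdict(list)
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(op: str, payload: dict) -> str:
        # Canonical JSON so that equal payloads always map to the same key.
        blob = json.dumps({"op": op, "payload": payload}, sort_keys=True)
        return hashlib.sha256(blob.encode("utf-8")).hexdigest()

    def subscribe(self, op: str, callback) -> None:
        self._subscribers[op].append(callback)

    def get_or_compute(self, op: str, payload: dict, compute):
        k = self.key(op, payload)
        if k in self._store:
            self.hits += 1  # repeated call served from the shared state
            return self._store[k]
        self.misses += 1
        value = compute(payload)
        self._store[k] = value
        # Publish the new intermediate result to interested stages.
        for callback in self._subscribers[op]:
            callback(payload, value)
        return value


state = AgenticState()
built, published = [], []
state.subscribe("build_prompt", lambda payload, value: published.append(value))


def build_prompt(payload: dict) -> str:
    built.append(payload)  # record how often the real work actually runs
    return f"Schema: {payload['schema']}\nQuestion: {payload['question']}"


payload = {"schema": "sales(region, amount)", "question": "Sales in Europe?"}
# Five refinement attempts that all request the same prompt.
prompts = [state.get_or_compute("build_prompt", payload, build_prompt) for _ in range(5)]
```

In this run the prompt is built once and the remaining four requests are cache hits. That is the behavior the "five identical prompt constructions" problem calls for.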
Limitations:
- Prototype stage: the system is still a proof of concept and needs a fuller implementation and evaluation
- Limited scenarios: NL2SQL is the main example, so validation on more agentic workflows is needed
- Complexity management: how to design an interface through which workflows can declare their malleability remains an open problem
Strengths:
- Highly novel: the first proposal of a workflow-aware agentic serving architecture
- Well-framed problem: accurately identifies key shortcomings of existing LLM serving platforms
- Simple and effective solution: the stage isolation strategy is simple, yet its effect is significant
- Forward-looking: offers a clear path toward future agent-native serving
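Weaknesses:
- Limited experimental validation: mainly based on the NL2SQL scenario, without large-scale or diverse experiments
- Insufficient quantitative results: the graphs show trends, but concrete performance gains are not reported
- Insufficient implementation detail: the scheduling algorithm and resource allocation strategy are described only briefly
- Insufficient comparisons: the main baseline is a simple shared-pool setup, with no comparison against other state-of-the-art methods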
Impact:
- Academic value: opens a new research direction in agentic serving
- Practical value: addresses an important problem in real production environments
- Inspiration: offers valuable insights for follow-up research
Applicable scenarios:
- Multi-stage agentic workflows: especially suited to agentic applications with clearly separated stages
- Resource-constrained environments: significant benefit where resources such as GPU memory are limited
- High-performance scenarios: production environments with strict latency and throughput requirements
The paper cites the following key references:
- vLLM: PagedAttention memory management
- SGLang: execution of structured language model programs
- Autellix: a serving engine for LLM agents
- HEXGEN-TEXT2SQL: agentic workflow scheduling
- Related NL2SQL and cloud services literature
Overall assessment: this is a novel and forward-looking paper that identifies an important problem in agentic serving and offers an effective solution. Although still at the prototype stage, it sets a direction for the field and has high academic and practical value.