Classifier-Augmented Generation for Structured Workflow Prediction

Gschwind, Chakraborty, Gupta et al.

ETL (Extract, Transform, Load) tools such as IBM DataStage allow users to visually assemble complex data workflows, but configuring stages and their properties remains time-consuming and requires deep tool knowledge. We propose a system that translates natural language descriptions into executable workflows, automatically predicting both the structure and detailed configuration of the flow. At its core lies a Classifier-Augmented Generation (CAG) approach that combines utterance decomposition with a classifier and stage-specific few-shot prompting to produce accurate stage predictions. These stages are then connected into non-linear workflows using edge prediction, and stage properties are inferred from sub-utterance context. We compare CAG against strong single-prompt and agentic baselines, showing improved accuracy and efficiency, while substantially reducing token usage. Our architecture is modular, interpretable, and capable of end-to-end workflow generation, including robust validation steps. To our knowledge, this is the first system with a detailed evaluation across stage prediction, edge layout, and property generation for natural-language-driven ETL authoring.

Basic Information

Paper ID: 2510.12825
Title: Classifier-Augmented Generation for Structured Workflow Prediction
Authors: Thomas Gschwind, Shramona Chakraborty, Nitin Gupta, and Sameep Mehta (IBM Research)
Categories: cs.CL cs.AI cs.DB cs.LG
Published: October 10, 2025 (arXiv preprint)
Paper link: https://arxiv.org/abs/2510.12825

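Summary

ETL (Extract, Transform, Load) tools such as IBM DataStage let users visually assemble complex data workflows, but configuring stages and their properties remains time-consuming and requires deep tool knowledge. This paper proposes a system that translates natural language descriptions into executable workflows, automatically predicting both the structure and the detailed configuration of the flow. At its core is the Classifier-Augmented Generation (CAG) approach, which combines utterance decomposition with a classifier and stage-specific few-shot prompting to produce accurate stage predictions. The stages are connected into non-linear workflows through edge prediction, and stage properties are inferred from sub-utterance context. Compared with strong baselines, CAG shows higher accuracy and efficiency while substantially reducing token usage.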

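Research Background and Motivation

Problem Definition

Core problem: The configuration complexity of ETL tools is a barrier to use. Even expert users must manually configure transformation stages and specify dozens of low-level properties per stage, which makes authoring tedious and error-prone.
Importance: ETL and ELT workflows underpin modern enterprise data integration and analytics pipelines, yet traditional graphical interfaces still demand substantial manual configuration.
Limitations of existing approaches:
- Early approaches addressed the challenge through custom scripts or GUI-based simplifications
- Some work explored semantic and ontology-driven ETL generation
- No end-to-end system takes natural language to an executable workflow
Research motivation: Advances in large language models open a new opportunity to synthesize workflows directly from natural language, reducing configuration overhead and improving accessibility.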

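Core Contributions

Proposed the Classifier-Augmented Generation (CAG) approach: combines utterance decomposition, classification-based stage retrieval, and few-shot prompting to predict the sequence of workflow stages
Built an end-to-end workflow generation system: comprises three core modules for stage prediction, edge prediction, and property prediction
Achieved significant performance gains: over 97% accuracy on stage prediction while cutting token usage by more than 60%
Provided a modular and interpretable architecture: supports robust validation and constraint checking
Completed production deployment: the system has been integrated into the IBM DataStage production tool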
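
The CAG flow described above (decompose the utterance, classify each sub-utterance into a stage type, then prompt with examples for that stage only) can be illustrated with a small sketch. This is not the authors' implementation: the stage names, keyword lists, few-shot examples, and all function names below are hypothetical, a keyword matcher stands in for the trained classifier, no LLM is called, and stages are linked in a simple chain, whereas the paper predicts non-linear edges.

```python
import re
from dataclasses import dataclass

# Hypothetical keyword lists standing in for a trained stage classifier.
STAGE_KEYWORDS = {
    "source": ["read", "extract"],
    "filter": ["filter", "where"],
    "join": ["join", "merge"],
    "sort": ["sort", "order by"],
    "target": ["write", "save"],
}

# Hypothetical stage-specific few-shot examples (input text, expected output).
FEW_SHOT = {
    "source": [("read orders.csv", '{"stage": "source", "path": "orders.csv"}')],
    "filter": [("keep rows where age > 30", '{"stage": "filter", "where": "age > 30"}')],
    "join": [("join a and b on id", '{"stage": "join", "key": "id"}')],
    "sort": [("sort by date", '{"stage": "sort", "key": "date"}')],
    "target": [("write to table t", '{"stage": "target", "table": "t"}')],
}


@dataclass
class Stage:
    kind: str           # predicted stage type
    sub_utterance: str  # the text fragment this stage came from
    prompt: str         # few-shot prompt that would be sent to an LLM


def decompose(utterance: str) -> list[str]:
    """Split an utterance into sub-utterances on ';' and 'then'."""
    parts = re.split(r"\s*(?:;|,?\s*\bthen\b)\s*", utterance)
    return [p.strip() for p in parts if p.strip()]


def classify(sub_utterance: str) -> str:
    """Pick the stage type whose keywords match most often (toy classifier)."""
    text = sub_utterance.lower()
    scores = {
        kind: sum(bool(re.search(rf"\b{re.escape(kw)}\b", text)) for kw in kws)
        for kind, kws in STAGE_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def build_prompt(kind: str, sub_utterance: str) -> str:
    """Assemble a prompt that carries only the examples for the predicted stage."""
    blocks = [f"Input: {q}\nOutput: {a}" for q, a in FEW_SHOT.get(kind, [])]
    blocks.append(f"Input: {sub_utterance}\nOutput:")
    return "\n\n".join(blocks)


def predict_workflow(utterance: str) -> tuple[list[Stage], list[tuple[int, int]]]:
    """Return predicted stages plus edges; edges form a linear chain here."""
    stages = []
    for sub in decompose(utterance):
        kind = classify(sub)
        stages.append(Stage(kind, sub, build_prompt(kind, sub)))
    edges = [(i, i + 1) for i in range(len(stages) - 1)]
    return stages, edges
```

Because each prompt carries only the examples for one predicted stage type, it stays short. That is one plausible reading of how a classifier-first design could reduce token usage compared with a single prompt that covers every stage type.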