Classifier-Augmented Generation for Structured Workflow Prediction

Gschwind, Chakraborty, Gupta et al.

ETL (Extract, Transform, Load) tools such as IBM DataStage allow users to visually assemble complex data workflows, but configuring stages and their properties remains time consuming and requires deep tool knowledge. We propose a system that translates natural language descriptions into executable workflows, automatically predicting both the structure and detailed configuration of the flow. At its core lies a Classifier-Augmented Generation (CAG) approach that combines utterance decomposition with a classifier and stage-specific few-shot prompting to produce accurate stage predictions. These stages are then connected into non-linear workflows using edge prediction, and stage properties are inferred from sub-utterance context. We compare CAG against strong single-prompt and agentic baselines, showing improved accuracy and efficiency, while substantially reducing token usage. Our architecture is modular, interpretable, and capable of end-to-end workflow generation, including robust validation steps. To our knowledge, this is the first system with a detailed evaluation across stage prediction, edge layout, and property generation for natural-language-driven ETL authoring.

Basic Information

Paper ID: 2510.12825
Title: Classifier-Augmented Generation for Structured Workflow Prediction
Authors: Thomas Gschwind, Shramona Chakraborty, Nitin Gupta, and Sameep Mehta (IBM Research)
Classification: cs.CL cs.AI cs.DB cs.LG
Publication Date: October 10, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.12825

Abstract

ETL (Extract, Transform, Load) tools such as IBM DataStage allow users to visually assemble complex data workflows, but configuring stages and their properties remains time-consuming and requires deep tool knowledge. This paper proposes a system that converts natural language descriptions into executable workflows, automatically predicting both the structure and the detailed configuration of the flow. At its core is the Classifier-Augmented Generation (CAG) method, which combines utterance decomposition with a classifier and stage-specific few-shot prompting to produce accurate stage predictions. These stages are connected into non-linear workflows through edge prediction, and stage properties are inferred from sub-utterance context. Compared with strong single-prompt and agentic baselines, CAG achieves higher accuracy and efficiency while substantially reducing token usage.

Research Background and Motivation

Problem Definition

Core Problem: The configuration complexity of ETL tools hinders user adoption. Even expert users must manually configure transformation stages and specify dozens of low-level properties per stage, making the authoring process tedious and error-prone.
Significance: ETL and ELT workflows form the foundation of modern enterprise data integration and analytics pipelines, yet traditional graphical interfaces still require substantial manual configuration work.
Limitations of Existing Approaches:
- Early methods addressed challenges through custom scripts or simplified GUI-based approaches
- Some explored semantic and ontology-driven ETL generation
- Lack of end-to-end systems for natural language to executable workflow conversion
Research Motivation: Advances in large language models provide new opportunities for directly synthesizing workflows from natural language, potentially reducing configuration overhead and improving accessibility.

Core Contributions

Proposed the Classifier-Augmented Generation (CAG) method: Combines utterance decomposition, classifier-based stage retrieval, and few-shot prompting to predict workflow stage sequences
Developed an end-to-end workflow generation system: Comprising three core modules for stage prediction, edge prediction, and property prediction
Achieved significant performance improvements: Attaining over 97% stage prediction accuracy with LLaMA-3.3-70B and LLaMA-4-17B while reducing token usage by over 60% relative to the single-prompt baseline
Provided a modular and interpretable architecture: Supporting robust validation and constraint checking
Completed production environment deployment: The system has been integrated into the IBM DataStage production tool

Methodology Details

Task Definition

Input: Natural language description of ETL workflow requirements
Output: Complete executable DataStage workflow, including:

Workflow stage sequence
Connections between stages (edges)
Detailed property configuration for each stage
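
The three output components can be pictured as a small graph data structure. The following is a minimal sketch only: the class names, stage type names, and property names are hypothetical placeholders and do not reflect the actual DataStage flow format.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    name: str        # unique node name, e.g. "filter_1" (hypothetical naming)
    stage_type: str  # kind of stage, e.g. "filter" (hypothetical type name)
    properties: dict[str, str] = field(default_factory=dict)


@dataclass
class Workflow:
    stages: list[Stage] = field(default_factory=list)
    # Each edge is a (source stage name, target stage name) pair.
    edges: list[tuple[str, str]] = field(default_factory=list)

    def dangling_edges(self) -> list[tuple[str, str]]:
        """Return edges that reference a stage name not present in the flow."""
        known = {s.name for s in self.stages}
        return [(a, b) for a, b in self.edges if a not in known or b not in known]


# Illustrative linear flow; file names and property keys are made up.
flow = Workflow(
    stages=[
        Stage("source_1", "sequential_file", {"file": "customers.csv"}),
        Stage("filter_1", "filter", {"where": "country = 'DE'"}),
        Stage("target_1", "sequential_file", {"file": "german_customers.csv"}),
    ],
    edges=[("source_1", "filter_1"), ("filter_1", "target_1")],
)
print(flow.dangling_edges())
```

Keeping edges as name pairs rather than object references makes it straightforward to check a predicted edge list against the predicted stage list before any flow is built.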

Model Architecture

1. Stage Prediction

The CAG method comprises the following steps:

Utterance Decomposition: Decomposes user input into sub-utterances describing individual stages
Classifier Retrieval: Uses a trained classification model to identify candidate stages
Keyword Matching: Scans user utterances for stage names and their synonyms
Targeted Generation: Builds a prompt containing descriptions and few-shot examples for the candidate stages only, from which the LLM makes the final multi-label prediction
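
The four steps above can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the stage catalogue, synonyms, few-shot examples, and thresholds are invented, `decompose` is a naive regex split, and `toy_classifier` is a word-overlap stand-in for the trained classifier.

```python
import re

# Hypothetical catalogue: stage name -> synonyms used for keyword matching.
STAGE_SYNONYMS = {
    "filter": ["filter", "keep only"],
    "sort": ["sort", "order by"],
    "remove_duplicates": ["remove duplicates", "deduplicate"],
    "join": ["join", "merge"],
}
# Hypothetical few-shot example utterance per stage.
FEW_SHOT = {
    "filter": "Keep only rows with status = 'active'",
    "sort": "Order the rows by last name",
    "remove_duplicates": "Drop repeated customer records",
    "join": "Merge orders with customers on customer_id",
}


def decompose(utterance):
    """Naive decomposition: split on 'then' or ';' (stand-in for the real step)."""
    parts = re.split(r"\bthen\b|;", utterance, flags=re.IGNORECASE)
    return [p.strip(" ,") for p in parts if p.strip(" ,")]


def toy_classifier(sub_utterance):
    """Stand-in for a trained classifier: word overlap with the stage name."""
    words = set(re.findall(r"[a-z]+", sub_utterance.lower()))
    return {s: len(words & set(s.split("_"))) / len(s.split("_")) for s in STAGE_SYNONYMS}


def keyword_candidates(sub_utterance):
    text = sub_utterance.lower()
    return {s for s, syns in STAGE_SYNONYMS.items() if any(k in text for k in syns)}


def candidate_stages(sub_utterance, classifier=toy_classifier, top_k=2, min_score=0.5):
    """Union of the classifier's top-k confident stages and keyword matches."""
    scores = classifier(sub_utterance)
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_k]
    from_classifier = {s for s in ranked if scores[s] >= min_score}
    return sorted(from_classifier | keyword_candidates(sub_utterance))


def build_prompt(sub_utterance, candidates):
    """Prompt listing only the candidate stages, each with one example."""
    lines = ["Select every stage needed for the request. Candidate stages:"]
    lines += [f'- {c} (e.g. "{FEW_SHOT[c]}")' for c in candidates]
    lines.append(f"Request: {sub_utterance}")
    return "\n".join(lines)


parts = decompose("Filter rows where age > 30, then sort by name; finally deduplicate the result")
for part in parts:
    print(build_prompt(part, candidate_stages(part)), end="\n\n")
```

In this toy run the stand-in classifier gives no score to "deduplicate", and only the keyword pass recovers `remove_duplicates`, which illustrates why the two retrieval paths are combined before prompting.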

2. Edge Prediction

Handles non-linear workflow structures:

Assigns unique names to repeated stages
Segments utterances into sub-utterances based on predicted stages
Predicts flow structure based on node lists and original utterances
Validates edge counts against cardinality constraints
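
Two of these steps, unique naming and cardinality validation, are mechanical and can be sketched without an LLM. The constraint table below is hypothetical (the summary does not list the actual DataStage limits), and the predicted edge list is assumed to arrive as (source, target) name pairs.

```python
from collections import Counter

# Hypothetical constraints: stage type -> (min_in, max_in, min_out, max_out).
# None means "no upper bound".
CARDINALITY = {
    "source": (0, 0, 1, None),
    "filter": (1, 1, 1, None),
    "join": (2, None, 1, 1),
    "target": (1, 1, 0, 0),
}


def unique_names(stage_types):
    """Name repeated stages type_1, type_2, ... in order of appearance."""
    seen = Counter()
    names = []
    for stage_type in stage_types:
        seen[stage_type] += 1
        names.append(f"{stage_type}_{seen[stage_type]}")
    return names


def cardinality_violations(names, edges):
    """List stages whose in/out edge counts fall outside their constraints."""
    in_deg = Counter(dst for _, dst in edges)
    out_deg = Counter(src for src, _ in edges)
    problems = []
    for name in names:
        stage_type = name.rsplit("_", 1)[0]
        min_in, max_in, min_out, max_out = CARDINALITY[stage_type]
        checks = (("inputs", in_deg[name], min_in, max_in),
                  ("outputs", out_deg[name], min_out, max_out))
        for label, degree, low, high in checks:
            if degree < low or (high is not None and degree > high):
                problems.append(f"{name}: {degree} {label}")
    return problems


names = unique_names(["source", "source", "join", "filter", "target"])
good_edges = [("source_1", "join_1"), ("source_2", "join_1"),
              ("join_1", "filter_1"), ("filter_1", "target_1")]
bad_edges = [e for e in good_edges if e != ("source_2", "join_1")]

print(cardinality_violations(names, good_edges))
print(cardinality_violations(names, bad_edges))
```

A non-empty violation list could be used to reject a predicted layout or to trigger a retry; which of these the system does is not stated in this summary.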

3. Property Prediction

Predicts specific configurations for each stage:

Uses stage-specific sub-utterances to avoid ambiguity
Prompts include the task instruction, the sub-utterance, the stage name, the property list, and examples
Multi-dimensional validation strategy ensures property correctness
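
A minimal sketch of how such a prompt might be assembled and how model output could be checked against a stage's property list. The schema, property names, and the two checks shown (unknown property, disallowed value) are assumptions for illustration; the summary does not spell out the actual validation rules.

```python
import json

# Hypothetical schema for one stage: property -> allowed values (None = free text).
FILTER_SCHEMA = {"where": None, "reject_unmatched": {"true", "false"}}


def build_property_prompt(stage_name, sub_utterance, schema, examples):
    """Assemble instruction, stage name, property list, examples, and sub-utterance."""
    lines = [
        "Fill in the properties of the stage below. Answer with a JSON object.",
        f"Stage: {stage_name}",
        "Properties: " + ", ".join(schema),
    ]
    lines += [f"Example: {utterance} -> {json.dumps(props)}" for utterance, props in examples]
    lines.append(f"Request: {sub_utterance}")
    return "\n".join(lines)


def validate_properties(raw_json, schema):
    """Parse model output; return (accepted properties, list of error messages)."""
    try:
        predicted = json.loads(raw_json)
    except json.JSONDecodeError:
        return {}, ["output is not valid JSON"]
    if not isinstance(predicted, dict):
        return {}, ["output is not a JSON object"]
    valid, errors = {}, []
    for key, value in predicted.items():
        if key not in schema:
            errors.append(f"unknown property: {key}")
        elif schema[key] is not None and str(value).lower() not in schema[key]:
            errors.append(f"invalid value for {key}: {value}")
        else:
            valid[key] = value
    return valid, errors


prompt = build_property_prompt(
    "filter", "keep customers older than 30", FILTER_SCHEMA,
    examples=[("keep active rows", {"where": "status = 'active'"})],
)
# A made-up model response containing one bad value and one unknown property.
raw = '{"where": "age > 30", "reject_unmatched": "maybe", "colour": "red"}'
valid, errors = validate_properties(raw, FILTER_SCHEMA)
print(valid)
print(errors)
```

Because the prompt only carries the sub-utterance for one stage, a property such as `where` cannot accidentally pick up a condition meant for a different stage in the same request.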

Technical Innovations

Hybrid Retrieval-Generation Architecture: Combines fast classifiers with LLM generation, balancing efficiency and accuracy
Hierarchical Validation Mechanism: Performs constraint checking and consistency validation at multiple levels
Modular Design: Individual components can be independently optimized and debugged
Context Localization: Reduces the context the LLM must process by segmenting the utterance into sub-utterances

Experimental Setup

Dataset

Stage Prediction: 1,010 natural language workflow descriptions
Property Prediction: 308 workflows containing 1,410 properties
Edge Prediction: 54 complex non-linear workflows (6-14 stages)
Classifier Training: 2,697 single-label (utterance, operator) pairs covering 138 semantic labels

Evaluation Metrics

Stage Prediction: Accuracy (overall, single-operation, multi-operation)
Edge Prediction: Structural similarity, exact match rate
Property Prediction: Precision, recall, F1 score
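
For concreteness, the property metrics and the edge exact-match check can be computed as below. This sketch assumes a prediction counts as correct only when both property name and value match the reference, and that edges are compared as unordered sets; the summary does not state how the paper scores partial matches or how structural similarity is defined, so the latter is omitted.

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall, and F1 over (property, value) pairs."""
    pred_items, gold_items = set(predicted.items()), set(gold.items())
    true_positives = len(pred_items & gold_items)
    precision = true_positives / len(pred_items) if pred_items else 0.0
    recall = true_positives / len(gold_items) if gold_items else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def edge_exact_match(pred_edges, gold_edges):
    """True only if the predicted edge set equals the reference edge set."""
    return set(pred_edges) == set(gold_edges)


# Made-up example: one of two predicted properties is right, three are expected.
gold = {"where": "age > 30", "reject": "false", "mode": "strict"}
predicted = {"where": "age > 30", "reject": "true"}
p, r, f1 = precision_recall_f1(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")  # precision=0.50 recall=0.33 f1=0.40

print(edge_exact_match([("a", "b"), ("b", "c")], [("b", "c"), ("a", "b")]))  # True
```

Exact match is an all-or-nothing measure, so a single wrong edge in a 14-stage flow scores the same as a completely wrong layout, which is one reason it is reported alongside a similarity score.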

Baseline Methods

Single-prompt: Presents all 142 stages in a single prompt
Agentic: ReAct-style agent approach where LLM autonomously decomposes utterances and invokes classification tools
CAG: The proposed classifier-augmented generation method

Implementation Details

Models: LLaMA-3.2-3B, Granite-3.1-8B, LLaMA-3.3-70B, LLaMA-4-17B
Classifiers: RoBERTa-large and IBM slate-125m-english-rtrvr
Token Usage: CAG approximately 4,000-4,700 tokens vs. Single-prompt approximately 14,000 tokens
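
As a quick arithmetic check, the approximate token counts above imply the following relative savings. The function name is illustrative, and the inputs are the rounded figures quoted in this summary rather than exact measurements.

```python
def token_reduction(baseline_tokens, method_tokens):
    """Fraction of tokens saved relative to the baseline."""
    return 1 - method_tokens / baseline_tokens


low = token_reduction(14_000, 4_700)   # CAG at the upper end of its range
high = token_reduction(14_000, 4_000)  # CAG at the lower end of its range
print(f"{low:.1%} to {high:.1%}")  # 66.4% to 71.4%
```

The lower bound of this range is in line with the "over 60%" and "66%" reductions quoted elsewhere in this summary.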

Experimental Results

Main Results

Stage Prediction Accuracy Comparison

Method	LLaMA-3.2-3B	Granite-3.1-8B	LLaMA-3.3-70B	LLaMA-4-17B
Single-prompt	71.1%	88.0%	96.4%	95.8%
Agentic	33.4%	45.6%	69.3%	40.0%
CAG	90.1%	94.0%	97.2%	97.7%

Edge Prediction Results (54 non-linear workflows)

Structural Similarity: 73% (LLaMA-3.3-70B)
Exact Match: 37% (LLaMA-3.3-70B)

Property Prediction Results (F1 Score)

LLaMA-3.2-3B: 0.79
Granite-3.3-8B: 0.81
LLaMA-3.3-70B: 0.86
LLaMA-4-17B: 0.78

Ablation Studies

Classifier Contribution: Candidate stage filtering significantly improves accuracy
Keyword Matching: Reduces mispredictions on obvious utterances
Few-shot Examples: Targeted examples improve discrimination between similar stages

Case Analysis

Failure Case: For the utterance "Split the full_name field...then capitalize the first letter...", most models return only the split_subrecord stage while missing the modify stage, as the classifier incorrectly maps "capitalize" to the head stage.

Experimental Findings

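Model Scale Effect: Larger models generally perform better, with LLaMA-3.3-70B strongest on property prediction (F1 0.86); LLaMA-4-17B is an exception on property prediction (F1 0.78) and in the agentic setting (40.0%)
Efficiency Gains: CAG reduces token usage by roughly 66% relative to the single-prompt baseline while improving accuracy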
Edge Prediction Challenge: Complex non-linear structure prediction remains the most challenging task

Related Work

Main Research Directions

AI-Driven Workflow Generation: Commercial tools like Zap builder and Power Automate
Application Integration Workflows: GOFA creates application integration workflows from natural language
Query Execution Workflows: Ad-hoc execution tools such as FlowMind and AutoFlow
SQL Generation: Natural language to SQL conversion tools like Analyza

Advantages of This Work

First natural-language-driven ETL authoring system with a detailed evaluation of stage prediction, edge layout, and property generation
Generates reusable general-purpose workflows rather than ad-hoc executions
Complete end-to-end solution including detailed property configuration

Conclusions and Discussion

Main Conclusions

The CAG method outperforms both the single-prompt and agentic baselines on stage prediction for all four evaluated models
The modular architecture supports transparent reasoning and robust validation
The system has been successfully deployed to production environments, validating its practicality and scalability

Limitations

Classifier Limitations: Trained only on single-label data, potentially missing relevant candidate stages
Edge Prediction Challenges: Exact edge matching achieves only 37%, requiring user revision
Validation Logic: Assumes that table and column names in the utterance are correct, or that any errors in them are negligible; there is no fuzzy matching
Prompt Portability: Optimized for specific model families, potentially affecting cross-architecture generalization

Future Directions

Explore hybrid architectures combining graph neural networks to improve edge prediction
Develop multi-label classifiers to enhance candidate stage identification
Enhance validation logic to support fuzzy matching and error correction
Extend to other ETL platforms and domains

In-Depth Evaluation

Strengths

Methodological Innovation: The CAG method cleverly combines the advantages of classification and generation, maintaining high accuracy while improving efficiency
Comprehensive Experiments: Covers the complete workflow generation process with detailed evaluation of stage, edge, and property prediction
Practical Value: System deployment in production environment demonstrates real-world applicability
Clear Writing: Well-structured paper with accurate technical descriptions

Weaknesses

Dataset Scale: Relatively small evaluation datasets, in particular only 54 non-linear workflow samples
Domain Specificity: Primarily targets IBM DataStage platform; generalization capability requires verification
Edge Prediction Performance: 37% exact match rate indicates this module requires significant improvement
Limited Error Analysis: Analysis of failure cases is relatively limited

Impact

Academic Contribution: First systematic solution to complete natural language to ETL workflow conversion
Industrial Value: Provides a viable technical pathway for ETL tool intelligence
Reproducibility: Provides detailed implementation details and prompt templates

Applicable Scenarios

Enterprise Data Integration: Simplifies ETL workflow creation and configuration
Data Science Tools: Provides more user-friendly data processing interfaces for non-professionals
Low-code/No-code Platforms: Can be integrated as an intelligent component in visual development environments

References

This paper cites important works in related fields, including:

ETL technology surveys (Rahm and Do, 2000; Vassiliadis, 2009)
Large language model few-shot learning (Brown et al., 2020)
ReAct agent methods (Yao et al., 2023)
Tool learning research (Schick et al., 2023; Qin et al., 2024)

Overall Assessment: This is a high-quality applied research paper that proposes an innovative CAG method to solve practical problems and validates its effectiveness in production environments. While there is room for improvement in certain technical details, it makes important contributions to the field of natural language-driven workflow generation.