ETL (Extract, Transform, Load) tools such as IBM DataStage allow users to visually assemble complex data workflows, but configuring stages and their properties remains time consuming and requires deep tool knowledge. We propose a system that translates natural language descriptions into executable workflows, automatically predicting both the structure and detailed configuration of the flow. At its core lies a Classifier-Augmented Generation (CAG) approach that combines utterance decomposition with a classifier and stage-specific few-shot prompting to produce accurate stage predictions. These stages are then connected into non-linear workflows using edge prediction, and stage properties are inferred from sub-utterance context. We compare CAG against strong single-prompt and agentic baselines, showing improved accuracy and efficiency, while substantially reducing token usage. Our architecture is modular, interpretable, and capable of end-to-end workflow generation, including robust validation steps. To our knowledge, this is the first system with a detailed evaluation across stage prediction, edge layout, and property generation for natural-language-driven ETL authoring.
- Paper ID: 2510.12825
- Title: Classifier-Augmented Generation for Structured Workflow Prediction
- Authors: Thomas Gschwind, Shramona Chakraborty, Nitin Gupta, and Sameep Mehta (IBM Research)
- Classification: cs.CL cs.AI cs.DB cs.LG
- Publication Date: October 10, 2025 (arXiv preprint)
- Paper Link: https://arxiv.org/abs/2510.12825
ETL (Extract, Transform, Load) tools such as IBM DataStage allow users to visually assemble complex data workflows, but configuring stages and their properties remains time-consuming and requires deep tool knowledge. This paper proposes a system that converts natural language descriptions into executable workflows, automatically predicting both the structure and the detailed configuration of each flow. At its core is the Classifier-Augmented Generation (CAG) method, which combines utterance decomposition with a classifier and stage-specific few-shot prompting to produce accurate stage predictions. The predicted stages are connected into non-linear workflows through edge prediction, and stage properties are inferred from sub-utterance context. Compared with strong single-prompt and agentic baselines, CAG achieves higher accuracy and efficiency while substantially reducing token usage.
- Core Problem: The configuration complexity of ETL tools hinders user adoption. Even expert users must manually configure transformation stages and specify dozens of low-level attributes per stage, making the authoring process tedious and error-prone.
- Significance: ETL and ELT workflows form the foundation of modern enterprise data integration and analytics pipelines, yet traditional graphical interfaces still require substantial manual configuration work.
- Limitations of Existing Approaches:
  - Early methods relied on custom scripts or simplified GUI-based approaches
  - Some work explored semantic and ontology-driven ETL generation
  - No end-to-end system converts natural language into executable workflows
- Research Motivation: Advances in large language models provide new opportunities for directly synthesizing workflows from natural language, potentially reducing configuration overhead and improving accessibility.
- Proposed the Classifier-Augmented Generation (CAG) method: Combines utterance decomposition, classifier-based stage retrieval, and few-shot prompting to predict workflow stage sequences
- Developed an end-to-end workflow generation system: Comprising three core modules for stage prediction, edge prediction, and property prediction
- Achieved significant performance improvements: Up to 97.7% stage prediction accuracy (above 97% with LLaMA-3.3-70B and LLaMA-4-17B) while reducing token usage by over 60% relative to the single-prompt baseline
- Provided a modular and interpretable architecture: Supporting robust validation and constraint checking
- Completed production environment deployment: The system has been integrated into the IBM DataStage production tool
Input: Natural language description of ETL workflow requirements
Output: Complete executable DataStage workflow, including:
- Workflow stage sequence
- Connections between stages (edges)
- Detailed property configuration for each stage
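The three parts of the output can be pictured as a small graph structure. The sketch below is a hypothetical in-memory representation, not the DataStage flow format: the `Stage` and `Workflow` classes, the field names, and the `field` property are all assumptions made for illustration. Only the stage names `split_subrecord` and `modify` come from the paper summary.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    name: str                 # unique node name within the workflow
    stage_type: str           # kind of stage, e.g. "modify"
    properties: dict = field(default_factory=dict)  # per-stage configuration


@dataclass
class Workflow:
    stages: list              # list of Stage objects
    edges: list               # list of (source_name, target_name) pairs

    def validate(self):
        """Return a list of structural problems (empty if none are found)."""
        names = [s.name for s in self.stages]
        problems = []
        if len(set(names)) != len(names):
            problems.append("duplicate stage names")
        known = set(names)
        for src, dst in self.edges:
            if src not in known or dst not in known:
                problems.append(f"edge {src}->{dst} references an unknown stage")
        return problems


# Hypothetical two-stage flow; the property name "field" is illustrative only.
flow = Workflow(
    stages=[
        Stage("split_subrecord", "split_subrecord", {"field": "full_name"}),
        Stage("modify", "modify"),
    ],
    edges=[("split_subrecord", "modify")],
)
print(flow.validate())
```

Keeping stages, edges, and properties in separate fields mirrors the three prediction modules, so each module can fill in its own part of the structure.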
The CAG method comprises the following steps:
- Utterance Decomposition: Decomposes user input into sub-utterances describing individual stages
- Classifier Retrieval: Uses a trained classification model to identify candidate stages
- Keyword Matching: Scans user utterances for stage names and their synonyms
- Targeted Generation: Generates targeted descriptions and few-shot examples based on candidate stages for final multi-label prediction by the LLM
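The four steps above can be sketched as a candidate-selection pipeline followed by prompt assembly. This is a minimal illustration under heavy assumptions: the three-entry `CATALOG`, the rule-based `decompose`, and the word-overlap `classify` are stand-ins invented here (the paper uses an LLM for decomposition and a trained classifier for retrieval), and the final LLM call is omitted.

```python
import re

# Hypothetical stage catalog: descriptions, synonyms, and few-shot examples.
CATALOG = {
    "sort": {
        "description": "Orders rows by one or more key columns.",
        "synonyms": ["sort", "order by"],
        "examples": [("Order customers by age", "sort")],
    },
    "filter": {
        "description": "Keeps only the rows that satisfy a condition.",
        "synonyms": ["filter", "keep only"],
        "examples": [("Keep only active accounts", "filter")],
    },
    "join": {
        "description": "Combines two inputs on matching key columns.",
        "synonyms": ["join", "combine"],
        "examples": [("Combine orders with customers", "join")],
    },
}


def tokens(text):
    return set(re.findall(r"[a-z_]+", text.lower()))


def decompose(utterance):
    """Stand-in decomposition: split on 'then' and sentence punctuation."""
    parts = re.split(r"\bthen\b|[.;]", utterance, flags=re.IGNORECASE)
    return [p.strip(" ,") for p in parts if p.strip(" ,")]


def classify(sub_utterance, top_k=2):
    """Stand-in for the trained classifier: rank stages by word overlap."""
    words = tokens(sub_utterance)
    scores = {
        stage: len(words & tokens(info["description"] + " " + " ".join(info["synonyms"])))
        for stage, info in CATALOG.items()
    }
    ranked = sorted(scores, key=lambda s: (-scores[s], s))
    return [s for s in ranked[:top_k] if scores[s] > 0]


def keyword_match(utterance):
    """Scan the full utterance for stage names and their synonyms."""
    text = utterance.lower()
    return [
        stage for stage, info in CATALOG.items()
        if any(re.search(r"\b" + re.escape(syn) + r"\b", text) for syn in info["synonyms"])
    ]


def candidate_stages(utterance):
    """Union of classifier and keyword candidates, in first-seen order."""
    found = []
    for sub in decompose(utterance):
        found.extend(classify(sub))
    found.extend(keyword_match(utterance))
    return list(dict.fromkeys(found))


def build_prompt(utterance, candidates):
    """Describe only the candidate stages, with their few-shot examples."""
    lines = ["Select every stage needed for the request. Candidate stages:"]
    for stage in candidates:
        info = CATALOG[stage]
        lines.append(f"- {stage}: {info['description']}")
        for text, label in info["examples"]:
            lines.append(f"  Example: {text!r} -> {label}")
    lines.append(f"Request: {utterance}")
    return "\n".join(lines)


utterance = "Filter rows where age > 30, then sort by name."
print(build_prompt(utterance, candidate_stages(utterance)))
```

Because the prompt lists only the candidate stages rather than the whole catalog, it stays short, which is the mechanism behind the token savings reported for CAG.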
Edge prediction handles non-linear workflow structures:
- Assigns unique names to repeated stages
- Segments utterances into sub-utterances based on predicted stages
- Predicts flow structure based on node lists and original utterances
- Validates edge counts against cardinality constraints
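Two of these steps are deterministic and easy to illustrate: naming repeated stages and checking edge counts. The sketch below assumes a hypothetical `CARDINALITY` table and generic stage types (`source`, `join`, `sort`, `target`); the actual constraints used by the system are not given in this summary.

```python
from collections import Counter

# Hypothetical constraints: (min_in, max_in, min_out, max_out); None = unbounded.
CARDINALITY = {
    "source": (0, 0, 1, None),
    "join": (2, None, 1, 1),
    "sort": (1, 1, 1, 1),
    "target": (1, 1, 0, 0),
}


def assign_unique_names(stage_types):
    """Suffix repeated stage types with a running index; leave unique ones as-is."""
    totals = Counter(stage_types)
    seen = Counter()
    names = []
    for stage_type in stage_types:
        seen[stage_type] += 1
        if totals[stage_type] > 1:
            names.append(f"{stage_type}_{seen[stage_type]}")
        else:
            names.append(stage_type)
    return names


def validate_edges(nodes, edges):
    """Check in/out degree of every node against CARDINALITY.

    nodes: dict mapping unique node name -> stage type
    edges: list of (source_name, target_name) pairs
    """
    in_degree = Counter(dst for _, dst in edges)
    out_degree = Counter(src for src, _ in edges)
    violations = []
    for name, stage_type in nodes.items():
        min_in, max_in, min_out, max_out = CARDINALITY[stage_type]
        n_in, n_out = in_degree[name], out_degree[name]
        if n_in < min_in or (max_in is not None and n_in > max_in):
            violations.append(f"{name}: {n_in} incoming edge(s)")
        if n_out < min_out or (max_out is not None and n_out > max_out):
            violations.append(f"{name}: {n_out} outgoing edge(s)")
    return violations


types = ["source", "source", "join", "sort", "target"]
names = assign_unique_names(types)
nodes = dict(zip(names, types))
edges = [("source_1", "join"), ("source_2", "join"), ("join", "sort"), ("sort", "target")]
print(names)
print(validate_edges(nodes, edges))
```

A failed check of this kind could be used to reject or re-request an LLM-predicted edge list before the flow is shown to the user.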
Property prediction infers the configuration of each stage:
- Uses stage-specific sub-utterances to avoid ambiguity
- Includes task instructions, sub-utterances, stage names, property lists, and examples
- Applies a multi-dimensional validation strategy to check the predicted properties
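A per-stage prompt and a simple property check might look like the following. Everything here is assumed for illustration: the `SORT_SCHEMA` properties, the prompt wording, and the three checks (unknown name, wrong type or value, missing required property) are not the paper's actual schema or validation rules.

```python
# Hypothetical property schema for a single stage type.
SORT_SCHEMA = {
    "key": {"type": str, "required": True},
    "order": {"type": str, "required": False, "allowed": {"ascending", "descending"}},
}


def build_property_prompt(stage_name, sub_utterance, schema, examples):
    """Assemble task instruction, stage name, property list, examples, and sub-utterance."""
    lines = [
        "Task: fill in the properties of the stage from the description.",
        f"Stage: {stage_name}",
        "Properties: " + ", ".join(schema),
    ]
    for text, props in examples:
        lines.append(f"Example: {text!r} -> {props}")
    lines.append(f"Description: {sub_utterance}")
    return "\n".join(lines)


def validate_properties(predicted, schema):
    """Return a list of problems found in a predicted property dict."""
    errors = []
    for name, value in predicted.items():
        spec = schema.get(name)
        if spec is None:
            errors.append(f"unknown property: {name}")
        elif not isinstance(value, spec["type"]):
            errors.append(f"wrong type for {name}: {type(value).__name__}")
        elif "allowed" in spec and value not in spec["allowed"]:
            errors.append(f"invalid value for {name}: {value!r}")
    for name, spec in schema.items():
        if spec["required"] and name not in predicted:
            errors.append(f"missing required property: {name}")
    return errors


prompt = build_property_prompt(
    "sort",
    "sort by name in descending order",
    SORT_SCHEMA,
    [("order customers by age", {"key": "age", "order": "ascending"})],
)
print(prompt)
print(validate_properties({"key": "name", "order": "descending"}, SORT_SCHEMA))
```

Passing only the stage's own sub-utterance keeps properties that belong to other stages out of the prompt, which is the ambiguity the first bullet refers to.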
- Hybrid Retrieval-Generation Architecture: Combines fast classifiers with LLM generation, balancing efficiency and accuracy
- Hierarchical Validation Mechanism: Performs constraint checking and consistency validation at multiple levels
- Modular Design: Individual components can be independently optimized and debugged
- Context Localization: Reduces the context the LLM must process by segmenting the utterance into sub-utterances
- Stage Prediction: 1,010 natural language workflow descriptions
- Property Prediction: 308 workflows containing 1,410 properties
- Edge Prediction: 54 complex non-linear workflows (6-14 stages)
- Classifier Training: 2,697 single-label (utterance, operator) pairs covering 138 semantic labels
- Stage Prediction: Accuracy (overall, single-operation, multi-operation)
- Edge Prediction: Structural similarity, exact match rate
- Property Prediction: Precision, recall, F1 score
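The property and edge metrics can be computed with a few set operations. In this sketch, properties are compared as `(stage, property, value)` triples, which is an assumption about the matching granularity. The summary does not define structural similarity, so `edge_jaccard` uses Jaccard overlap of edge sets purely as a labeled stand-in, not as the paper's measure.

```python
def precision_recall_f1(predicted, gold):
    """Scores over sets of (stage, property, value) triples."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def edge_exact_match(predicted_edges, gold_edges):
    """True only if the predicted edge set equals the gold edge set."""
    return set(predicted_edges) == set(gold_edges)


def edge_jaccard(predicted_edges, gold_edges):
    """ASSUMED stand-in for structural similarity: Jaccard overlap of edge sets."""
    predicted, gold = set(predicted_edges), set(gold_edges)
    union = predicted | gold
    return len(predicted & gold) / len(union) if union else 1.0


gold_props = {("sort", "key", "name"), ("sort", "order", "descending"), ("filter", "where", "age > 30")}
pred_props = {("sort", "key", "name"), ("sort", "order", "ascending"), ("filter", "where", "age > 30")}
print(precision_recall_f1(pred_props, gold_props))

gold_edges = [("a", "b"), ("b", "c")]
pred_edges = [("a", "b"), ("a", "c")]
print(edge_exact_match(pred_edges, gold_edges), edge_jaccard(pred_edges, gold_edges))
```

Exact match is all-or-nothing, so a single misplaced edge in a 6-14 stage flow counts as a miss even when most of the structure is right. That helps explain why an exact-match figure can sit well below a similarity figure.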
- Single-prompt: Presents all 142 stages in a single prompt
- Agentic: ReAct-style agent approach in which the LLM autonomously decomposes utterances and invokes classification tools
- CAG: The proposed classifier-augmented generation method
- Models: LLaMA-3.2-3B, Granite-3.1-8B, LLaMA-3.3-70B, LLaMA-4-17B
- Classifiers: RoBERTa-large and IBM slate-125m-english-rtrvr
- Token Usage: CAG approximately 4,000-4,700 tokens vs. Single-prompt approximately 14,000 tokens
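As a quick arithmetic check, the token figures above translate into a relative reduction as follows. The helper name `reduction` is introduced here for illustration; the inputs are the approximate token counts quoted in the list.

```python
def reduction(baseline_tokens, method_tokens):
    """Fraction of tokens saved relative to the baseline."""
    return 1 - method_tokens / baseline_tokens


SINGLE_PROMPT_TOKENS = 14_000
for cag_tokens in (4_000, 4_700):
    saved = reduction(SINGLE_PROMPT_TOKENS, cag_tokens)
    print(f"{cag_tokens} tokens: {saved:.1%} fewer than single-prompt")
```

The upper end of the CAG range (4,700 tokens) gives a saving of about 66%, and the lower end (4,000 tokens) about 71%, so both fall above the 60% mark.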
| Method | LLaMA-3.2-3B | Granite-3.1-8B | LLaMA-3.3-70B | LLaMA-4-17B |
|---|---|---|---|---|
| Single-prompt | 71.1% | 88.0% | 96.4% | 95.8% |
| Agentic | 33.4% | 45.6% | 69.3% | 40.0% |
| CAG | 90.1% | 94.0% | 97.2% | 97.7% |
- Structural Similarity: 73% (LLaMA-3.3-70B)
- Exact Match: 37% (LLaMA-3.3-70B)
Property prediction scores by model:
- LLaMA-3.2-3B: 0.79
- Granite-3.3-8B: 0.81
- LLaMA-3.3-70B: 0.86
- LLaMA-4-17B: 0.78
- Classifier Contribution: Candidate stage filtering significantly improves accuracy
- Keyword Matching: Reduces mispredictions on obvious utterances
- Few-shot Examples: Targeted examples improve discrimination between similar stages
Failure Case: For the utterance "Split the full_name field...then capitalize the first letter...", most models return only the split_subrecord stage while missing the modify stage, as the classifier incorrectly maps "capitalize" to the head stage.
- Model Scale Effect: Larger models generally perform better, though the trend is not uniform across tasks
- Efficiency Gains: CAG reduces token usage by roughly 66% relative to the single-prompt baseline while improving accuracy
- Edge Prediction Challenge: Complex non-linear structure prediction remains the most challenging task
- AI-Driven Workflow Generation: Commercial tools like Zap builder and Power Automate
- Application Integration Workflows: GOFA creates application integration workflows from natural language
- Query Execution Workflows: Ad-hoc execution tools like FlowMind and AutoFlow
- SQL Generation: Natural language to SQL conversion tools like Analyza
- First natural-language-driven ETL authoring system with a detailed evaluation of stage prediction, edge layout, and property generation
- Generates reusable general-purpose workflows rather than ad-hoc executions
- Complete end-to-end solution including detailed property configuration
- The CAG method outperforms the single-prompt and agentic baselines on stage prediction across all four evaluated models
- The modular architecture supports transparent reasoning and robust validation
- The system has been successfully deployed to production environments, validating its practicality and scalability
- Classifier Limitations: Trained only on single-label data, potentially missing relevant candidate stages
- Edge Prediction Challenges: Exact edge matching achieves only 37%, requiring user revision
- Validation Logic: Assumes table and column names are correct, or that errors in them are negligible, and does not support fuzzy matching
- Prompt Portability: Optimized for specific model families, potentially affecting cross-architecture generalization
- Explore hybrid architectures combining graph neural networks to improve edge prediction
- Develop multi-label classifiers to enhance candidate stage identification
- Enhance validation logic to support fuzzy matching and error correction
- Extend to other ETL platforms and domains
- Methodological Innovation: The CAG method cleverly combines the advantages of classification and generation, maintaining high accuracy while improving efficiency
- Comprehensive Experiments: Covers the complete workflow generation process with detailed evaluation of stage, edge, and property prediction
- Practical Value: System deployment in production environment demonstrates real-world applicability
- Clear Writing: Well-structured paper with accurate technical descriptions
- Dataset Scale: Relatively small evaluation datasets, particularly only 54 non-linear process samples
- Domain Specificity: Primarily targets IBM DataStage platform; generalization capability requires verification
- Edge Prediction Performance: 37% exact match rate indicates this module requires significant improvement
- Limited Error Analysis: The discussion of failure cases is brief
- Academic Contribution: First systematic solution to complete natural language to ETL workflow conversion
- Industrial Value: Provides a viable technical pathway for adding natural language assistance to ETL tools
- Reproducibility: Provides detailed implementation details and prompt templates
- Enterprise Data Integration: Simplifies ETL workflow creation and configuration
- Data Science Tools: Provides more user-friendly data processing interfaces for non-professionals
- Low-code/No-code Platforms: Can be integrated as an intelligent component in visual development environments
This paper cites important works in related fields, including:
- ETL technology surveys (Rahm and Do, 2000; Vassiliadis, 2009)
- Large language model few-shot learning (Brown et al., 2020)
- ReAct agent methods (Yao et al., 2023)
- Tool learning research (Schick et al., 2023; Qin et al., 2024)
Overall Assessment: This is a high-quality applied research paper that proposes an innovative CAG method to solve practical problems and validates its effectiveness in production environments. While there is room for improvement in certain technical details, it makes important contributions to the field of natural language-driven workflow generation.