Large language models (LLMs) are increasingly powering Text-to-SQL (Text2SQL) systems, enabling non-expert users to query industrial databases using natural language. While test-time scaling strategies have shown promise in LLM-based solutions, their effectiveness in real-world applications, especially with the latest reasoning models, remains uncertain. In this work, we benchmark six lightweight, industry-oriented test-time scaling strategies and four LLMs, including two reasoning models, evaluating their performance on the BIRD Mini-Dev benchmark. Beyond standard accuracy metrics, we also report inference latency and token consumption, providing insights relevant for practical system deployment. Our findings reveal that Divide-and-Conquer prompting and few-shot demonstrations consistently enhance performance for both general-purpose and reasoning-focused LLMs. However, introducing additional workflow steps yields mixed results, and base model selection plays a critical role. This work sheds light on the practical trade-offs between accuracy, efficiency, and complexity when deploying Text2SQL systems.
- Paper ID: 2510.10885
- Title: Rethinking Agentic Workflows: Evaluating Inference-Based Test-Time Scaling Strategies in Text2SQL Tasks
- Authors: Jiajing Guo, Kenil Patel, Jorge Piazentin Ono, Wenbin He, Liu Ren (Bosch Research North America, USA)
- Classification: cs.CL (Computation and Language), cs.DB (Databases)
- Conference: Workshop on Test-time Scaling and Reasoning Models at COLM 2025
- Paper Link: https://arxiv.org/abs/2510.10885
Large Language Models (LLMs) are increasingly powering Text-to-SQL systems, enabling non-expert users to query industrial databases using natural language. While test-time scaling strategies have shown promise in LLM-based solutions, their effectiveness in practical applications, particularly with the latest reasoning models, remains uncertain. This study benchmarks six lightweight, industry-oriented test-time scaling strategies across four LLMs (including two reasoning models) and evaluates their performance on the BIRD Mini-Dev benchmark. Beyond standard accuracy metrics, inference latency and token consumption are reported, providing relevant insights for practical system deployment. The study finds that divide-and-conquer prompting and few-shot demonstrations consistently enhance performance for both general-purpose and reasoning-oriented LLMs. However, introducing additional workflow steps yields mixed results, with the choice of base model playing a critical role.
The core problem addressed by this research is: How effective are test-time scaling strategies for different types of LLMs in Text2SQL tasks, particularly regarding performance trade-offs in practical industrial application scenarios?
- Practical Value: Text2SQL systems enable non-technical users to access enterprise databases through natural language, offering significant commercial value
- Technical Challenge: With the emergence of reasoning models such as OpenAI o-series and Gemini 2.5, there is a need to reassess the necessity of traditional workflow engineering approaches
- Industrial Demand: Practical deployment requires balancing accuracy, latency, and complexity
- Existing research often focuses on complex agentic workflows that may be overly complicated for industrial applications
- Lack of systematic evaluation of reasoning models in Text2SQL tasks
- Few studies simultaneously consider accuracy and system performance metrics (e.g., latency, token consumption)
The authors pose three key questions:
- Given advances in reasoning models, does extensive prompting and workflow engineering remain valuable?
- Which test-time scaling strategies best balance accuracy and latency?
- How can workflows be optimized for industrial applications?
- Systematic Benchmarking: Comprehensive evaluation of six lightweight, industry-oriented agentic workflows across four LLMs (including general-purpose and reasoning models)
- Multi-dimensional Assessment: Beyond accuracy metrics, provides detailed analysis of inference latency and token consumption
- Practical Insights: Discovers that divide-and-conquer instructions and few-shot demonstrations yield significant improvements across all models
- Industrial Deployment Guidance: Provides actionable guidance on accuracy, efficiency, and complexity trade-offs for practical Text2SQL system deployment
The Text2SQL task aims to translate natural language questions into executable SQL queries. Input consists of natural language questions and database schemas, with output being the corresponding SQL query.
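To make the input/output contract concrete, here is a minimal sketch of how a question and a schema might be combined into a single prompt. It assumes a SQLite database; the function names `read_schema` and `build_prompt`, the `schools` table, and the prompt template are illustrative and are not taken from the paper.

```python
import sqlite3


def read_schema(conn: sqlite3.Connection) -> str:
    """Return the CREATE TABLE statements of a SQLite database, one per line."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master "
        "WHERE type = 'table' AND sql IS NOT NULL ORDER BY name"
    ).fetchall()
    return "\n".join(row[0] for row in rows)


def build_prompt(question: str, schema: str) -> str:
    """Combine schema and question into one prompt (hypothetical template)."""
    return (
        "Given the database schema below, write one SQLite query "
        "that answers the question.\n\n"
        f"Schema:\n{schema}\n\n"
        f"Question: {question}\nSQL:"
    )


# Toy database used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schools (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
prompt = build_prompt("How many schools are in Oakland?", read_schema(conn))
```

The model's completion after `SQL:` would then be treated as the predicted query.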
- Flow: SW > EX <> SR
- Description: Employs the ReAct agent's "thought-action-observation" cycle, iteratively revising the query when it hits an execution error or returns an empty result
- Flow: SW > EX <> SR
- Innovation: Decomposes complex questions into a series of smaller sub-problems, solves them sequentially, and combines responses
- Variants: Separately evaluates effectiveness with and without few-shot demonstrations
- Flow: (SW > EX <> SR) ∥ 5 > MV / CS
- Mechanism: Generates multiple candidate answers in parallel and selects the final answer by majority voting; if no majority emerges, a candidate selector agent chooses among them
- Flow: SW > EX <> SR <> FP
- Objective: Catches SQL queries that are syntactically correct but semantically wrong; a feedback provider judges whether the query needs further revision
- Flow: KE > (ER ∥ CR) > SW > EX <> SR
- Adapted from: CHESS method
- Steps:
- Keyword extractor identifies keywords in questions
- Entity retriever (LSH index-based) and column retriever (semantic similarity-based) run in parallel
- Retrieved information passed to SQL writer
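The write/execute/revise loop (SW > EX <> SR) and the majority-vote step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the `SqlWriter` callable stands in for an LLM call, `stub_writer` is a hard-coded placeholder, and voting over execution results (rather than over SQL strings) is an assumption.

```python
import sqlite3
from collections import Counter
from typing import Callable, Optional

# Hypothetical writer signature: (question, previous_sql, feedback) -> new SQL.
SqlWriter = Callable[[str, Optional[str], Optional[str]], str]


def execute(conn, sql):
    """Run a query and return (rows, error_message)."""
    try:
        return conn.execute(sql).fetchall(), None
    except sqlite3.Error as exc:
        return None, str(exc)


def write_execute_revise(conn, question, writer: SqlWriter, max_rounds=3):
    """Draft a query, then revise it on execution errors or empty results."""
    sql, feedback = None, None
    for _ in range(max_rounds):
        sql = writer(question, sql, feedback)
        rows, error = execute(conn, sql)
        if error is None and rows:
            return sql, rows
        feedback = error or "query returned no rows"
    return sql, None  # gave up; the last draft is returned without rows


def majority_vote(candidates):
    """Return the SQL whose result is shared by a strict majority, else None."""
    keyed = [(sql, tuple(sorted(rows, key=repr)))
             for sql, rows in candidates if rows is not None]
    if not keyed:
        return None
    top_key, count = Counter(key for _, key in keyed).most_common(1)[0]
    if count * 2 <= len(candidates):
        return None  # no strict majority: defer to a candidate selector
    return next(sql for sql, key in keyed if key == top_key)


def stub_writer(question, previous_sql, feedback):
    # Stand-in for an LLM: the first draft has a typo, the revision fixes it.
    if feedback is None:
        return "SELECT COUNT(*) FROM school"
    return "SELECT COUNT(*) FROM schools"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schools (name TEXT)")
conn.execute("INSERT INTO schools VALUES ('Lincoln High')")
final_sql, final_rows = write_execute_revise(conn, "How many schools?", stub_writer)
```

In a parallel-scaling setup, `write_execute_revise` would be run several times and the resulting `(sql, rows)` pairs passed to `majority_vote`; a `None` return is where a candidate selector would take over.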
- Lightweight Design: Focuses on industry-ready workflows rather than complex approaches from literature
- Multi-model Comparison: Simultaneously evaluates general-purpose models (Gemini 1.5 Flash, GPT-4o) and reasoning models (Gemini 2.5 Flash, o4-mini)
- Comprehensive Evaluation: Multi-dimensional assessment framework combining accuracy, latency, and resource consumption
- Name: BIRD Mini-Dev benchmark
- Scale: 500 question-SQL pairs
- Source: Derived subset from original BIRD Dev collection
- Characteristics: Includes complex cross-table queries and real-world database scenarios
- Soft F1-Score: Evaluates SQL query correctness by measuring similarity between tables generated by predicted and ground-truth queries
- Execution Accuracy (EX): Percentage of generated SQL queries producing identical results to ground truth
- Reward-based Valid Efficiency Score (R-VES): Quantifies model efficiency in generating correct and optimized SQL queries
- Execution Error Rate: Percentage of tasks encountering syntax execution errors in workflows
- Inference Time: Duration from receiving user question to generating SQL query (seconds)
- Number of LLM Calls: Average number of LLM calls used in workflows
- Token Count: Average prompt and completion tokens required to generate individual SQL queries (in thousands)
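As a rough illustration of the two result-based accuracy metrics, the sketch below compares the rows returned by a predicted query with those of the ground-truth query. The order-insensitive set comparison for EX and the cell-level F1 are simplifying assumptions; the official BIRD scorer computes Soft F1 differently, so `simple_soft_f1` should be read as an approximation of the idea only.

```python
from collections import Counter


def execution_match(pred_rows, gold_rows):
    """EX-style check: the two result sets are equal, ignoring row order."""
    return set(pred_rows) == set(gold_rows)


def simple_soft_f1(pred_rows, gold_rows):
    """Simplified F1 over the multiset of cell values (not the official scorer)."""
    pred = Counter(cell for row in pred_rows for cell in row)
    gold = Counter(cell for row in gold_rows for cell in row)
    overlap = sum((pred & gold).values())  # cells present in both results
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(gold.values())
    return 2 * precision * recall / (precision + recall)


# Toy results: one cell differs, so EX fails but the soft score gives partial credit.
pred_rows = [("a", 1), ("b", 2)]
gold_rows = [("a", 1), ("b", 3)]
exact = execution_match(pred_rows, gold_rows)
soft = simple_soft_f1(pred_rows, gold_rows)
```

The example shows why a soft score is reported alongside EX: a prediction that is nearly right scores zero on EX but retains partial credit under an F1-style measure.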
Four LLMs:
- Gemini 1.5 Flash (general-purpose model)
- Gemini 2.5 Flash (reasoning model)
- GPT-4o (general-purpose model)
- o4-mini (reasoning model)
- All workflows include syntax correction iterations
- Latency measurements affected by multiple factors (model region, network latency, server resources, etc.)
- BIRD Mini-Dev used for efficiency-conscious evaluation
- Key Finding: DC 3-shot+ReAct workflow consistently improves Soft-F1 scores across all models
- GPT-4o: Improved from baseline 61.1 to 64.4
- o4-mini: Improved from baseline 56.3 to 65.5
- Conclusion: Even specialized reasoning models benefit from explicit programmatic guidance
- Best Combination: Divide-and-Conquer + few-shot demonstrations + ReAct shows consistent improvements across all models
- Verification Method: Provides reliable performance improvements on most models
- Gemini 1.5 Flash: 62.58 → 63.63
- Gemini 2.5 Flash: 68.12 → 68.44
- GPT-4o: 64.44 → 64.95
- Retrieval-augmented Method: Generally underperforms, falling below DC 3-shot+ReAct on nearly all models
- Significant Latency Differences:
- Gemini Flash models: 5.02-12.03 seconds
- GPT-4o and o4-mini: 15.70-18.43 seconds
- Cost of Incorrect Answers: Incorrect answers take 19.58% longer to generate than correct ones
- Complexity Impact: More challenging questions require longer processing times, consume more tokens, and often yield lower accuracy
Through error analysis, the study discovers:
- Wrong Query Logic is the most common failure type across all methods and models
- Retrieval-augmented methods consistently exacerbate this problem
- Retrieval methods also increase the ratio of Schema Linking Errors
The paper conducts detailed error analysis, classifying failure cases using the o4-mini model. It finds that retrieval-augmented methods may deprive models of critical information in complex reasoning tasks, leading to performance degradation.
The paper systematically reviews existing Text2SQL agentic workflows, including:
- DIN-SQL's decomposed in-context learning
- MAC-SQL's multi-agent collaborative framework
- CHESS's contextual SQL synthesis
- R3's consensus multi-agent system
The review covers multiple strategies, including structured reasoning steps, parallel execution, verification, and result aggregation. These methods decompose query generation into modular steps arranged in sequential workflows.
- Importance of Base Model: Strong base models matter more than workflow complexity (Gemini 2.5 Flash baseline performance exceeds the most complex workflows of GPT-4o and Gemini 1.5 Flash)
- Universality of DC+Few-shot: Divide-and-conquer instructions and few-shot demonstrations yield significant improvements across all model types
- Diminishing Returns of Complexity: Increasing workflow complexity does not always yield better results
- Limited Evaluation Scope: Focuses only on lightweight workflows, potentially missing performance ceilings of more complex designs
- Single Dataset: Evaluated only on BIRD Mini-Dev, lacking broader validation
- Relative Nature of Performance Metrics: Reported latency and token consumption are affected by external factors and should be viewed as indicative rather than absolute
- Examine more complex workflow designs
- Validate findings on broader datasets
- Explore applicability of these strategies to other tasks
- Optimize product design to manage user expectations
- Practical Orientation: Focuses on industry-ready solutions considering real deployment constraints
- Multi-dimensional Assessment: Considers not only accuracy but also latency and resource consumption, providing comprehensive perspective for practical applications
- Systematic Comparison: Simultaneously evaluates general-purpose and reasoning models, providing valuable comparative insights
- Detailed Error Analysis: Deep understanding of failure patterns across different methods through error classification
- Limited Sample Size: Uses only 500 samples from BIRD Mini-Dev, potentially affecting generalizability of conclusions
- Incomplete Model Coverage: Lacks comparison with other mainstream models (e.g., Claude, LLaMA series)
- Conservative Workflow Design: Focus on lightweight methods may miss potential of more advanced techniques
- Absence of User Studies: No evaluation of real user experience
- Academic Contribution: Provides systematic benchmarking of test-time scaling strategies for Text2SQL domain
- Industrial Value: Offers practical guidance for enterprise deployment of Text2SQL systems
- Methodological Inspiration: Multi-dimensional evaluation framework applicable to industrialization of other NLP tasks
- Enterprise Database Querying: Suitable for enterprise environments requiring rapid deployment balancing accuracy and efficiency
- Prototype Development: Provides validated workflow patterns for rapid prototyping of Text2SQL systems
- Model Selection Guidance: Helps developers select appropriate base models and workflow strategies based on specific requirements
The paper cites important works in the Text2SQL domain, including:
- BIRD benchmark dataset (Li et al., 2023)
- DIN-SQL decomposition method (Pourreza & Rafiei, 2023)
- CHESS contextual synthesis (Talaei et al., 2024)
- ReAct reasoning framework (Yao et al., 2023)
- Chain-of-Thought prompting (Wei et al., 2022)
This research provides valuable empirical guidance for practical deployment of Text2SQL systems, particularly in balancing accuracy, efficiency, and complexity. Its findings are significant for advancing Text2SQL technology from research prototypes toward industrial applications.