2025-11-17T07:49:13.607812

Rethinking Agentic Workflows: Evaluating Inference-Based Test-Time Scaling Strategies in Text2SQL Tasks

Guo, Patel, Ono et al.

Large language models (LLMs) are increasingly powering Text-to-SQL (Text2SQL) systems, enabling non-expert users to query industrial databases using natural language. While test-time scaling strategies have shown promise in LLM-based solutions, their effectiveness in real-world applications, especially with the latest reasoning models, remains uncertain. In this work, we benchmark six lightweight, industry-oriented test-time scaling strategies and four LLMs, including two reasoning models, evaluating their performance on the BIRD Mini-Dev benchmark. Beyond standard accuracy metrics, we also report inference latency and token consumption, providing insights relevant for practical system deployment. Our findings reveal that Divide-and-Conquer prompting and few-shot demonstrations consistently enhance performance for both general-purpose and reasoning-focused LLMs. However, introducing additional workflow steps yields mixed results, and base model selection plays a critical role. This work sheds light on the practical trade-offs between accuracy, efficiency, and complexity when deploying Text2SQL systems.

Basic Information

Paper ID: 2510.10885
Title: Rethinking Agentic Workflows: Evaluating Inference-Based Test-Time Scaling Strategies in Text2SQL Tasks
Authors: Jiajing Guo, Kenil Patel, Jorge Piazentin Ono, Wenbin He, Liu Ren (Bosch Research North America, USA)
Classification: cs.CL (Computation and Language), cs.DB (Databases)
Conference: Workshop on Test-time Scaling and Reasoning Models at COLM 2025
Paper Link: https://arxiv.org/abs/2510.10885

Abstract

Large Language Models (LLMs) are increasingly powering Text-to-SQL systems, enabling non-expert users to query industrial databases using natural language. While test-time scaling strategies have shown promise in LLM-based solutions, their effectiveness in practical applications, particularly with the latest reasoning models, remains uncertain. This study benchmarks six lightweight, industry-oriented test-time scaling strategies across four LLMs (including two reasoning models) and evaluates their performance on the BIRD Mini-Dev benchmark. Beyond standard accuracy metrics, inference latency and token consumption are reported, providing relevant insights for practical system deployment. The study finds that divide-and-conquer prompting and few-shot demonstrations consistently enhance performance for both general-purpose and reasoning-oriented LLMs. However, introducing additional workflow steps yields mixed results, with the choice of base model playing a critical role.

Research Background and Motivation

Problem Definition

The core problem addressed by this research is: How effective are test-time scaling strategies for different types of LLMs in Text2SQL tasks, particularly regarding performance trade-offs in practical industrial application scenarios?

Research Significance

Practical Value: Text2SQL systems enable non-technical users to access enterprise databases through natural language, offering significant commercial value
Technical Challenge: With the emergence of reasoning models such as OpenAI o-series and Gemini 2.5, there is a need to reassess the necessity of traditional workflow engineering approaches
Industrial Demand: Practical deployment requires balancing accuracy, latency, and complexity

Limitations of Existing Approaches

Existing research often focuses on complex agentic workflows that may be overly complicated for industrial applications
Lack of systematic evaluation of reasoning models in Text2SQL tasks
Few studies simultaneously consider accuracy and system performance metrics (e.g., latency, token consumption)

Research Motivation

The authors pose three key questions:

Given advances in reasoning models, does extensive prompting and workflow engineering remain valuable?
Which test-time scaling strategies best balance accuracy and latency?
How can workflows be optimized for industrial applications?

Core Contributions

Systematic Benchmarking: Comprehensive evaluation of six lightweight, industry-oriented agentic workflows across four LLMs (including general-purpose and reasoning models)
Multi-dimensional Assessment: Beyond accuracy metrics, provides detailed analysis of inference latency and token consumption
Practical Insights: Discovers that divide-and-conquer instructions and few-shot demonstrations yield significant improvements across all models
Industrial Deployment Guidance: Provides actionable guidance on accuracy, efficiency, and complexity trade-offs for practical Text2SQL system deployment

Methodology Details

Task Definition

The Text2SQL task aims to translate natural language questions into executable SQL queries. Input consists of natural language questions and database schemas, with output being the corresponding SQL query.
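
To make the input/output contract concrete, here is a minimal sketch using Python's built-in sqlite3 module. The schema, rows, question, and predicted query are all invented for illustration and do not come from the paper or from BIRD.

```python
import sqlite3

# Hypothetical toy schema, data, and question (not from BIRD or the paper).
SCHEMA = "CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, dept TEXT, salary REAL);"
ROWS = [(1, "Ana", "Sales", 50000.0), (2, "Bo", "Sales", 70000.0), (3, "Cy", "HR", 40000.0)]
QUESTION = "What is the average salary in the Sales department?"

# The kind of SQL a Text2SQL system would be expected to return for QUESTION.
PREDICTED_SQL = "SELECT AVG(salary) FROM employees WHERE dept = 'Sales'"


def run_query(schema: str, rows: list[tuple], sql: str) -> list[tuple]:
    """Build an in-memory database from the schema and rows, then run the query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema)
        conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?)", rows)
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


result = run_query(SCHEMA, ROWS, PREDICTED_SQL)
print(result)  # [(60000.0,)]
```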

Six Agentic Workflows

1. CoT + ReAct (Baseline)

Flow: SW > EX <> SR
Description: Uses the ReAct agent's thought-action-observation cycle, iteratively refining the query whenever execution raises an error or returns an empty result
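
The refinement loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: `write_sql` and `refine_sql` are hypothetical callables standing in for LLM calls, the round limit of 3 is arbitrary, and SQLite is used only because it ships with Python.

```python
import sqlite3


def execute(conn: sqlite3.Connection, sql: str):
    """Run a query and return (rows, error); error is None on success."""
    try:
        return conn.execute(sql).fetchall(), None
    except sqlite3.Error as exc:
        return [], str(exc)


def write_execute_refine(conn, question, write_sql, refine_sql, max_rounds=3):
    """Write a query, execute it, and refine it on errors or empty results."""
    sql = write_sql(question)
    for _ in range(max_rounds):
        rows, error = execute(conn, sql)
        if error is None and rows:
            return sql, rows
        # Feed the failure back as the observation for the next attempt.
        observation = error or "query returned no rows"
        sql = refine_sql(question, sql, observation)
    rows, _ = execute(conn, sql)
    return sql, rows


# Demo with stub "models": the first draft has a typo, the refiner fixes it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('Ana')")


def stub_writer(question):
    return "SELECT nme FROM users"  # deliberately wrong column name


def stub_refiner(question, sql, observation):
    return "SELECT name FROM users"


final_sql, final_rows = write_execute_refine(
    conn, "List all user names", stub_writer, stub_refiner
)
print(final_sql, final_rows)
```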

2. Divide-and-Conquer (With/Without Few-shot)

Flow: SW > EX <> SR
Innovation: Decomposes complex questions into a series of smaller sub-problems, solves them sequentially, and combines responses
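Variants: Evaluated both without demonstrations and with few-shot demonstrations (the "DC 3-shot" configuration in the results); counting the two variants separately brings the total to six workflows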

3. Parallel Scaling

Flow: (SW > EX <> SR) ∥ 5 > MV / CS
Mechanism: Runs the baseline pipeline five times in parallel to generate candidate answers, then selects the final answer by majority voting (MV); if no candidate wins a majority, a candidate selector (CS) agent chooses among them
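
A minimal sketch of the selection step follows. It assumes that votes are cast over execution results rather than over SQL strings, and that "majority" means more than half of the candidates; both are assumptions of this sketch. The `fallback` callable is a hypothetical stand-in for the candidate selector agent.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def select_by_majority(
    candidates: Sequence[str],
    results: Sequence[Hashable],
    fallback: Callable[[Sequence[str]], str],
) -> str:
    """Return the first candidate whose result holds a strict majority,
    otherwise defer to the fallback selector."""
    counts = Counter(results)
    top_result, top_count = counts.most_common(1)[0]
    if top_count > len(candidates) / 2:
        return next(sql for sql, res in zip(candidates, results) if res == top_result)
    return fallback(candidates)


candidates = ["q1", "q2", "q3", "q4", "q5"]

# Execution results are frozen so they can be counted; three of five agree here.
agreeing = [frozenset({(3,)}), frozenset({(3,)}), frozenset({(4,)}),
            frozenset({(3,)}), frozenset({(5,)})]
# No result reaches three votes here, so the fallback is used.
split = [frozenset({(3,)}), frozenset({(3,)}), frozenset({(4,)}),
         frozenset({(4,)}), frozenset({(5,)})]


def stub_selector(cands):
    return cands[-1]  # placeholder for an LLM-based candidate selector


picked_majority = select_by_majority(candidates, agreeing, stub_selector)
picked_fallback = select_by_majority(candidates, split, stub_selector)
print(picked_majority, picked_fallback)  # q1 q5
```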

4. Result Verification

Flow: SW > EX <> SR <> FP
Objective: Catches SQL queries that are syntactically correct but semantically wrong; a feedback provider (FP) reviews the execution result and decides whether the query needs further refinement

5. Retrieval-based Structured Reasoning

Flow: KE > (ER ∥ CR) > SW > EX <> SR
Adapted from: CHESS method
Steps:
- Keyword extractor identifies keywords in questions
- Entity retriever (LSH index-based) and column retriever (semantic similarity-based) run in parallel
- Retrieved information passed to SQL writer
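
The shape of this pipeline can be sketched as below. The sketch keeps only the control flow (extract keywords, run two retrievers in parallel, hand the merged context to the SQL writer) and swaps in much simpler techniques: a stopword filter instead of an LLM keyword extractor, `difflib` fuzzy matching instead of an LSH index, and token overlap instead of embedding-based semantic similarity. All names and data are hypothetical.

```python
import difflib
import re
from concurrent.futures import ThreadPoolExecutor

STOPWORDS = {"the", "a", "an", "of", "in", "for", "what", "is", "are",
             "which", "how", "many", "with"}


def extract_keywords(text: str) -> list[str]:
    """Stand-in for the keyword extractor: lowercase tokens minus stopwords."""
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def retrieve_entities(keywords: list[str], db_values: list[str]) -> list[str]:
    """Stand-in for the LSH-based entity retriever: fuzzy string matching."""
    lowered = {v.lower(): v for v in db_values}
    hits = []
    for kw in keywords:
        for match in difflib.get_close_matches(kw, list(lowered), n=1, cutoff=0.8):
            hits.append(lowered[match])
    return hits


def retrieve_columns(keywords: list[str], column_docs: dict[str, str]) -> list[str]:
    """Stand-in for the semantic column retriever: rank by token overlap."""
    wanted = set(keywords)
    scored = []
    for column, description in column_docs.items():
        overlap = len(wanted & set(extract_keywords(description)))
        if overlap:
            scored.append((overlap, column))
    return [column for _, column in sorted(scored, reverse=True)]


def build_context(question: str, db_values: list[str], column_docs: dict[str, str]) -> dict:
    """Run both retrievers in parallel and merge their output for the SQL writer."""
    keywords = extract_keywords(question)
    with ThreadPoolExecutor(max_workers=2) as pool:
        entities = pool.submit(retrieve_entities, keywords, db_values)
        columns = pool.submit(retrieve_columns, keywords, column_docs)
        return {"keywords": keywords,
                "entities": entities.result(),
                "columns": columns.result()}


ctx = build_context(
    "How many schools are in Alameda county?",
    db_values=["Alameda", "Fresno"],
    column_docs={"schools.county": "name of the county",
                 "schools.enrollment": "number of enrolled students"},
)
print(ctx)
```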

Technical Innovations

Lightweight Design: Focuses on industry-ready workflows rather than complex approaches from literature
Multi-model Comparison: Simultaneously evaluates general-purpose models (Gemini 1.5 Flash, GPT-4o) and reasoning models (Gemini 2.5 Flash, o4-mini)
Comprehensive Evaluation: Multi-dimensional assessment framework combining accuracy, latency, and resource consumption

Experimental Setup

Dataset

Name: BIRD Mini-Dev benchmark
Scale: 500 question-SQL pairs
Source: Derived subset from original BIRD Dev collection
Characteristics: Includes complex cross-table queries and real-world database scenarios

Evaluation Metrics

Accuracy Metrics

Soft F1-Score: Evaluates SQL query correctness by measuring the similarity between the result tables produced by the predicted and ground-truth queries
Execution Accuracy (EX): Percentage of generated SQL queries producing identical results to ground truth
Reward-based Valid Efficiency Score (R-VES): Quantifies model efficiency in generating correct and optimized SQL queries
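
Of the three metrics, Execution Accuracy is the simplest to illustrate. The sketch below assumes that two queries match when their result rows are equal as sets, and that a prediction that fails to execute counts as wrong; the official BIRD scorer may differ in details. Soft F1 and R-VES are not reproduced here. The table and queries are invented.

```python
import sqlite3


def execution_match(conn: sqlite3.Connection, predicted_sql: str, gold_sql: str) -> bool:
    """True when both queries return the same set of rows (assumed comparison rule)."""
    try:
        predicted = set(conn.execute(predicted_sql).fetchall())
    except sqlite3.Error:
        return False  # a query that fails to execute counts as incorrect
    gold = set(conn.execute(gold_sql).fetchall())
    return predicted == gold


def execution_accuracy(conn: sqlite3.Connection, pairs: list[tuple[str, str]]) -> float:
    """Percentage of (predicted, gold) pairs whose results match."""
    matches = sum(execution_match(conn, pred, gold) for pred, gold in pairs)
    return 100.0 * matches / len(pairs)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER, price REAL)")
conn.executemany("INSERT INTO items VALUES (?, ?)", [(1, 10.0), (2, 20.0), (3, 30.0)])

pairs = [
    # Different SQL text, same result rows: counted as correct.
    ("SELECT id FROM items WHERE price >= 20", "SELECT id FROM items WHERE price > 15"),
    # Executes, but returns different rows: incorrect.
    ("SELECT id FROM items WHERE price > 25", "SELECT id FROM items WHERE price > 15"),
    # Syntax error: incorrect.
    ("SELEC id FROM items", "SELECT id FROM items"),
    ("SELECT COUNT(*) FROM items", "SELECT COUNT(id) FROM items"),
]

ex = execution_accuracy(conn, pairs)
print(ex)  # 50.0
```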

System Performance Metrics

Execution Error Rate: Percentage of tasks encountering syntax execution errors in workflows
Inference Time: Duration from receiving user question to generating SQL query (seconds)
Number of LLM Calls: Average number of LLM calls used in workflows
Token Count: Average prompt and completion tokens required to generate individual SQL queries (in thousands)
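
One way to aggregate these four metrics over a set of runs is sketched below. The `RunStats` record and its field names are hypothetical, and the sample numbers are invented to show the arithmetic; they are not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """Hypothetical per-question record collected while a workflow runs."""
    llm_calls: int
    prompt_tokens: int
    completion_tokens: int
    seconds: float
    execution_error: bool


def summarize(runs: list[RunStats]) -> dict[str, float]:
    """Average the per-question records into the four system metrics."""
    n = len(runs)
    total_tokens = sum(r.prompt_tokens + r.completion_tokens for r in runs)
    return {
        "execution_error_rate_pct": 100.0 * sum(r.execution_error for r in runs) / n,
        "avg_inference_seconds": sum(r.seconds for r in runs) / n,
        "avg_llm_calls": sum(r.llm_calls for r in runs) / n,
        "avg_tokens_k": total_tokens / n / 1000,  # reported in thousands
    }


# Invented sample data for two questions.
runs = [
    RunStats(llm_calls=2, prompt_tokens=3000, completion_tokens=500,
             seconds=6.0, execution_error=False),
    RunStats(llm_calls=4, prompt_tokens=7000, completion_tokens=1500,
             seconds=10.0, execution_error=True),
]

summary = summarize(runs)
print(summary)
```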

Comparison Methods

Four LLMs:

Gemini 1.5 Flash (general-purpose model)
Gemini 2.5 Flash (reasoning model)
GPT-4o (general-purpose model)
o4-mini (reasoning model)

Implementation Details

All workflows include syntax correction iterations
Latency measurements affected by multiple factors (model region, network latency, server resources, etc.)
BIRD Mini-Dev used for efficiency-conscious evaluation

Experimental Results

Main Findings

RQ1: Reasoning Models vs. General-Purpose Models Performance

Key Finding: The Divide-and-Conquer (DC) 3-shot+ReAct workflow consistently improves Soft F1 scores across all models
GPT-4o: Improved from baseline 61.1 to 64.4
o4-mini: Improved from baseline 56.3 to 65.5
Conclusion: Even specialized reasoning models benefit from explicit programmatic guidance
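
To put the two reported improvements on a common footing, the short calculation below derives the absolute and relative gains from the Soft F1 figures quoted above. Only those four numbers come from the summary; the helper name is arbitrary.

```python
# Soft F1 scores quoted above: (baseline, DC 3-shot+ReAct).
soft_f1 = {"GPT-4o": (61.1, 64.4), "o4-mini": (56.3, 65.5)}


def gain(baseline: float, improved: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent), rounded to 1 decimal."""
    delta = improved - baseline
    return round(delta, 1), round(100 * delta / baseline, 1)


gains = {model: gain(*scores) for model, scores in soft_f1.items()}
for model, (points, percent) in gains.items():
    print(f"{model}: +{points} points ({percent}% relative)")
# GPT-4o: +3.3 points (5.4% relative)
# o4-mini: +9.2 points (16.3% relative)
```

On these figures the reasoning model gains almost three times as many points as GPT-4o, which is consistent with the conclusion above.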

RQ2: Most Effective Scaling Methods

Best Combination: Divide-and-Conquer + few-shot demonstrations + ReAct shows consistent improvements across all models
Verification Method: Yields small but consistent gains (about 0.3 to 1.1 points) on three of the four models
- Gemini 1.5 Flash: 62.58 → 63.63
- Gemini 2.5 Flash: 68.12 → 68.44
- GPT-4o: 64.44 → 64.95
Retrieval-augmented Method: Generally underperforms, falling below DC 3-shot+ReAct on nearly all models

RQ3: Trade-offs Between Accuracy and System Performance

Significant Latency Differences:
- Gemini Flash models: 5.02-12.03 seconds
- GPT-4o and o4-mini: 15.70-18.43 seconds
Cost of Incorrect Answers: Incorrect answers take 19.58% longer to generate than correct ones
Complexity Impact: More challenging questions require longer processing times, consume more tokens, and often yield lower accuracy

Ablation Studies

Through error analysis, the study discovers:

Wrong Query Logic is the most common failure type across all methods and models
Retrieval-augmented methods consistently exacerbate this problem
Retrieval methods also increase the ratio of Schema Linking Errors

Case Analysis

The paper conducts detailed error analysis, classifying failure cases using the o4-mini model. It finds that retrieval-augmented methods may deprive models of critical information in complex reasoning tasks, leading to performance degradation.

Related Work

Text2SQL Agentic Workflows

The paper systematically reviews existing Text2SQL agentic workflows, including:

DIN-SQL's decomposed in-context learning
MAC-SQL's multi-agent collaborative framework
CHESS's contextual SQL synthesis
R3's consensus multi-agent system

Test-Time Scaling Strategies

Covers multiple strategies including structured reasoning steps, parallel execution, verification, and result aggregation. These methods decompose query generation into modular steps through sequential workflows.

Conclusions and Discussion

Main Conclusions

Importance of Base Model: Strong base models matter more than workflow complexity (Gemini 2.5 Flash baseline performance exceeds the most complex workflows of GPT-4o and Gemini 1.5 Flash)
Universality of DC+Few-shot: Divide-and-conquer instructions and few-shot demonstrations yield significant improvements across all model types
Diminishing Returns of Complexity: Increasing workflow complexity does not always yield better results

Limitations

Limited Evaluation Scope: Focuses only on lightweight workflows, potentially missing performance ceilings of more complex designs
Single Dataset: Evaluated only on BIRD Mini-Dev, lacking broader validation
Relative Nature of Performance Metrics: Reported latency and token consumption are affected by external factors and should be viewed as indicative rather than absolute

Future Directions

Examine more complex workflow designs
Validate findings on broader datasets
Explore applicability of these strategies to other tasks
Optimize product design to manage user expectations

In-Depth Evaluation

Strengths

Practical Orientation: Focuses on industry-ready solutions considering real deployment constraints
Multi-dimensional Assessment: Considers not only accuracy but also latency and resource consumption, providing comprehensive perspective for practical applications
Systematic Comparison: Simultaneously evaluates general-purpose and reasoning models, providing valuable comparative insights
Detailed Error Analysis: Deep understanding of failure patterns across different methods through error classification

Weaknesses

Limited Sample Size: Uses only 500 samples from BIRD Mini-Dev, potentially affecting generalizability of conclusions
Incomplete Model Coverage: Lacks comparison with other mainstream models (e.g., Claude, LLaMA series)
Conservative Workflow Design: Focus on lightweight methods may miss potential of more advanced techniques
Absence of User Studies: No evaluation of real user experience

Impact

Academic Contribution: Provides systematic benchmarking of test-time scaling strategies for Text2SQL domain
Industrial Value: Offers practical guidance for enterprise deployment of Text2SQL systems
Methodological Inspiration: Multi-dimensional evaluation framework applicable to industrialization of other NLP tasks

Applicable Scenarios

Enterprise Database Querying: Suitable for enterprise environments requiring rapid deployment balancing accuracy and efficiency
Prototype Development: Provides validated workflow patterns for rapid prototyping of Text2SQL systems
Model Selection Guidance: Helps developers select appropriate base models and workflow strategies based on specific requirements

References

The paper cites important works in the Text2SQL domain, including:

BIRD benchmark dataset (Li et al., 2023)
DIN-SQL decomposition method (Pourreza & Rafiei, 2023)
CHESS contextual synthesis (Talaei et al., 2024)
ReAct reasoning framework (Yao et al., 2023)
Chain-of-Thought prompting (Wei et al., 2022)

This research provides valuable empirical guidance for practical deployment of Text2SQL systems, particularly in balancing accuracy, efficiency, and complexity. Its findings are significant for advancing Text2SQL technology from research prototypes toward industrial applications.