2025-11-17T07:49:13.607812

Rethinking Agentic Workflows: Evaluating Inference-Based Test-Time Scaling Strategies in Text2SQL Tasks

Guo, Patel, Ono et al.
Large language models (LLMs) are increasingly powering Text-to-SQL (Text2SQL) systems, enabling non-expert users to query industrial databases using natural language. While test-time scaling strategies have shown promise in LLM-based solutions, their effectiveness in real-world applications, especially with the latest reasoning models, remains uncertain. In this work, we benchmark six lightweight, industry-oriented test-time scaling strategies and four LLMs, including two reasoning models, evaluating their performance on the BIRD Mini-Dev benchmark. Beyond standard accuracy metrics, we also report inference latency and token consumption, providing insights relevant for practical system deployment. Our findings reveal that Divide-and-Conquer prompting and few-shot demonstrations consistently enhance performance for both general-purpose and reasoning-focused LLMs. However, introducing additional workflow steps yields mixed results, and base model selection plays a critical role. This work sheds light on the practical trade-offs between accuracy, efficiency, and complexity when deploying Text2SQL systems.

Basic Information

  • Paper ID: 2510.10885
  • Title: Rethinking Agentic Workflows: Evaluating Inference-Based Test-Time Scaling Strategies in Text2SQL Tasks
  • Authors: Jiajing Guo, Kenil Patel, Jorge Piazentin Ono, Wenbin He, Liu Ren (Bosch Research North America, USA)
  • Classification: cs.CL (Computation and Language), cs.DB (Databases)
  • Conference: Workshop on Test-time Scaling and Reasoning Models at COLM 2025
  • Paper Link: https://arxiv.org/abs/2510.10885

Abstract

Large Language Models (LLMs) are increasingly powering Text-to-SQL systems, enabling non-expert users to query industrial databases using natural language. While test-time scaling strategies have shown promise in LLM-based solutions, their effectiveness in practical applications, particularly with the latest reasoning models, remains uncertain. This study benchmarks six lightweight, industry-oriented test-time scaling strategies across four LLMs (including two reasoning models) and evaluates their performance on the BIRD Mini-Dev benchmark. Beyond standard accuracy metrics, inference latency and token consumption are reported, providing relevant insights for practical system deployment. The study finds that divide-and-conquer prompting and few-shot demonstrations consistently enhance performance for both general-purpose and reasoning-oriented LLMs. However, introducing additional workflow steps yields mixed results, with the choice of base model playing a critical role.

Research Background and Motivation

Problem Definition

The core problem addressed by this research is: How effective are test-time scaling strategies for different types of LLMs in Text2SQL tasks, particularly regarding performance trade-offs in practical industrial application scenarios?

Research Significance

  1. Practical Value: Text2SQL systems enable non-technical users to access enterprise databases through natural language, offering significant commercial value
  2. Technical Challenge: With the emergence of reasoning models such as OpenAI o-series and Gemini 2.5, there is a need to reassess the necessity of traditional workflow engineering approaches
  3. Industrial Demand: Practical deployment requires balancing accuracy, latency, and complexity

Limitations of Existing Approaches

  1. Existing research often focuses on complex agentic workflows that may be overly complicated for industrial applications
  2. Lack of systematic evaluation of reasoning models in Text2SQL tasks
  3. Few studies simultaneously consider accuracy and system performance metrics (e.g., latency, token consumption)

Research Motivation

The authors pose three key questions:

  • Given advances in reasoning models, does extensive prompting and workflow engineering remain valuable?
  • Which test-time scaling strategies best balance accuracy and latency?
  • How can workflows be optimized for industrial applications?

Core Contributions

  1. Systematic Benchmarking: Comprehensive evaluation of six lightweight, industry-oriented agentic workflows across four LLMs (including general-purpose and reasoning models)
  2. Multi-dimensional Assessment: Beyond accuracy metrics, provides detailed analysis of inference latency and token consumption
  3. Practical Insights: Discovers that divide-and-conquer instructions and few-shot demonstrations yield significant improvements across all models
  4. Industrial Deployment Guidance: Provides actionable guidance on accuracy, efficiency, and complexity trade-offs for practical Text2SQL system deployment

Methodology Details

Task Definition

The Text2SQL task aims to translate natural language questions into executable SQL queries. Input consists of natural language questions and database schemas, with output being the corresponding SQL query.
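
To make the input/output contract concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `Text2SQLExample` class, the `schema_of` helper, and the toy `schools` table are hypothetical names chosen for illustration; they are not taken from the paper or from the BIRD tooling.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Text2SQLExample:
    """One task instance: the inputs plus a reference answer (hypothetical layout)."""
    question: str  # natural language question
    schema: str    # database schema, here as CREATE TABLE text
    gold_sql: str  # reference SQL query


def schema_of(conn: sqlite3.Connection) -> str:
    """Collect the CREATE TABLE statements so they can be placed in a prompt."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return "\n".join(row[0] for row in rows)


# Toy in-memory database standing in for a real benchmark database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schools (id INTEGER PRIMARY KEY, name TEXT, county TEXT)")
conn.executemany(
    "INSERT INTO schools (name, county) VALUES (?, ?)",
    [("Alpha High", "Alameda"), ("Beta High", "Fresno")],
)

example = Text2SQLExample(
    question="How many schools are in Alameda county?",
    schema=schema_of(conn),
    gold_sql="SELECT COUNT(*) FROM schools WHERE county = 'Alameda'",
)
result = conn.execute(example.gold_sql).fetchall()
```

A Text2SQL system receives `question` and `schema` and is expected to produce a query whose execution result matches that of `gold_sql`.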

Six Agentic Workflows

1. CoT + ReAct (Baseline)

  • Flow: SW > EX <> SR
  • Description: Uses the ReAct agent's thought-action-observation cycle, iteratively refining the query when execution raises an error or returns an empty result
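
The execute-and-refine cycle can be sketched as a small loop. This is an illustration under stated assumptions, not the paper's implementation: `write_sql` and `refine_sql` are hypothetical stand-ins for LLM calls, and the cap of three refinement rounds is an arbitrary choice.

```python
import sqlite3


def execute(conn, sql):
    """Run a query and turn the outcome into (rows, observation)."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error as exc:
        return [], f"execution error: {exc}"
    # An empty result is treated as a signal that the query may be wrong.
    return rows, (None if rows else "empty result")


def run_with_refinement(conn, question, write_sql, refine_sql, max_rounds=3):
    """Write a query, then refine it while execution reports a problem."""
    sql = write_sql(question)
    rows, observation = execute(conn, sql)
    rounds = 0
    while observation is not None and rounds < max_rounds:
        sql = refine_sql(question, sql, observation)
        rows, observation = execute(conn, sql)
        rounds += 1
    return sql, rows, rounds


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schools (name TEXT, county TEXT)")
conn.executemany(
    "INSERT INTO schools VALUES (?, ?)",
    [("Alpha High", "Alameda"), ("Beta High", "Fresno")],
)

# Stub "LLM": the first draft has a misspelled column, the refinement fixes it.
first_draft = lambda question: "SELECT nme FROM schools WHERE county = 'Alameda'"
fix = lambda question, sql, observation: sql.replace("nme", "name")

final_sql, rows, rounds = run_with_refinement(
    conn, "Which schools are in Alameda county?", first_draft, fix
)
```

Here the first draft fails with a "no such column" error, the observation is passed back to the refinement step, and the second attempt succeeds after one round.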

2. Divide-and-Conquer (With/Without Few-shot)

  • Flow: SW > EX <> SR
  • Innovation: Decomposes complex questions into a series of smaller sub-problems, solves them sequentially, and combines responses
  • Variants: Evaluated with and without few-shot demonstrations; the two variants count as separate workflows, which brings the total to six
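
A divide-and-conquer prompt with optional few-shot demonstrations might be assembled as follows. The instruction wording, the `build_dc_prompt` function, and the demonstration are all hypothetical; the paper's actual prompts are not reproduced here.

```python
# Hypothetical instruction text, not the wording used in the paper.
DC_INSTRUCTIONS = (
    "Break the question into smaller sub-questions, answer each one with SQL, "
    "then combine the partial answers into one final SQL query."
)


def build_dc_prompt(question, schema, demonstrations=()):
    """Assemble the prompt; pass demonstrations for the few-shot variant."""
    parts = [DC_INSTRUCTIONS, "Schema:\n" + schema]
    for number, (demo_question, demo_steps, demo_sql) in enumerate(demonstrations, 1):
        steps = "\n".join(f"  - {step}" for step in demo_steps)
        parts.append(
            f"Example {number}\nQuestion: {demo_question}\n"
            f"Sub-questions:\n{steps}\nFinal SQL: {demo_sql}"
        )
    parts.append(f"Question: {question}\nSub-questions:")
    return "\n\n".join(parts)


schema = "CREATE TABLE schools (name TEXT, county TEXT, enrollment INTEGER)"
demo = (
    "Which county has the most schools?",
    ["Count schools per county", "Pick the county with the highest count"],
    "SELECT county FROM schools GROUP BY county ORDER BY COUNT(*) DESC LIMIT 1",
)

zero_shot = build_dc_prompt("What is the largest school in Fresno?", schema)
few_shot = build_dc_prompt("What is the largest school in Fresno?", schema, [demo])
```

The only difference between the two variants is whether worked demonstrations are placed between the schema and the target question.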

3. Parallel Scaling

  • Flow: (SW > EX <> SR) ∥ 5 > MV / CS
  • Mechanism: Runs the base workflow several times in parallel (five, per the flow notation) to generate candidate answers, then selects the final answer by majority voting; if no majority emerges, a candidate selector agent chooses
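
The selection step can be sketched as below. Two assumptions are made for illustration: votes are cast on execution results rather than on SQL text, and "majority" means more than half of all candidates. `fallback_selector` is a hypothetical stand-in for the candidate selector agent.

```python
import sqlite3
from collections import Counter


def result_key(conn, sql):
    """Order-insensitive fingerprint of a query result, or None on error."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return tuple(sorted(map(repr, rows)))


def select_candidate(conn, candidates, fallback_selector):
    """Return the majority candidate, or defer to the fallback selector."""
    keyed = [(sql, result_key(conn, sql)) for sql in candidates]
    votes = Counter(key for _, key in keyed if key is not None)
    if votes:
        key, count = votes.most_common(1)[0]
        if count > len(candidates) / 2:  # strict majority of all candidates
            return next(sql for sql, k in keyed if k == key)
    return fallback_selector(candidates)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schools (id INTEGER PRIMARY KEY, name TEXT, county TEXT)")
conn.executemany(
    "INSERT INTO schools (name, county) VALUES (?, ?)",
    [("Alpha High", "Alameda"), ("Beta High", "Fresno"), ("Gamma High", "Alameda")],
)

candidates = [
    "SELECT COUNT(*) FROM schools WHERE county = 'Alameda'",
    "SELECT COUNT(id) FROM schools WHERE county = 'Alameda'",
    "SELECT COUNT(*) FROM schools",
    "SELECT COUNT(*) FROM schools WHERE county LIKE 'Alameda'",
    "SELECT COUNT(*) FROM school",  # fails: no such table
]
chosen = select_candidate(conn, candidates, lambda options: options[0])
```

Three of the five candidates return the same result, so the first of them is chosen without consulting the fallback.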

4. Result Verification

  • Flow: SW > EX <> SR <> FP
  • Objective: Catches SQL queries that execute without error but are semantically wrong; a feedback provider reviews the result and decides whether further refinement is needed
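
A verification loop of this kind might look like the sketch below. The function names, the two-round limit, and the rule-based stubs are assumptions for illustration; in the paper the feedback provider is an LLM agent.

```python
import sqlite3


def verify_and_refine(question, sql, run_sql, feedback_provider, refine_sql,
                      max_rounds=2):
    """Ask for feedback on an executable query and refine while it objects."""
    rows = run_sql(sql)
    for _ in range(max_rounds):
        feedback = feedback_provider(question, sql, rows)
        if feedback is None:  # the verifier accepts the current answer
            break
        sql = refine_sql(question, sql, feedback)
        rows = run_sql(sql)
    return sql, rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schools (name TEXT, county TEXT)")
conn.executemany(
    "INSERT INTO schools VALUES (?, ?)",
    [("Alpha High", "Alameda"), ("Beta High", "Fresno"), ("Gamma High", "Alameda")],
)


def stub_feedback(question, sql, rows):
    """Stand-in verifier: flag a missing county filter."""
    if "Alameda" in question and "Alameda" not in sql:
        return "The question restricts to Alameda county, but the query does not."
    return None


def stub_refine(question, sql, feedback):
    """Stand-in refiner: add the filter the verifier asked for."""
    return sql + " WHERE county = 'Alameda'"


verified_sql, verified_rows = verify_and_refine(
    "How many schools are in Alameda county?",
    "SELECT COUNT(*) FROM schools",  # runs fine, but answers a different question
    lambda query: conn.execute(query).fetchall(),
    stub_feedback,
    stub_refine,
)
```

The initial query executes successfully, so a syntax-only loop would accept it; the verifier is what triggers the extra refinement round.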

5. Retrieval-based Structured Reasoning

  • Flow: KE > (ER ∥ CR) > SW > EX <> SR
  • Adapted from: CHESS method
  • Steps:
    • Keyword extractor identifies keywords in questions
    • Entity retriever (LSH index-based) and column retriever (semantic similarity-based) run in parallel
    • Retrieved information passed to SQL writer
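
The parallel retrieval stage can be sketched with a thread pool. This is a simplified stand-in: character-trigram Jaccard similarity replaces the LSH index, token overlap replaces semantic similarity, and the keyword list is hard-coded instead of being extracted by an LLM. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def trigrams(text):
    """Set of character trigrams of a lowercased string."""
    lowered = text.lower()
    return {lowered[i:i + 3] for i in range(len(lowered) - 2)}


def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0


def retrieve_entities(keywords, values, threshold=0.5):
    """Database values that closely match some keyword (stand-in for LSH lookup)."""
    return [
        value for value in values
        if any(jaccard(trigrams(k), trigrams(value)) >= threshold for k in keywords)
    ]


def retrieve_columns(keywords, columns):
    """Columns sharing a name token with some keyword (stand-in for embeddings)."""
    wanted = {k.lower() for k in keywords}
    return [
        column for column in columns
        if wanted & set(column.lower().replace(".", "_").split("_"))
    ]


keywords = ["Alameda", "county", "schools"]  # would come from the keyword extractor
values = ["Alameda", "Fresno", "Alameda Unified"]
columns = ["schools.name", "schools.county", "scores.avg_math"]

# The two retrievers are independent, so they can run side by side.
with ThreadPoolExecutor(max_workers=2) as pool:
    entity_future = pool.submit(retrieve_entities, keywords, values)
    column_future = pool.submit(retrieve_columns, keywords, columns)
    entities, relevant_columns = entity_future.result(), column_future.result()

# Both results would then be added to the SQL writer's prompt.
context = {"entities": entities, "columns": relevant_columns}
```

Note that the SQL writer only sees what the retrievers return, which is one way a retrieval step can withhold information the model would otherwise have had.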

Technical Innovations

  1. Lightweight Design: Focuses on industry-ready workflows rather than complex approaches from literature
  2. Multi-model Comparison: Simultaneously evaluates general-purpose models (GPT-4o, Gemini 1.5 Flash) and reasoning models (o4-mini, Gemini 2.5 Flash)
  3. Comprehensive Evaluation: Multi-dimensional assessment framework combining accuracy, latency, and resource consumption

Experimental Setup

Dataset

  • Name: BIRD Mini-Dev benchmark
  • Scale: 500 question-SQL pairs
  • Source: Derived subset from original BIRD Dev collection
  • Characteristics: Includes complex cross-table queries and real-world database scenarios

Evaluation Metrics

Accuracy Metrics

  1. Soft F1-Score: Evaluates SQL query correctness by measuring similarity between tables generated by predicted and ground-truth queries
  2. Execution Accuracy (EX): Percentage of generated SQL queries producing identical results to ground truth
  3. Reward-based Valid Efficiency Score (R-VES): Quantifies model efficiency in generating correct and optimized SQL queries

System Performance Metrics

  1. Execution Error Rate: Percentage of tasks encountering syntax execution errors in workflows
  2. Inference Time: Duration from receiving user question to generating SQL query (seconds)
  3. Number of LLM Calls: Average number of LLM calls used in workflows
  4. Token Count: Average prompt and completion tokens required to generate individual SQL queries (in thousands)
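
These system metrics can be collected by wrapping the LLM call. In this sketch, `UsageStats` and `metered` are hypothetical names, and whitespace splitting is a crude stand-in for token counting; a real deployment would read token usage from the provider's response instead.

```python
import time
from dataclasses import dataclass


@dataclass
class UsageStats:
    llm_calls: int = 0
    prompt_tokens: int = 0
    completion_tokens: int = 0
    seconds: float = 0.0


def metered(llm, stats, count_tokens=lambda text: len(text.split())):
    """Wrap an LLM callable so every call updates the shared counters."""
    def wrapper(prompt):
        reply = llm(prompt)
        stats.llm_calls += 1
        stats.prompt_tokens += count_tokens(prompt)
        stats.completion_tokens += count_tokens(reply)
        return reply
    return wrapper


stats = UsageStats()
llm = metered(lambda prompt: "SELECT 1", stats)  # stub model with a fixed reply

# Inference time is measured end to end, from question to final SQL.
start = time.perf_counter()
llm("write sql for question")
llm("fix the sql")
stats.seconds = time.perf_counter() - start
```

After the two calls, `stats` holds 2 LLM calls, 7 prompt tokens, and 4 completion tokens under the whitespace-counting assumption.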

Comparison Methods

Four LLMs:

  • Gemini 1.5 Flash (general-purpose model)
  • Gemini 2.5 Flash (reasoning model)
  • GPT-4o (general-purpose model)
  • o4-mini (reasoning model)

Implementation Details

  • All workflows include syntax correction iterations
  • Latency measurements affected by multiple factors (model region, network latency, server resources, etc.)
  • BIRD Mini-Dev used for efficiency-conscious evaluation

Experimental Results

Main Findings

RQ1: Reasoning Models vs. General-Purpose Models Performance

  • Key Finding: The DC 3-shot + ReAct workflow (Divide-and-Conquer with three few-shot demonstrations) consistently improves Soft F1 scores across all models
  • GPT-4o: Improved from baseline 61.1 to 64.4
  • o4-mini: Improved from baseline 56.3 to 65.5
  • Conclusion: Even specialized reasoning models benefit from explicit decomposition instructions and demonstrations
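
The size of these gains is easier to compare when expressed as absolute and relative differences. The short calculation below uses only the Soft F1 values quoted above.

```python
# Soft F1 as (baseline, DC 3-shot + ReAct), taken from the figures above.
soft_f1 = {"GPT-4o": (61.1, 64.4), "o4-mini": (56.3, 65.5)}

# Absolute gain in Soft F1 points.
gains = {model: round(after - before, 1) for model, (before, after) in soft_f1.items()}

# Relative gain as a percentage of the baseline score.
relative = {
    model: round(100 * (after - before) / before, 1)
    for model, (before, after) in soft_f1.items()
}
```

The absolute gains are 3.3 points for GPT-4o and 9.2 points for o4-mini (about 5.4% and 16.3% relative to baseline), so in these figures the reasoning model gains more from the added guidance than the general-purpose model does.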

RQ2: Most Effective Scaling Methods

  1. Best Combination: Divide-and-Conquer + few-shot demonstrations + ReAct shows consistent improvements across all models
  2. Verification Method: Provides reliable performance improvements on most models
    • Gemini 1.5 Flash: 62.58 → 63.63
    • Gemini 2.5 Flash: 68.12 → 68.44
    • GPT-4o: 64.44 → 64.95
  3. Retrieval-augmented Method: Generally underperforms, falling below DC 3-shot+ReAct on nearly all models

RQ3: Trade-offs Between Accuracy and System Performance

  1. Significant Latency Differences:
    • Gemini Flash models: 5.02-12.03 seconds
    • GPT-4o and o4-mini: 15.70-18.43 seconds
  2. Cost of Incorrect Answers: Incorrect answers take 19.58% longer to generate than correct ones
  3. Complexity Impact: More challenging questions require longer processing times, consume more tokens, and often yield lower accuracy

Error Analysis

Through error analysis, the study discovers:

  • Wrong Query Logic is the most common failure type across all methods and models
  • Retrieval-augmented methods consistently exacerbate this problem
  • Retrieval methods also increase the ratio of Schema Linking Errors

Case Analysis

The paper conducts detailed error analysis, classifying failure cases using the o4-mini model. It finds that retrieval-augmented methods may deprive models of critical information in complex reasoning tasks, leading to performance degradation.

Related Work

Text2SQL Agentic Workflows

The paper systematically reviews existing Text2SQL agentic workflows, including:

  • DIN-SQL's decomposed in-context learning
  • MAC-SQL's multi-agent collaborative framework
  • CHESS's contextual SQL synthesis
  • R3's consensus multi-agent system

Test-Time Scaling Strategies

Covers multiple strategies including structured reasoning steps, parallel execution, verification, and result aggregation. These methods decompose query generation into modular steps through sequential workflows.

Conclusions and Discussion

Main Conclusions

  1. Importance of Base Model: Strong base models matter more than workflow complexity (Gemini 2.5 Flash baseline performance exceeds the most complex workflows of GPT-4o and Gemini 1.5 Flash)
  2. Universality of DC+Few-shot: Divide-and-conquer instructions and few-shot demonstrations yield significant improvements across all model types
  3. Diminishing Returns of Complexity: Increasing workflow complexity does not always yield better results

Limitations

  1. Limited Evaluation Scope: Focuses only on lightweight workflows, potentially missing performance ceilings of more complex designs
  2. Single Dataset: Evaluated only on BIRD Mini-Dev, lacking broader validation
  3. Relative Nature of Performance Metrics: Reported latency and token consumption are affected by external factors and should be viewed as indicative rather than absolute

Future Directions

  1. Examine more complex workflow designs
  2. Validate findings on broader datasets
  3. Explore applicability of these strategies to other tasks
  4. Optimize product design to manage user expectations

In-Depth Evaluation

Strengths

  1. Practical Orientation: Focuses on industry-ready solutions considering real deployment constraints
  2. Multi-dimensional Assessment: Considers not only accuracy but also latency and resource consumption, providing comprehensive perspective for practical applications
  3. Systematic Comparison: Simultaneously evaluates general-purpose and reasoning models, providing valuable comparative insights
  4. Detailed Error Analysis: Deep understanding of failure patterns across different methods through error classification

Weaknesses

  1. Limited Sample Size: Uses only 500 samples from BIRD Mini-Dev, potentially affecting generalizability of conclusions
  2. Incomplete Model Coverage: Lacks comparison with other mainstream models (e.g., Claude, LLaMA series)
  3. Conservative Workflow Design: Focus on lightweight methods may miss potential of more advanced techniques
  4. Absence of User Studies: No evaluation of real user experience

Impact

  1. Academic Contribution: Provides systematic benchmarking of test-time scaling strategies for Text2SQL domain
  2. Industrial Value: Offers practical guidance for enterprise deployment of Text2SQL systems
  3. Methodological Inspiration: Multi-dimensional evaluation framework applicable to industrialization of other NLP tasks

Applicable Scenarios

  1. Enterprise Database Querying: Suitable for enterprise environments requiring rapid deployment balancing accuracy and efficiency
  2. Prototype Development: Provides validated workflow patterns for rapid prototyping of Text2SQL systems
  3. Model Selection Guidance: Helps developers select appropriate base models and workflow strategies based on specific requirements

References

The paper cites important works in the Text2SQL domain, including:

  • BIRD benchmark dataset (Li et al., 2023)
  • DIN-SQL decomposition method (Pourreza & Rafiei, 2023)
  • CHESS contextual synthesis (Talaei et al., 2024)
  • ReAct reasoning framework (Yao et al., 2023)
  • Chain-of-Thought prompting (Wei et al., 2022)

This research provides valuable empirical guidance for practical deployment of Text2SQL systems, particularly in balancing accuracy, efficiency, and complexity. Its findings are significant for advancing Text2SQL technology from research prototypes toward industrial applications.