Large language models (LLMs) have been successfully applied to many tasks, including text-to-SQL generation. However, much of this work has focused on publicly available datasets, such as Fiben, Spider, and Bird. Our earlier work showed that LLMs are much less effective in querying large private enterprise data warehouses and released Beaver, the first private enterprise text-to-SQL benchmark. To create Beaver, we leveraged SQL logs, which are often readily available. However, manually annotating these logs to identify which natural language questions they answer is a daunting task. Asking database administrators, who are highly trained experts, to take on additional work to construct and validate corresponding natural language utterances is not only challenging but also quite costly. To address this challenge, we introduce BenchPress, a human-in-the-loop system designed to accelerate the creation of domain-specific text-to-SQL benchmarks. Given a SQL query, BenchPress uses retrieval-augmented generation (RAG) and LLMs to propose multiple natural language descriptions. Human experts then select, rank, or edit these drafts to ensure accuracy and domain alignment. We evaluated BenchPress on annotated enterprise SQL logs, demonstrating that LLM-assisted annotation drastically reduces the time and effort required to create high-quality benchmarks. Our results show that combining human verification with LLM-generated suggestions enhances annotation accuracy, benchmark reliability, and model evaluation robustness. By streamlining the creation of custom benchmarks, BenchPress offers researchers and practitioners a mechanism for assessing text-to-SQL models on a given domain-specific workload. BenchPress is freely available via our public GitHub repository at https://github.com/fabian-wenz/enterprise-txt2sql and is also accessible on our website at http://dsg-mcgraw.csail.mit.edu:5000.
- Paper ID: 2510.13853
- Title: BenchPress: A Human-in-the-Loop Annotation System for Rapid Text-to-SQL Benchmark Curation
- Authors: Fabian Wenz (TU Munich & MIT), Omar Bouattour (TU Munich & MIT), Devin Yang (MIT), Justin Choi (MIT), Cecil Gregg (MIT), Nesime Tatbul (Intel Labs & MIT), Çağatay Demiralp (AWS AI Labs & MIT)
- Classification: cs.CL, cs.AI, cs.DB, cs.HC
- Venue: CIDR 2026 (16th Annual Conference on Innovative Data Systems Research)
- Paper Link: https://arxiv.org/abs/2510.13853
Large language models (LLMs) have been applied successfully to many tasks, including text-to-SQL generation, but most research targets public datasets such as Fiben, Spider, and Bird. The authors' prior work showed that LLM performance degrades sharply on large private enterprise data warehouses and introduced Beaver, the first private enterprise text-to-SQL benchmark. Because manually annotating SQL logs is slow and costly, this paper proposes BenchPress, a human-in-the-loop system that accelerates the creation of domain-specific text-to-SQL benchmarks. For each SQL query, the system uses retrieval-augmented generation (RAG) and LLMs to draft multiple natural language descriptions, which human experts then select, rank, or edit to ensure accuracy and domain alignment. In the reported experiments, BenchPress substantially reduces the time and effort needed to create high-quality benchmarks.
- Gap between public benchmarks and enterprise reality: While LLMs perform well on public datasets such as Spider, Bird, and Fiben, execution accuracy on enterprise data warehouses drops dramatically (Figure 1 shows a decline from above 90% to near 0%)
- Difficulty in annotating enterprise SQL logs: Manually creating corresponding natural language questions for SQL queries is both time-consuming and expensive, requiring participation from skilled database administrators
- Domain-specific challenges: Enterprise data exhibits complex schemas, domain-specific terminology, privacy constraints, and other distinctive characteristics
- Enterprises require performance evaluation of text-to-SQL models on private data before deployment
- Prevention of deployment failures caused by domain mismatch
- Support for model domain adaptation and fine-tuning strategy optimization
- Public benchmarks lack enterprise-specific complexity (schema ambiguity, domain terminology, etc.)
- Fully manual annotation is costly and inefficient
- General-purpose LLMs lack domain context and structured support
- Proposes the BenchPress system: The first human-in-the-loop annotation system specifically designed for rapid creation of domain-specific text-to-SQL benchmarks
- Innovative workflow design: A modular architecture combining Retrieval-Augmented Generation (RAG), query decomposition, and human feedback
- Comprehensive user study: Comparative experiments demonstrating BenchPress's advantages in annotation accuracy, efficiency, and semantic fidelity
- Open-source tool: Provides a readily usable system supporting multiple public benchmarks and enterprise data
Input: SQL query + database schema + optional historical annotation examples
Output: Corresponding natural language description
Constraints: Maintain semantic accuracy, domain terminology consistency, and privacy protection
- Project setup: Select or create an annotation project for specific enterprise workloads
- Data ingestion: Upload SQL logs and schema files, or select from supported public benchmarks
- Task configuration: Choose annotation direction (currently supports SQL-to-NL) and language model
- Query decomposition (optional): Rewrite nested SQL queries as a series of Common Table Expressions (CTEs)
- Context retrieval: Use dense vector embeddings (e.g., Sentence-BERT) to retrieve semantically similar examples and relevant table schemas
- Candidate generation: LLM generates four candidate natural language descriptions based on retrieved context
- Reassembly (optional): Merge sub-query-level descriptions into complete query explanations
- Human feedback: Annotators rank, edit, or discard LLM outputs
- Review and export: Assess output quality and export the results in benchmark format
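The workflow steps above can be sketched as a small pipeline. This is a minimal illustration rather than BenchPress's actual code: the names `Example`, `build_prompt`, `propose_candidates`, and `human_review`, the prompt wording, and the `stub_generate` function standing in for an LLM call are all assumptions. Only the overall shape (retrieved context in, four candidate descriptions out, a human selects, edits, or discards) comes from the description.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

NUM_CANDIDATES = 4  # the summary states four candidates per query


@dataclass
class Example:
    """A previously annotated (SQL, description) pair. Hypothetical type."""
    sql: str
    description: str


def build_prompt(sql: str, schema: str, examples: Sequence[Example]) -> str:
    """Assemble a prompt from the query, its schema, and retrieved examples."""
    shots = "\n\n".join(
        f"SQL: {ex.sql}\nDescription: {ex.description}" for ex in examples
    )
    return (
        "Describe the SQL query in plain English.\n\n"
        f"Schema:\n{schema}\n\n"
        f"Examples:\n{shots}\n\n"
        f"SQL: {sql}\nDescription:"
    )


def propose_candidates(
    sql: str,
    schema: str,
    examples: Sequence[Example],
    generate: Callable[[str, int], List[str]],
) -> List[str]:
    """Ask the pluggable generator for NUM_CANDIDATES drafts."""
    return generate(build_prompt(sql, schema, examples), NUM_CANDIDATES)


def human_review(
    candidates: Sequence[str], choice: Optional[int], edit: Optional[str] = None
) -> Optional[str]:
    """Model the annotator's decision: pick a draft, edit it, or discard all."""
    if choice is None:
        return None  # every draft was rejected
    return edit if edit is not None else candidates[choice]


def stub_generate(prompt: str, n: int) -> List[str]:
    """Stand-in for an LLM call; returns n placeholder drafts."""
    return [f"draft {i + 1}" for i in range(n)]


examples = [Example("SELECT COUNT(*) FROM students", "How many students are there?")]
drafts = propose_candidates(
    "SELECT name FROM students WHERE year = 1",
    "students(name, year)",
    examples,
    stub_generate,
)
final = human_review(drafts, choice=0, edit="List the names of first-year students.")
```

Passing the generator in as a callable keeps the sketch independent of any particular model or API; a real deployment would replace `stub_generate` with a call to the chosen LLM.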
- Use dense vector search to retrieve semantically similar SQL queries and their annotations
- Embed examples in prompts to provide realistic expression patterns and schema usage guidance
- Balance informativeness with prompt efficiency by selecting only the top-k retrieved examples
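The retrieval step described above amounts to nearest-neighbor search over embeddings. The sketch below is illustrative only: to stay dependency-free it uses a toy bag-of-words vector in place of a dense sentence embedding such as Sentence-BERT, and the function names, corpus, and queries are invented for the example.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Dict[str, float]:
    """Toy bag-of-words vector; a placeholder for a dense sentence embedding."""
    tokens = re.findall(r"[a-z_]+", text.lower())
    return dict(Counter(tokens))


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def top_k(
    query_sql: str, corpus: List[Tuple[str, str]], k: int = 2
) -> List[Tuple[str, str]]:
    """Return the k annotated (SQL, description) pairs most similar to the query."""
    q = embed(query_sql)
    ranked = sorted(corpus, key=lambda pair: cosine(q, embed(pair[0])), reverse=True)
    return ranked[:k]


# Hypothetical annotated examples.
corpus = [
    ("SELECT name FROM students WHERE year = 1", "List first-year students."),
    ("SELECT COUNT(*) FROM courses", "How many courses are offered?"),
    ("SELECT name FROM students WHERE gpa > 3.5", "List students with a GPA above 3.5."),
]
hits = top_k("SELECT name FROM students WHERE year = 2", corpus, k=2)
```

Here the two `students` queries rank above the `courses` query, so only they would be embedded in the prompt. Keeping `k` small is what trades informativeness against prompt length.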
- Decompose structurally complex nested queries
- Generate natural language descriptions for sub-queries independently before reassembly
- Reduce cognitive load and improve annotation accuracy
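To make the decomposition idea concrete, the sketch below rewrites one nested query by hand as a chain of CTEs and checks with SQLite that both forms return the same rows. The table, data, and queries are invented for illustration, and the rewrite is manual; it does not reproduce how BenchPress performs decomposition automatically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE enrollments (student TEXT, course TEXT, grade REAL);
    INSERT INTO enrollments VALUES
        ('ana', 'db', 3.7), ('ana', 'ml', 3.9),
        ('bo',  'db', 2.8), ('bo',  'ml', 3.0),
        ('cy',  'db', 3.5);
    """
)

# Original form: a derived table plus a scalar subquery.
nested = """
SELECT student FROM (
    SELECT student, AVG(grade) AS avg_grade
    FROM enrollments
    GROUP BY student
) AS per_student
WHERE avg_grade > (SELECT AVG(grade) FROM enrollments)
ORDER BY student
"""

# Decomposed form: each sub-query becomes a named CTE that can be
# described on its own before the descriptions are merged.
decomposed = """
WITH per_student AS (
    SELECT student, AVG(grade) AS avg_grade
    FROM enrollments
    GROUP BY student
),
overall AS (
    SELECT AVG(grade) AS avg_grade FROM enrollments
)
SELECT per_student.student
FROM per_student, overall
WHERE per_student.avg_grade > overall.avg_grade
ORDER BY per_student.student
"""

nested_rows = conn.execute(nested).fetchall()
cte_rows = conn.execute(decomposed).fetchall()
assert nested_rows == cte_rows
```

Each CTE now maps to a short description ("each student's average grade", "the average grade across all enrollments"), and the final `SELECT` supplies the sentence that ties them together: students whose average grade exceeds the overall average.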
- Structured iterative review process ensures enterprise quality standards
- Support prompt optimization and feedback-driven improvement loops
- Follow responsible AI design principles aligned with Google PAIR guidelines
- Beaver: The first private enterprise text-to-SQL benchmark based on SQL logs from MIT and other institutions, containing 300+ schemas and nearly 4,000 queries
- Bird: A large-scale public database benchmark
- A total of 30 SQL queries for the user study, sourced from the Beaver and Bird datasets (anonymized)
- Annotation accuracy: Manual verification of NL description fidelity to SQL queries
- Annotation latency: Total annotation time per participant
- Semantic fidelity: Assessed through back-translation tasks using a 5-level rating scale
- BenchPress group: Uses the complete BenchPress interface
- Manual group: Provided only with schema files and logs, with no LLM support
- Generic LLM group: Uses the standard ChatGPT interface, with no RAG support
- 18 participants stratified by SQL proficiency into advanced and non-advanced levels
- Balanced Latin square design ensures counterbalancing
- Each participant annotates the same 30 SQL queries
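For readers unfamiliar with counterbalancing, the sketch below generates a balanced Latin square (a Williams design) for three conditions and spreads 18 participants across the resulting orders. It assumes the counterbalanced factor is the order of the three annotation conditions; the summary does not spell out the exact assignment used in the study, so this illustrates the general technique only.

```python
from collections import Counter
from typing import List


def williams_design(n: int) -> List[List[int]]:
    """Return condition orders (as indices) for a balanced Latin square."""
    # The first row follows the pattern 0, 1, n-1, 2, n-2, ...
    first = [0]
    lo, hi = 1, n - 1
    take_lo = True
    while len(first) < n:
        if take_lo:
            first.append(lo)
            lo += 1
        else:
            first.append(hi)
            hi -= 1
        take_lo = not take_lo
    # The remaining rows are cyclic shifts of the first.
    rows = [[(x + shift) % n for x in first] for shift in range(n)]
    # With an odd number of conditions, the mirrored rows are also needed
    # so that every condition directly follows every other equally often.
    if n % 2 == 1:
        rows += [list(reversed(r)) for r in rows]
    return rows


conditions = ["BenchPress", "Generic LLM", "Manual"]
orders = [[conditions[i] for i in row] for row in williams_design(len(conditions))]

# Assign 18 participants round-robin, so each order is used three times.
assignment = {p: orders[p % len(orders)] for p in range(18)}
```

With three conditions the design yields six orders, which divides evenly into 18 participants. Each condition appears in each position equally often, and each condition directly precedes each other condition equally often, which is what controls for learning and fatigue effects.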
| Method | Beaver accuracy | Bird accuracy | Overall accuracy |
|---|---|---|---|
| BenchPress | 86.1% | 100.0% | 93.0% |
| Generic LLM | 66.2% | 100.0% | 83.1% |
| Manual | 60.1% | 87.8% | 73.9% |
| Method | Beaver time | Bird time | Total time |
|---|---|---|---|
| BenchPress | 16.1 min | 12.0 min | 28.1 min |
| Generic LLM | 16.2 min | 15.8 min | 32.0 min |
| Manual | 102.1 min | 82.8 min | 183.9 min |
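The per-dataset times in the table above can be turned into speedup factors with a few lines of arithmetic. The snippet uses only the reported Beaver and Bird values; the dictionary layout and the `speedup` helper are illustrative.

```python
# Annotation time in minutes, copied from the table above.
latency_min = {
    "BenchPress": {"Beaver": 16.1, "Bird": 12.0},
    "Generic LLM": {"Beaver": 16.2, "Bird": 15.8},
    "Manual": {"Beaver": 102.1, "Bird": 82.8},
}


def speedup(baseline: str, method: str, dataset: str) -> float:
    """How many times faster `method` is than `baseline` on `dataset`."""
    return latency_min[baseline][dataset] / latency_min[method][dataset]


for dataset in ("Beaver", "Bird"):
    for baseline in ("Manual", "Generic LLM"):
        factor = speedup(baseline, "BenchPress", dataset)
        print(f"{dataset}: BenchPress vs {baseline}: {factor:.2f}x")
```

On these figures, BenchPress is roughly 6.3 times faster than manual annotation on Beaver and 6.9 times faster on Bird. Against the generic LLM interface the time difference is negligible on Beaver and about 1.3 times on Bird, so on Beaver the advantage over the generic LLM shows up mainly in the accuracy results.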
BenchPress produced the highest proportion of completely correct (Level 5) outputs in the 5-level clarity assessment, indicating superior semantic clarity.
- Tool effectiveness: BenchPress matches or outperforms the comparison methods on every reported metric
- Dataset complexity impact: Performance differences between tools are more pronounced on complex enterprise datasets (Beaver)
- Domain adaptability: BenchPress excels at handling enterprise-specific terminology and complex schemas
- Public benchmarks: Spider, Bird, Fiben, and others advance general text-to-SQL tasks
- Enterprise benchmarks: Beaver introduces enterprise-level complexity for the first time, exposing LLM difficulties with heterogeneous schemas
- Codex, GPT-4, DeepSeek, and others perform strongly on public datasets
- However, performance degrades significantly in domain-specific or enterprise environments
- Existing systems primarily target public or synthetic data
- BenchPress specifically supports human-in-the-loop workflows for private enterprise logs
- BenchPress significantly improves efficiency and quality in creating domain-specific text-to-SQL benchmarks
- Human-in-the-loop approaches outperform purely automated or purely manual methods in handling enterprise data complexity
- Public benchmarks inadequately reflect the structural and linguistic complexity of enterprise SQL logs
- Current system primarily focuses on SQL-to-text annotation
- Requires domain expert participation, incurring some human cost
- Decomposition strategies may be insufficient for extremely complex nested queries
- Bidirectional annotation: Integrate text-to-SQL generation to support iterative validation
- Robustness assessment: Systematically rephrase natural language queries in existing benchmarks
- Automation enhancement: Further reduce manual intervention requirements
- High practical value: Addresses real pain points in enterprise text-to-SQL model deployment
- Strong methodological innovation: Cleverly combines RAG, query decomposition, and human-in-the-loop collaboration
- Rigorous experimental design: Well-designed comparative experiments with comprehensive evaluation dimensions
- Open-source contribution: Provides readily usable tools and resources
- Limited user study scale: Sample size of 18 participants is relatively small
- Domain generalizability: Primarily validated in education and technology sectors; applicability to other industries remains to be verified
- Insufficient cost analysis: Lacks detailed cost-benefit analysis
- Academic contribution: Provides new methodology for enterprise AI application evaluation
- Practical value: Directly addresses real industry needs
- Reproducibility: Open-source code and detailed documentation support reproduction and extension
- Enterprises that need to evaluate text-to-SQL model performance on private data
- Research institutions constructing domain-specific text-to-SQL benchmarks
- Data teams optimizing model deployment and fine-tuning strategies
This paper cites 21 relevant references covering key areas including text-to-SQL benchmarks, LLM applications, annotation systems, and enterprise data challenges, providing a solid theoretical foundation for the research.
Summary: BenchPress is a system with significant practical value that effectively addresses efficiency and quality challenges in creating enterprise-level text-to-SQL benchmarks through innovative human-in-the-loop design. This work is not only technically innovative but, more importantly, provides practical tools for safe deployment of enterprise AI applications, demonstrating strong academic and commercial value.