BenchPress: A Human-in-the-Loop Annotation System for Rapid Text-to-SQL Benchmark Curation

Wenz, Bouattour, Yang et al.

Large language models (LLMs) have been successfully applied to many tasks, including text-to-SQL generation. However, much of this work has focused on publicly available datasets, such as Fiben, Spider, and Bird. Our earlier work showed that LLMs are much less effective in querying large private enterprise data warehouses and released Beaver, the first private enterprise text-to-SQL benchmark. To create Beaver, we leveraged SQL logs, which are often readily available. However, manually annotating these logs to identify which natural language questions they answer is a daunting task. Asking database administrators, who are highly trained experts, to take on additional work to construct and validate corresponding natural language utterances is not only challenging but also quite costly. To address this challenge, we introduce BenchPress, a human-in-the-loop system designed to accelerate the creation of domain-specific text-to-SQL benchmarks. Given a SQL query, BenchPress uses retrieval-augmented generation (RAG) and LLMs to propose multiple natural language descriptions. Human experts then select, rank, or edit these drafts to ensure accuracy and domain alignment. We evaluated BenchPress on annotated enterprise SQL logs, demonstrating that LLM-assisted annotation drastically reduces the time and effort required to create high-quality benchmarks. Our results show that combining human verification with LLM-generated suggestions enhances annotation accuracy, benchmark reliability, and model evaluation robustness. By streamlining the creation of custom benchmarks, BenchPress offers researchers and practitioners a mechanism for assessing text-to-SQL models on a given domain-specific workload. BenchPress is freely available via our public GitHub repository at https://github.com/fabian-wenz/enterprise-txt2sql and is also accessible on our website at http://dsg-mcgraw.csail.mit.edu:5000.

Basic Information

Paper ID: 2510.13853
Title: BenchPress: A Human-in-the-Loop Annotation System for Rapid Text-to-SQL Benchmark Curation
Authors: Fabian Wenz (TU Munich & MIT), Omar Bouattour (TU Munich & MIT), Devin Yang (MIT), Justin Choi (MIT), Cecil Gregg (MIT), Nesime Tatbul (Intel Labs & MIT), Çağatay Demiralp (AWS AI Labs & MIT)
Classification: cs.CL, cs.AI, cs.DB, cs.HC
Venue: CIDR 2026 (16th Annual Conference on Innovative Data Systems Research)
Paper Link: https://arxiv.org/abs/2510.13853

Abstract

Large Language Models (LLMs) have been successfully applied to numerous tasks, including text-to-SQL generation. However, most research focuses on public datasets such as Fiben, Spider, and Bird. Prior work by the authors demonstrated that LLMs exhibit significantly degraded performance when querying large private enterprise data warehouses, leading to the publication of Beaver, the first private enterprise text-to-SQL benchmark. To address the challenges of manual annotation of SQL logs, this paper proposes BenchPress, a human-in-the-loop system designed to accelerate the creation of domain-specific text-to-SQL benchmarks. The system employs Retrieval-Augmented Generation (RAG) and LLMs to generate multiple natural language descriptions for SQL queries, which human experts subsequently select, rank, or edit to ensure accuracy and domain alignment. Experimental results demonstrate that BenchPress significantly reduces the time and effort required to create high-quality benchmarks.

Research Background and Motivation

Core Problems

Gap between public benchmarks and enterprise reality: While LLMs perform well on public datasets such as Spider, Bird, and Fiben, execution accuracy on enterprise data warehouses drops dramatically (Figure 1 of the paper shows it falling from above 90% to near 0%)
Difficulty in annotating enterprise SQL logs: Manually creating corresponding natural language questions for SQL queries is both time-consuming and expensive, requiring participation from skilled database administrators
Domain-specific challenges: Enterprise data exhibits complex schemas, domain-specific terminology, privacy constraints, and other distinctive characteristics

Significance

Enterprises require performance evaluation of text-to-SQL models on private data before deployment
Prevention of deployment failures caused by domain mismatch
Support for model domain adaptation and fine-tuning strategy optimization

Limitations of Existing Approaches

Public benchmarks lack enterprise-specific complexity (schema ambiguity, domain terminology, etc.)
Fully manual annotation is costly and inefficient
General-purpose LLMs lack domain context and structured support

Core Contributions

Proposes the BenchPress system: The first human-in-the-loop annotation system specifically designed for rapid creation of domain-specific text-to-SQL benchmarks
Innovative workflow design: A modular architecture combining Retrieval-Augmented Generation (RAG), query decomposition, and human feedback
Comprehensive user study: Comparative experiments demonstrating BenchPress's advantages in annotation accuracy, efficiency, and semantic fidelity
Open-source tool: Provides a readily usable system supporting multiple public benchmarks and enterprise data

Methodology Details

Task Definition

Input: SQL query + database schema + optional historical annotation examples
Output: Corresponding natural language description
Constraints: Maintain semantic accuracy, domain terminology consistency, and privacy protection
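
A minimal sketch of how this input/output contract could be represented in code. The class name, field names, and sample query are hypothetical and are not taken from the BenchPress codebase.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AnnotationTask:
    """One SQL-to-NL annotation unit (all names here are illustrative)."""
    sql: str                                    # input: the logged SQL query
    schema: str                                 # input: schema text for the tables involved
    examples: list[tuple[str, str]] = field(default_factory=list)  # optional (sql, nl) pairs
    description: Optional[str] = None           # output: natural language description

    def is_annotated(self) -> bool:
        """True once a non-empty description has been recorded."""
        return bool(self.description and self.description.strip())


task = AnnotationTask(
    sql="SELECT COUNT(*) FROM enrollments WHERE term = '2024FA'",
    schema="CREATE TABLE enrollments (student_id INT, course_id INT, term TEXT)",
)
before = task.is_annotated()
task.description = "How many enrollments were recorded in the fall 2024 term?"
after = task.is_annotated()
print(before, after)
```

Keeping the historical examples optional mirrors the task definition: a new project may start with no prior annotations at all.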

System Architecture

One-time Setup Phase

Project setup: Select or create an annotation project for specific enterprise workloads
Data ingestion: Upload SQL logs and schema files, or select from supported public benchmarks
Task configuration: Choose annotation direction (currently supports SQL-to-NL) and language model

Iterative Annotation Loop

Query decomposition (optional): Rewrite nested SQL queries as a series of Common Table Expressions (CTEs)
Context retrieval: Use dense vector embeddings (e.g., Sentence-BERT) to retrieve semantically similar examples and relevant table schemas
Candidate generation: LLM generates four candidate natural language descriptions based on retrieved context
Reassembly (optional): Merge sub-query-level descriptions into complete query explanations
Human feedback: Annotators select, rank, edit, or discard LLM outputs
Review and export: Assess output quality and export as benchmark format
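
The loop above can be outlined as a small function that wires the stages together. This is an illustrative sketch only: the function names and the stand-in retriever, generator, and reviewer are assumptions made so the example runs without an LLM or a user interface, and the optional decomposition and reassembly steps are left out.

```python
from typing import Callable, Optional

Retriever = Callable[[str], list[str]]
Generator = Callable[[str, list[str], int], list[str]]
Reviewer = Callable[[str, list[str]], Optional[str]]


def annotate_query(sql: str, retrieve: Retriever, generate: Generator,
                   review: Reviewer, n_candidates: int = 4) -> Optional[str]:
    """One pass of the loop: retrieve context, draft candidates, ask a human."""
    context = retrieve(sql)
    candidates = generate(sql, context, n_candidates)
    return review(sql, candidates)  # None means every draft was discarded


# Stand-in components (hypothetical) so the sketch is runnable as-is.
def fake_retrieve(sql: str) -> list[str]:
    return ["SQL: SELECT name FROM staff / Description: List all staff names."]


def fake_generate(sql: str, context: list[str], n: int) -> list[str]:
    return [f"Draft {i + 1} describing: {sql}" for i in range(n)]


def pick_first(sql: str, candidates: list[str]) -> Optional[str]:
    return candidates[0] if candidates else None


result = annotate_query("SELECT COUNT(*) FROM courses",
                        fake_retrieve, fake_generate, pick_first)
print(result)
```

Passing the stages in as callables keeps the human review step swappable, which is the part of the workflow that a real deployment would replace with an interactive interface.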

Technical Innovations

Retrieval-Augmented Generation (RAG)

Use dense vector search to retrieve semantically similar SQL queries and their annotations
Embed examples in prompts to provide realistic expression patterns and schema usage guidance
Balance informativeness with prompt efficiency by selecting top-k retrieval examples
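
The retrieval step can be illustrated with a small top-k example. To stay dependency-free, this sketch swaps the dense embeddings described above for a toy bag-of-words vector; the function names, the sample corpus, and the prompt layout are all assumptions, not the system's actual implementation.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would use a dense sentence encoder."""
    return Counter(re.findall(r"[a-z_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def top_k(query_sql: str, annotated: list[tuple[str, str]], k: int = 2) -> list[tuple[str, str]]:
    """Return the k annotated (sql, description) pairs most similar to query_sql."""
    q = embed(query_sql)
    ranked = sorted(annotated, key=lambda pair: cosine(q, embed(pair[0])), reverse=True)
    return ranked[:k]


def build_prompt(query_sql: str, examples: list[tuple[str, str]]) -> str:
    """Place retrieved examples before the new query as few-shot context."""
    shots = "\n".join(f"SQL: {s}\nDescription: {d}" for s, d in examples)
    return f"{shots}\nSQL: {query_sql}\nDescription:"


corpus = [
    ("SELECT name FROM students WHERE gpa > 3.5", "List students with a GPA above 3.5."),
    ("SELECT COUNT(*) FROM buildings", "How many buildings are there?"),
    ("SELECT name FROM students WHERE year = 2", "List second-year students."),
]
query = "SELECT name FROM students WHERE gpa < 2.0"
examples = top_k(query, corpus, k=2)
prompt = build_prompt(query, examples)
print(prompt)
```

A small k keeps the prompt short while still showing the model how similar queries over the same tables were phrased.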

Query Decomposition Strategy

Decompose structurally complex nested queries
Generate natural language descriptions for sub-queries independently before reassembly
Reduce cognitive load and improve annotation accuracy
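
As a hand-written illustration of the idea, the sketch below rewrites one nested query as a CTE and checks with Python's built-in sqlite3 module that both forms return the same rows. The table, data, and queries are invented for the example; how BenchPress performs the rewrite automatically is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE enrollments (student_id INTEGER, course_id INTEGER, grade REAL);
    INSERT INTO enrollments VALUES (1, 10, 3.7), (1, 11, 3.9), (2, 10, 2.8), (3, 11, 3.5);
""")

# Original form: the aggregation is buried inside a subquery.
nested = """
    SELECT student_id FROM (
        SELECT student_id, AVG(grade) AS avg_grade
        FROM enrollments GROUP BY student_id
    ) AS t
    WHERE avg_grade >= 3.5
    ORDER BY student_id
"""

# Decomposed form: each named step can be described on its own.
decomposed = """
    WITH student_avg AS (          -- step 1: average grade per student
        SELECT student_id, AVG(grade) AS avg_grade
        FROM enrollments GROUP BY student_id
    )
    SELECT student_id              -- step 2: keep students averaging 3.5 or higher
    FROM student_avg
    WHERE avg_grade >= 3.5
    ORDER BY student_id
"""

nested_rows = conn.execute(nested).fetchall()
cte_rows = conn.execute(decomposed).fetchall()
print(nested_rows == cte_rows, cte_rows)
```

Each CTE gives the annotator (or the LLM) a named, self-contained unit to describe before the descriptions are merged back into one explanation.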

Human-in-the-Loop Design

Structured iterative review process ensures enterprise quality standards
Support prompt optimization and feedback-driven improvement loops
Follow responsible AI design principles aligned with Google PAIR guidelines

Experimental Setup

Datasets

Beaver: The first private enterprise text-to-SQL benchmark based on SQL logs from MIT and other institutions, containing 300+ schemas and nearly 4,000 queries
Bird: A large-scale public database benchmark
User study set: 30 anonymized SQL queries drawn from the Beaver and Bird datasets

Evaluation Metrics

Annotation accuracy: Manual verification of NL description fidelity to SQL queries
Annotation latency: Total annotation time per participant
Semantic fidelity: Assessed through back-translation tasks using a 5-level rating scale

Comparison Methods

BenchPress group: Uses complete BenchPress interface
Manual group: Provided only with schema files and logs, no LLM support
Generic LLM group: Uses standard ChatGPT interface, no RAG support

Implementation Details

18 participants stratified by SQL proficiency into advanced and non-advanced levels
Balanced Latin square design ensures counterbalancing
Each participant annotates the same 30 SQL queries
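
For three conditions, a balanced Latin square needs six condition orders (the square plus its mirror image), which divides evenly among 18 participants. The sketch below builds such a design with the standard Williams construction. Assigning orders by participant index is an assumption for illustration; the summary does not state how the study mapped participants to orders.

```python
from collections import Counter

CONDITIONS = ["BenchPress", "Generic LLM", "Manual"]


def williams_design(n: int) -> list[list[int]]:
    """Balanced Latin square: first row 0, 1, n-1, 2, n-2, ..., then cyclic shifts.

    For odd n the mirrored rows are appended so that every condition follows
    every other condition equally often.
    """
    first, lo, hi = [0], 1, n - 1
    while lo <= hi:
        first.append(lo)
        lo += 1
        if lo <= hi:
            first.append(hi)
            hi -= 1
    rows = [[(c + shift) % n for c in first] for shift in range(n)]
    if n % 2 == 1:
        rows += [row[::-1] for row in rows]
    return rows


orders = [[CONDITIONS[i] for i in row] for row in williams_design(len(CONDITIONS))]

# Hypothetical assignment: participant p receives order p mod 6.
assignment = {p: orders[p % len(orders)] for p in range(18)}
per_order = Counter(tuple(order) for order in assignment.values())

for order in orders:
    print(" -> ".join(order), "| participants:", per_order[tuple(order)])
```

With six orders and 18 participants, each order is used three times, so no tool systematically benefits from being seen first or last.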

Experimental Results

Main Results

Annotation Accuracy

Method	Beaver	Bird	Overall
BenchPress	86.1%	100.0%	93.0%
Generic LLM	66.2%	100.0%	83.1%
Manual	60.1%	87.8%	73.9%

Annotation Latency

Method	Beaver	Bird	Total
BenchPress	16.1 min	12.0 min	28.1 min
Generic LLM	16.2 min	15.8 min	32.0 min
Manual	102.1 min	82.8 min	183.9 min
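
To make the size of the latency gap concrete, the per-dataset figures from the table can be turned into speedup ratios. This is simple arithmetic on the reported numbers; the dictionary layout and function name are illustrative.

```python
# Annotation latency in minutes, copied from the per-dataset columns above.
latency_min = {
    "BenchPress": {"Beaver": 16.1, "Bird": 12.0},
    "Generic LLM": {"Beaver": 16.2, "Bird": 15.8},
    "Manual": {"Beaver": 102.1, "Bird": 82.8},
}


def speedup(baseline: str, method: str, dataset: str) -> float:
    """How many times faster `method` is than `baseline` on `dataset`."""
    return latency_min[baseline][dataset] / latency_min[method][dataset]


ratios = {
    (baseline, dataset): round(speedup(baseline, "BenchPress", dataset), 1)
    for baseline in ("Manual", "Generic LLM")
    for dataset in ("Beaver", "Bird")
}
for (baseline, dataset), ratio in ratios.items():
    print(f"BenchPress vs {baseline} on {dataset}: {ratio}x")
```

On these figures BenchPress is roughly six to seven times faster than manual annotation on both datasets, while its advantage over the generic LLM shows up mainly on Bird.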

Back-translation Fidelity

BenchPress produced the highest proportion of completely correct (Level 5) outputs in the 5-level back-translation assessment, indicating that its descriptions preserved the meaning of the original queries most clearly.

Experimental Findings

Tool effectiveness: BenchPress matches or outperforms the comparison methods on every reported metric (it ties with the generic LLM at 100% accuracy on Bird)
Dataset complexity impact: Performance differences between tools are more pronounced on complex enterprise datasets (Beaver)
Domain adaptability: BenchPress excels at handling enterprise-specific terminology and complex schemas

Related Work

Text-to-SQL Benchmarks

Public benchmarks: Spider, Bird, Fiben, and others advance general text-to-SQL tasks
Enterprise benchmarks: Beaver introduces enterprise-level complexity for the first time, exposing LLM difficulties with heterogeneous schemas

LLM Applications in SQL Generation

Codex, GPT-4, DeepSeek, and others perform strongly on public datasets
However, performance degrades significantly in domain-specific or enterprise environments

Annotation Systems and Tools

Existing systems primarily target public or synthetic data
BenchPress specifically supports human-in-the-loop workflows for private enterprise logs

Conclusions and Discussion

Main Conclusions

BenchPress significantly improves efficiency and quality in creating domain-specific text-to-SQL benchmarks
Human-in-the-loop approaches outperform purely automated or purely manual methods in handling enterprise data complexity
Public benchmarks inadequately reflect the structural and linguistic complexity of enterprise SQL logs

Limitations

Current system primarily focuses on SQL-to-text annotation
Requires domain expert participation, incurring some human cost
Decomposition strategies may be insufficient for extremely complex nested queries

Future Directions

Bidirectional annotation: Integrate text-to-SQL generation to support iterative validation
Robustness assessment: Systematically rephrase natural language queries in existing benchmarks
Automation enhancement: Further reduce manual intervention requirements

In-Depth Evaluation

Strengths

High practical value: Addresses real pain points in enterprise text-to-SQL model deployment
Strong methodological innovation: Cleverly combines RAG, query decomposition, and human-in-the-loop collaboration
Rigorous experimental design: Well-designed comparative experiments with comprehensive evaluation dimensions
Open-source contribution: Provides readily usable tools and resources

Limitations

Limited user study scale: Sample size of 18 participants is relatively small
Domain generalizability: Primarily validated in education and technology sectors; applicability to other industries remains to be verified
Insufficient cost analysis: Lacks detailed cost-benefit analysis

Impact

Academic contribution: Provides new methodology for enterprise AI application evaluation
Practical value: Directly addresses real industry needs
Reproducibility: Open-source code and detailed documentation support reproduction and extension

Applicable Scenarios

Enterprises need to evaluate text-to-SQL model performance on private data
Research institutions construct domain-specific text-to-SQL benchmarks
Data teams optimize model deployment and fine-tuning strategies

References

This paper cites 21 relevant references covering key areas including text-to-SQL benchmarks, LLM applications, annotation systems, and enterprise data challenges, providing a solid theoretical foundation for the research.

Summary: BenchPress is a system with significant practical value that effectively addresses efficiency and quality challenges in creating enterprise-level text-to-SQL benchmarks through innovative human-in-the-loop design. This work is not only technically innovative but, more importantly, provides practical tools for safe deployment of enterprise AI applications, demonstrating strong academic and commercial value.