BenchPress: A Human-in-the-Loop Annotation System for Rapid Text-to-SQL Benchmark Curation

Wenz, Bouattour, Yang et al.

Large language models (LLMs) have been successfully applied to many tasks, including text-to-SQL generation. However, much of this work has focused on publicly available datasets, such as Fiben, Spider, and Bird. Our earlier work showed that LLMs are much less effective in querying large private enterprise data warehouses and released Beaver, the first private enterprise text-to-SQL benchmark. To create Beaver, we leveraged SQL logs, which are often readily available. However, manually annotating these logs to identify which natural language questions they answer is a daunting task. Asking database administrators, who are highly trained experts, to take on additional work to construct and validate corresponding natural language utterances is not only challenging but also quite costly. To address this challenge, we introduce BenchPress, a human-in-the-loop system designed to accelerate the creation of domain-specific text-to-SQL benchmarks. Given a SQL query, BenchPress uses retrieval-augmented generation (RAG) and LLMs to propose multiple natural language descriptions. Human experts then select, rank, or edit these drafts to ensure accuracy and domain alignment. We evaluated BenchPress on annotated enterprise SQL logs, demonstrating that LLM-assisted annotation drastically reduces the time and effort required to create high-quality benchmarks. Our results show that combining human verification with LLM-generated suggestions enhances annotation accuracy, benchmark reliability, and model evaluation robustness. By streamlining the creation of custom benchmarks, BenchPress offers researchers and practitioners a mechanism for assessing text-to-SQL models on a given domain-specific workload. BenchPress is freely available via our public GitHub repository at https://github.com/fabian-wenz/enterprise-txt2sql and is also accessible on our website at http://dsg-mcgraw.csail.mit.edu:5000.

Basic Information

Paper ID: 2510.13853
Title: BenchPress: A Human-in-the-Loop Annotation System for Rapid Text-to-SQL Benchmark Curation
Authors: Fabian Wenz (TU Munich & MIT), Omar Bouattour (TU Munich & MIT), Devin Yang (MIT), Justin Choi (MIT), Cecil Gregg (MIT), Nesime Tatbul (Intel Labs & MIT), Çağatay Demiralp (AWS AI Labs & MIT)
Categories: cs.CL, cs.AI, cs.DB, cs.HC
Venue: CIDR 2026 (16th Annual Conference on Innovative Data Systems Research)
Paper link: https://arxiv.org/abs/2510.13853

Summary

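Large language models (LLMs) have been applied successfully to many tasks, including text-to-SQL generation. Most of this work, however, has focused on public datasets such as Fiben, Spider, and Bird. The authors' earlier work showed that LLMs are markedly less effective when querying large private enterprise data warehouses, and it released Beaver, the first private enterprise text-to-SQL benchmark. To address the difficulty of manually annotating SQL logs, this paper presents BenchPress, a human-in-the-loop system designed to accelerate the creation of domain-specific text-to-SQL benchmarks. The system uses retrieval-augmented generation (RAG) and LLMs to generate multiple natural language descriptions for a SQL query; human experts then select, rank, or edit these drafts to ensure accuracy and domain alignment. Experiments show that BenchPress substantially reduces the time and effort needed to create high-quality benchmarks.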
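
The workflow described above (retrieve context, have an LLM draft several descriptions, let a human pick or edit one) can be sketched roughly as follows. This is a minimal illustration and not BenchPress's actual implementation: the example store `ANNOTATED`, the token-overlap retriever, the prompt layout, and the `fake_llm` stand-in are all assumptions made here so the sketch runs without any external service.

```python
import re
from typing import Callable, List, Optional, Tuple

# Hypothetical store of previously annotated (SQL, question) pairs.
ANNOTATED: List[Tuple[str, str]] = [
    ("SELECT COUNT(*) FROM students WHERE dept = 'EECS'",
     "How many students are in the EECS department?"),
    ("SELECT name FROM buildings WHERE floors > 10",
     "Which buildings have more than ten floors?"),
    ("SELECT AVG(salary) FROM employees GROUP BY dept",
     "What is the average salary in each department?"),
]


def tokens(sql: str) -> set:
    """Lowercased identifier/keyword tokens of a SQL string."""
    return set(re.findall(r"[a-z_]+", sql.lower()))


def retrieve(sql: str, store: List[Tuple[str, str]], k: int = 2):
    """Return the k stored pairs whose SQL has the highest Jaccard overlap.

    A real system would more likely use embeddings; token overlap is a
    stand-in chosen here to keep the sketch dependency-free.
    """
    query = tokens(sql)

    def jaccard(other_sql: str) -> float:
        other = tokens(other_sql)
        union = query | other
        return len(query & other) / len(union) if union else 0.0

    return sorted(store, key=lambda pair: jaccard(pair[0]), reverse=True)[:k]


def build_prompt(sql: str, examples: List[Tuple[str, str]]) -> str:
    """Few-shot prompt: retrieved pairs first, then the query to annotate."""
    shots = "\n".join(f"SQL: {s}\nQuestion: {q}" for s, q in examples)
    return f"{shots}\nSQL: {sql}\nQuestion:"


def propose(sql: str, llm: Callable[[str, int], str], n: int = 3) -> List[str]:
    """Ask the (assumed) LLM callable for n candidate descriptions."""
    prompt = build_prompt(sql, retrieve(sql, ANNOTATED))
    return [llm(prompt, i) for i in range(n)]


def review(candidates: List[str], choice: int,
           edited: Optional[str] = None) -> str:
    """Human step: accept a candidate as is, or override it with an edit."""
    return edited if edited is not None else candidates[choice]


def fake_llm(prompt: str, seed: int) -> str:
    """Placeholder for a real model call; returns a labeled dummy draft."""
    return f"Draft {seed} for: {prompt.splitlines()[-2]}"


query = "SELECT COUNT(*) FROM students WHERE dept = 'Math'"
drafts = propose(query, fake_llm)
final = review(drafts, choice=0,
               edited="How many students are in the Math department?")
```

The human edit always wins over the model draft in `review`, which mirrors the idea that the expert has the final say on accuracy and domain wording.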

Research Background and Motivation

Core Problems

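Gap between public benchmarks and enterprise reality: although LLMs perform well on public datasets such as Spider, Bird, and Fiben, their execution accuracy on enterprise data warehouses drops sharply (as shown in Figure 1, from 90%+ to nearly 0%).
Difficulty of annotating enterprise SQL logs: manually writing the natural language question that corresponds to each SQL query is time-consuming and expensive, and it requires highly skilled database administrators.
Domain-specific challenges: enterprise data is characterized by complex schemas, domain-specific terminology, privacy constraints, and similar factors.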
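
The execution accuracy mentioned in the first item is commonly computed by running the predicted and the gold query against the same database and comparing their results. The sketch below shows one such definition (order-insensitive comparison of result rows, with a failing prediction counted as a miss). The text above does not spell out the exact metric used for Figure 1, so this definition, the in-memory `emp` table, and the sample query pairs are illustrative assumptions.

```python
import sqlite3
from collections import Counter
from typing import List, Optional, Tuple


def run(conn: sqlite3.Connection, sql: str) -> Optional[Counter]:
    """Execute a query and return its rows as a multiset, or None on error."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def execution_accuracy(conn: sqlite3.Connection,
                       pairs: List[Tuple[str, str]]) -> float:
    """Fraction of (predicted, gold) pairs whose result multisets match."""
    hits = 0
    for predicted, gold in pairs:
        pred_rows = run(conn, predicted)
        gold_rows = run(conn, gold)
        # A prediction that fails to execute never counts as correct.
        if pred_rows is not None and pred_rows == gold_rows:
            hits += 1
    return hits / len(pairs) if pairs else 0.0


# Hypothetical toy database standing in for an enterprise warehouse.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE emp(id INTEGER, dept TEXT);
INSERT INTO emp VALUES (1, 'a'), (2, 'a'), (3, 'b');
""")

pairs = [
    # Same result: counts as a hit.
    ("SELECT COUNT(*) FROM emp WHERE dept='a'",
     "SELECT COUNT(*) FROM emp WHERE dept = 'a'"),
    # Different result: miss.
    ("SELECT COUNT(*) FROM emp",
     "SELECT COUNT(*) FROM emp WHERE dept = 'b'"),
    # Unknown column, so the prediction fails to execute: miss.
    ("SELECT nope FROM emp",
     "SELECT id FROM emp"),
]

accuracy = execution_accuracy(conn, pairs)  # 1 of 3 pairs match
```

Comparing result multisets rather than SQL strings lets semantically equivalent queries count as correct, though it can also credit a wrong query that happens to return the same rows on a small database.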