LitE-SQL: A Lightweight and Efficient Text-to-SQL Framework with Vector-based Schema Linking and Execution-Guided Self-Correction
Piao, Lee, Park
The Text-to-SQL task translates natural language questions into SQL queries, enabling intuitive database interaction for non-experts. While recent methods leveraging Large Language Models (LLMs) achieve strong performance, their reliance on proprietary models raises concerns about deployment feasibility and data privacy. In this work, we introduce LitE-SQL, a Lightweight and Efficient framework with two components: (i) a Schema Retriever that performs efficient schema linking using a vector database of pre-computed schema embeddings, and (ii) a SQL Generator fine-tuned in two stages (supervised fine-tuning followed by execution-guided reinforcement), enabling self-correction without costly multi-candidate generation. On BIRD, LitE-SQL achieves 72.10% execution accuracy, and on Spider 1.0 it reaches 88.45%, demonstrating comparable or superior performance to LLM-based methods despite using 2x to 30x fewer parameters. Our findings demonstrate that high-quality Text-to-SQL generation is feasible with lightweight models, offering a practical solution for privacy-sensitive and resource-constrained settings.
The Text-to-SQL task converts natural language questions into SQL queries, providing an intuitive database interaction method for non-expert users. While Large Language Model (LLM)-based approaches demonstrate strong performance, their dependence on proprietary models raises concerns about deployment feasibility and data privacy. This paper proposes LitE-SQL, a lightweight and efficient framework comprising two core components: (i) Schema Retriever, which performs efficient schema linking using a vector database with pre-computed schema embeddings; (ii) SQL Generator, which achieves self-correction through two-stage fine-tuning (supervised fine-tuning + execution-guided reinforcement learning) without expensive multi-candidate generation. On the BIRD dataset, LitE-SQL achieves 72.10% execution accuracy, and 88.45% on Spider 1.0, with performance comparable to or superior to LLM-based methods while using only 1/2 to 1/30 of their parameters.
The Text-to-SQL task aims to convert natural language questions into corresponding SQL queries, lowering the barrier for non-professional users to access structured databases. This task has significant practical value but faces challenges in cross-domain generalization and complex query generation.
LLM Dependency Issue: Current mainstream methods rely on proprietary large models such as GPT-4 and Gemini, posing risks of data privacy leakage and incurring high deployment costs
Computational Resource Consumption: Inputting complete schema information causes context length to surge, and the quadratic complexity of self-attention mechanisms results in massive memory consumption
Multi-candidate Generation Overhead: Existing methods generate multiple candidate queries and select the optimal one, resulting in significant computational costs
To address these issues, this paper develops a lightweight and efficient Text-to-SQL framework that maintains competitive performance while significantly reducing parameter count and computational cost, making it suitable for privacy-sensitive and resource-constrained scenarios.
Proposes the LitE-SQL Framework: The first approach to perform schema linking entirely through a vector database, combined with a lightweight SQL generator
Innovative HN-SupCon Loss Function: Optimizes embedding space through supervised contrastive learning with hard negative sample filtering
Given a natural language question Q and database schema S, the Text-to-SQL task requires generating a SQL query such that its execution result on the target database is consistent with the gold-standard query.
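The success criterion above can be illustrated with a minimal sketch using Python's built-in `sqlite3` module. The function name `execution_match`, the toy `singer` table, and the order-insensitive row comparison are assumptions made for illustration; the benchmarks' official evaluation scripts may compare results differently.

```python
import sqlite3

def execution_match(conn, predicted_sql, gold_sql):
    """Return True if both queries produce the same rows (order ignored)."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a prediction that fails to execute cannot match
    gold_rows = conn.execute(gold_sql).fetchall()
    # Sort by repr so that row order does not affect the comparison
    return sorted(predicted_rows, key=repr) == sorted(gold_rows, key=repr)

# Toy database (hypothetical schema, for illustration only)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE singer (id INTEGER, name TEXT, age INTEGER);
    INSERT INTO singer VALUES (1, 'Ann', 30), (2, 'Bob', 45), (3, 'Cy', 52);
""")

same = execution_match(
    conn,
    "SELECT name FROM singer WHERE age > 40",
    "SELECT name FROM singer WHERE age >= 45",
)
```

Here the two queries differ textually but return the same rows, so they count as a match under this criterion.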
Encodes each column as a dense embedding containing column name, description, table name, and value descriptions
Pre-computes schema embeddings and stores them in a vector database
At inference time, only encodes the question and retrieves top-k relevant columns via cosine similarity
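The retrieval steps above can be sketched in plain Python. This is a toy illustration, not the paper's implementation: the dictionary `column_index` stands in for the vector database, the three-dimensional vectors stand in for learned embeddings, and the names `cosine` and `retrieve_top_k` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def retrieve_top_k(question_vec, column_index, k):
    """Rank pre-computed column embeddings against the question embedding."""
    scored = [(cosine(question_vec, vec), name) for name, vec in column_index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]

# Pre-computed once, offline (toy values standing in for real embeddings)
column_index = {
    "singer.name": [0.9, 0.1, 0.0],
    "singer.age": [0.1, 0.9, 0.1],
    "concert.year": [0.0, 0.2, 0.9],
}

# At inference time only the question is encoded (toy vector here)
question_vec = [0.2, 0.8, 0.1]
top_columns = retrieve_top_k(question_vec, column_index, k=2)
```

Because the column embeddings are computed ahead of time, the per-question cost is one encoder pass plus a similarity search, regardless of how large the schema is.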
HN-SupCon Loss Function:
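\[
\mathcal{L}_{\text{HN-SupCon}} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{e^{s(q_i, p_i)/\tau}}{Z_i}
\]
\[
Z_i = e^{s(q_i, p_i)/\tau} + \sum_{j=1}^{N_i} m_{ij} \, e^{s(q_i, n_{ij})/\tau}
\]
\[
m_{ij} = \begin{cases} 1 & \text{if } q_i \odot n_{ij} \ge q_i \odot p_i - 0.1 \\ 0 & \text{otherwise} \end{cases}
\]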
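Where s(·,·) denotes cosine similarity, τ is the temperature parameter, and m_ij is a mask that filters out easy negative samples: a negative n_ij contributes to Z_i only if its score against the question q_i is no more than 0.1 below the score of the positive p_i. Training therefore focuses on hard negatives, that is, samples that are semantically similar but functionally irrelevant.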
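As a worked illustration of the loss, the following sketch computes it from pre-computed similarity scores. It assumes that the score used in the mask is the same similarity s(·,·) used in the exponent, and the temperature default `tau=0.05` is an arbitrary illustrative value; neither is confirmed by the text above. The function name `hn_supcon_loss` is hypothetical.

```python
import math

def hn_supcon_loss(pos_sims, neg_sims, tau=0.05, margin=0.1):
    """pos_sims[i] = s(q_i, p_i); neg_sims[i][j] = s(q_i, n_ij)."""
    total = 0.0
    for s_pos, negatives in zip(pos_sims, neg_sims):
        pos_term = math.exp(s_pos / tau)
        # Mask m_ij: keep only negatives scoring at least s_pos - margin
        hard_terms = [
            math.exp(s_neg / tau)
            for s_neg in negatives
            if s_neg >= s_pos - margin
        ]
        z = pos_term + sum(hard_terms)
        total += -math.log(pos_term / z)
    return total / len(pos_sims)

# Easy negatives are masked out, so they contribute nothing to the loss
loss_easy = hn_supcon_loss([0.8], [[0.2, 0.5]])

# A negative that scores as high as the positive is kept and penalized
loss_hard = hn_supcon_loss([0.8], [[0.8]])
```

In the first call both negatives fall more than 0.1 below the positive, so Z_i reduces to the positive term and the loss is zero. In the second call the single hard negative ties with the positive, giving a loss of log 2.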
Vector Database-Driven Schema Linking: Unlike existing methods that re-encode schema at each step, this approach only encodes the question, significantly improving efficiency
Hard Negative Sample Filtering Mechanism: The HN-SupCon loss focuses on distinguishing semantically similar but functionally irrelevant columns, improving retrieval quality
Execution-Guided Self-Correction: Leverages SQL execution feedback for reinforcement learning, avoiding computational overhead of multi-candidate generation
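The inference-time side of execution-guided self-correction can be sketched as a single-candidate loop: generate a query, execute it, and regenerate only if the database reports an error. This is a simplified illustration under stated assumptions, not the paper's training procedure. The callable `generate_sql`, the feedback string format, the `max_rounds` limit, and the stand-in `toy_generator` are all hypothetical.

```python
import sqlite3

def generate_with_self_correction(conn, question, generate_sql, max_rounds=2):
    """Generate one query and revise it using execution errors as feedback."""
    feedback = None
    sql = generate_sql(question, feedback)
    for _ in range(max_rounds):
        try:
            conn.execute(sql).fetchall()
            return sql  # executed without error: accept this query
        except sqlite3.Error as exc:
            # Pass the failed query and the database error back to the generator
            feedback = f"Previous SQL: {sql}\nExecution error: {exc}"
            sql = generate_sql(question, feedback)
    return sql  # last revision, returned without further checking

def toy_generator(question, feedback):
    """Stand-in for the fine-tuned model: fixes a typo once it sees feedback."""
    if feedback is None:
        return "SELECT nme FROM singer"
    return "SELECT name FROM singer"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (id INTEGER, name TEXT)")
final_sql = generate_with_self_correction(conn, "List all singer names.", toy_generator)
```

Only one query is in flight at any time, in contrast to approaches that sample many candidates and then select among them.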
Despite a higher false positive rate (FPR), the SLR advantage compensates for the impact of false positives, achieving performance comparable to 200B models using only 0.6B parameters.
Fixed k-value Issue: Retrieving a fixed number of columns inevitably introduces false positives
Semantic Error Detection: Current self-correction mechanisms primarily handle syntax errors, with limited effectiveness on queries that are syntactically valid but logically flawed