2025-11-20T20:49:21.880729

LitE-SQL: A Lightweight and Efficient Text-to-SQL Framework with Vector-based Schema Linking and Execution-Guided Self-Correction

Piao, Lee, Park

The Text-to-SQL task translates natural language questions into SQL queries, enabling intuitive database interaction for non-experts. While recent methods leveraging Large Language Models (LLMs) achieve strong performance, their reliance on proprietary models raise concerns about deployment feasibility and data privacy. In this work, we introduce LitE-SQL, a Lightweight and Efficient framework with two components: (i) a Schema Retriever that performs efficient schema linking using a vector database of pre-computed schema embeddings, and (ii) a SQL Generator fine-tuned in two stages-supervised fine-tuning followed by execution-guided reinforcement-enabling self-correction without costly multi-candidate generation. On BIRD, LitE-SQL achieves 72.10% execution accuracy, and on Spider 1.0 it reaches 88.45%, demonstrating comparable or superior performance to LLM-based methods despite using 2x to 30x fewer parameters. Our findings demonstrate that high-quality Text-to-SQL generation is feasible with lightweight models, offering a practical solution for privacy-sensitive and resource-constrained settings.

academic

LitE-SQL: A Lightweight and Efficient Text-to-SQL Framework with Vector-based Schema Linking and Execution-Guided Self-Correction

Basic Information

Paper ID: 2510.09014
Title: LitE-SQL: A Lightweight and Efficient Text-to-SQL Framework with Vector-based Schema Linking and Execution-Guided Self-Correction
Authors: Shengmin Piao, Jieun Lee, Sanghyun Park (Yonsei University)
Category: cs.CL (Computational Linguistics)
Publication Date: October 2024
Paper Link: https://arxiv.org/abs/2510.09014

Abstract

The Text-to-SQL task converts natural language questions into SQL queries, providing an intuitive database interaction method for non-expert users. While Large Language Model (LLM)-based approaches demonstrate strong performance, their dependence on proprietary models raises concerns about deployment feasibility and data privacy. This paper proposes LitE-SQL, a lightweight and efficient framework comprising two core components: (i) Schema Retriever, which performs efficient schema linking using a vector database with pre-computed schema embeddings; (ii) SQL Generator, which achieves self-correction through two-stage fine-tuning (supervised fine-tuning + execution-guided reinforcement learning) without expensive multi-candidate generation. On the BIRD dataset, LitE-SQL achieves 72.10% execution accuracy, and 88.45% on Spider 1.0, with performance comparable to or superior to LLM-based methods while using only 1/2 to 1/30 of their parameters.

Research Background and Motivation

Problem Definition

The Text-to-SQL task aims to convert natural language questions into corresponding SQL queries, lowering the barrier for non-professional users to access structured databases. This task has significant practical value but faces challenges in cross-domain generalization and complex query generation.

Limitations of Existing Methods

LLM Dependency Issue: Current mainstream methods rely on proprietary large models such as GPT-4 and Gemini, posing risks of data privacy leakage and incurring high deployment costs
Computational Resource Consumption: Inputting complete schema information causes context length to surge, and the quadratic complexity of self-attention mechanisms results in massive memory consumption
Multi-candidate Generation Overhead: Existing methods generate multiple candidate queries and select the optimal one, resulting in significant computational costs

Research Motivation

Addressing the above issues, this paper aims to develop a lightweight and efficient Text-to-SQL framework that maintains competitive performance while significantly reducing parameter count and computational cost, suitable for privacy-sensitive and resource-constrained scenarios.

Core Contributions

Proposes LitE-SQL Framework: The first schema linking method fully leveraging vector database-driven approaches, combined with a lightweight SQL generator
Innovative HN-SupCon Loss Function: Optimizes embedding space through supervised contrastive learning with hard negative sample filtering
Two-stage Training Strategy: Supervised fine-tuning + execution-guided reinforcement learning enables efficient self-correction
Significant Efficiency Gains: Achieves competitive performance on BIRD and Spider 1.0 datasets with parameters only 1/2 to 1/30 of existing methods

Methodology Details

Task Definition

Given a natural language question Q and database schema S, the Text-to-SQL task requires generating a SQL query such that its execution result on the target database is consistent with the gold-standard query.

Model Architecture

1. Schema Retriever

Core Design:

Encodes each column as a dense embedding containing column name, description, table name, and value descriptions
Pre-computes schema embeddings and stores them in a vector database
At inference time, only encodes the question and retrieves top-k relevant columns via cosine similarity

HN-SupCon Loss Function:

L_HN-SupCon = -1/B ∑(i=1 to B) log(e^(s(qi,pi)/τ) / Zi)

Zi = e^(s(qi,pi)/τ) + ∑(j=1 to Ni) mij * e^(s(qi,nij)/τ)

mij = {1 if qi⊙nij ≥ qi⊙pi - 0.1, 0 otherwise}

Where s(·,·) denotes cosine similarity, τ is the temperature parameter, and mij is a masking function that filters simple negative samples to focus on hard negatives—semantically similar but functionally irrelevant samples.

2. SQL Generator

Two-stage Training Strategy:

Stage 1: Supervised Fine-Tuning (SFT)

L_SFT(θ) = -log P(SQL | Q, S; θ)

Learns conditional mapping from natural language questions and schema information to SQL queries
Performs data augmentation by randomly sampling irrelevant schema information to ensure consistency between training and inference

Stage 2: Reinforcement Fine-Tuning (RFT) Uses Direct Preference Optimization (DPO):

L_RFT(πθ;π0) = L_DPO(y^w_i, y^l_i|xi) + αL_NLL(y^w_i|xi)

Constructs preference pairs based on execution results: successfully executed queries are preferred over failed ones
Combines error messages for self-correction training

Technical Innovations

Vector Database-Driven Schema Linking: Unlike existing methods that re-encode schema at each step, this approach only encodes the question, significantly improving efficiency
Hard Negative Sample Filtering Mechanism: The HN-SupCon loss focuses on distinguishing semantically similar but functionally irrelevant columns, improving retrieval quality
Execution-Guided Self-Correction: Leverages SQL execution feedback for reinforcement learning, avoiding computational overhead of multi-candidate generation

Experimental Setup

Datasets

BIRD: 95 large-scale databases, 37 professional domains, 9,376 training samples, 1,534 validation samples
Spider 1.0: 200 databases, 138 domains, 8,659 training samples, 1,034 validation samples, 2,147 test samples

Evaluation Metrics

Execution Accuracy (EX): Consistency between predicted SQL and gold-standard SQL execution results
True Positive Rate (TPR): Ratio of retrieved relevant columns to gold-standard relevant columns
False Positive Rate (FPR): Ratio of retrieved irrelevant columns to total retrieved columns
Schema Linking Recall (SLR): Proportion of queries where all relevant columns are completely retrieved

Baseline Methods

In-context Learning Methods: ChatGPT+CoT, DIN-SQL, DAIL-SQL, CHESS, CHASE-SQL, etc.
Fine-tuning Methods: CodeS, OmniSQL, DTS-SQL, Reasoning-SQL, etc.

Implementation Details

Embedding Model: Qwen3-0.6B-Embedding
SQL Generator: Qwen2.5-Coder (1.5B, 3B, 7B)
Vector Database: ChromaDB
Training Setup: 4 A100 GPUs, AdamW optimizer, LoRA adaptation

Experimental Results

Main Results

Method Category	Model	Parameters	BIRD(Dev) EX	Spider 1.0(Test) EX
In-context Learning
CHASE-SQL	Gemini 1.5	200B	73.01	87.60
MCS-SQL	GPT-4	175B	63.36	89.60
Fine-tuning Methods
Reasoning-SQL	Qwen2.5-Coder-14B	14B	72.29	81.43
LitE-SQL	Qwen2.5-Coder-7B	7B	72.10	88.45

Key Findings

Parameter Efficiency: 7B model outperforms most LLM methods with 175B-200B parameters
Cross-domain Generalization: Surpasses MCS-SQL by 8.74% on BIRD, trailing by only 1.15% on Spider
Consistent Performance: Achieves average improvements of 10.87% (BIRD) and 7.21% (Spider) compared to same-scale fine-tuning methods

Ablation Study

Component Configuration	BIRD EX	Spider EX	Improvement
Baseline (no retriever + generator)	39.31	61.61	-
+Schema Retriever	43.16	64.28	+3.85/+2.67
+SFT	58.21	83.56	+18.90/+21.95
+RFT	60.56	84.35	+21.25/+22.74

Schema Linking Performance Analysis

Comparison with baseline methods (subsampled BIRD dataset):

LitE-SQL: TPR=95.23%, FPR=80.28%, SLR=82.31%, EX=56.46%
CHESS: TPR=87.15%, FPR=8.27%, SLR=61.9%, EX=57.14%
CodeS: TPR=89.64%, FPR=74.16%, SLR=65.31%, EX=51.70%

Despite higher FPR, the SLR advantage compensates for the impact of false positives, achieving performance comparable to 200B models using only 0.6B parameters.

Self-Correction Effect Analysis

Diminishing Iteration Returns: First self-correction provides maximum improvement, with subsequent iterations showing diminishing gains
Error Type Improvement: Syntax errors, non-existent columns, non-existent tables, and other error types all decrease significantly
Scale Effect: Larger models benefit more in semantic alignment

Schema Linking Research

Early Methods: Classifier-based column ranking
LLM Methods: Multi-step prompting, multi-agent frameworks (CHESS)
This Paper's Innovation: First schema linking method fully based on vector databases

SQL Generation Research

In-context Learning: Structured prompting, few-shot learning, self-consistency
Fine-tuning Methods: Domain adaptation, data augmentation, task decomposition
This Paper's Contribution: Execution-guided reinforcement learning self-correction mechanism

Conclusions and Discussion

Main Conclusions

Lightweight Feasibility: Demonstrates that high-quality Text-to-SQL generation is achievable with lightweight models
Efficiency-Performance Balance: Maintains competitive performance while significantly reducing parameter count
Practical Value: Provides practical solutions for privacy-sensitive and resource-constrained scenarios

Limitations

Fixed k-value Issue: Retrieving a fixed number of columns inevitably introduces false positives
Semantic Error Detection: Current self-correction mechanisms primarily handle syntax errors, with limited effectiveness on semantically correct but logically flawed queries

Future Directions

Dynamic Retrieval Strategy: Adaptively adjust the number of retrieved columns based on question complexity
Semantic Error Detection: Develop mechanisms to capture semantic errors
Multimodal Extension: Incorporate table content and schema information

In-depth Evaluation

Strengths

Strong Innovation: First systematic application of vector databases to schema linking in Text-to-SQL
High Practical Value: Addresses privacy and deployment issues of LLM-based methods
Comprehensive Experiments: Thorough ablation studies and error analysis
Solid Technical Foundation: Well-designed HN-SupCon loss and two-stage training strategy

Weaknesses

Simple Retrieval Strategy: Fixed k-value retrieval may not be optimal
Limited Error Types: Self-correction primarily targets errors detectable through execution
Dataset Limitations: Validation primarily on English datasets; multilingual generalization capability unknown

Impact

Academic Value: Provides new perspectives for lightweight Text-to-SQL research
Practical Value: Applicable to edge computing and privacy protection scenarios
Reproducibility: Based on open-source models, easy to reproduce and extend

Applicable Scenarios

Resource-Constrained Environments: Edge devices, mobile applications
Privacy-Sensitive Scenarios: Enterprise databases, medical and financial domains
Real-time Applications: Interactive query systems requiring rapid response

References

The paper cites important works in the Text-to-SQL domain, including:

Original papers of Spider and BIRD benchmark datasets
Representative LLM-based methods (DIN-SQL, CHESS, CHASE-SQL, etc.)
Representative fine-tuning method works (CodeS, OmniSQL, etc.)
Related technical foundations (DPO, LoRA, contrastive learning, etc.)