2025-11-20T20:49:21.880729

LitE-SQL: A Lightweight and Efficient Text-to-SQL Framework with Vector-based Schema Linking and Execution-Guided Self-Correction

Piao, Lee, Park
The Text-to-SQL task translates natural language questions into SQL queries, enabling intuitive database interaction for non-experts. While recent methods leveraging Large Language Models (LLMs) achieve strong performance, their reliance on proprietary models raise concerns about deployment feasibility and data privacy. In this work, we introduce LitE-SQL, a Lightweight and Efficient framework with two components: (i) a Schema Retriever that performs efficient schema linking using a vector database of pre-computed schema embeddings, and (ii) a SQL Generator fine-tuned in two stages-supervised fine-tuning followed by execution-guided reinforcement-enabling self-correction without costly multi-candidate generation. On BIRD, LitE-SQL achieves 72.10% execution accuracy, and on Spider 1.0 it reaches 88.45%, demonstrating comparable or superior performance to LLM-based methods despite using 2x to 30x fewer parameters. Our findings demonstrate that high-quality Text-to-SQL generation is feasible with lightweight models, offering a practical solution for privacy-sensitive and resource-constrained settings.
academic

LitE-SQL: A Lightweight and Efficient Text-to-SQL Framework with Vector-based Schema Linking and Execution-Guided Self-Correction

Basic Information

  • Paper ID: 2510.09014
  • Title: LitE-SQL: A Lightweight and Efficient Text-to-SQL Framework with Vector-based Schema Linking and Execution-Guided Self-Correction
  • Authors: Shengmin Piao, Jieun Lee, Sanghyun Park (Yonsei University)
  • Category: cs.CL (Computational Linguistics)
  • Publication Date: October 2024
  • Paper Link: https://arxiv.org/abs/2510.09014

Abstract

The Text-to-SQL task converts natural language questions into SQL queries, providing an intuitive database interaction method for non-expert users. While Large Language Model (LLM)-based approaches demonstrate strong performance, their dependence on proprietary models raises concerns about deployment feasibility and data privacy. This paper proposes LitE-SQL, a lightweight and efficient framework comprising two core components: (i) Schema Retriever, which performs efficient schema linking using a vector database with pre-computed schema embeddings; (ii) SQL Generator, which achieves self-correction through two-stage fine-tuning (supervised fine-tuning + execution-guided reinforcement learning) without expensive multi-candidate generation. On the BIRD dataset, LitE-SQL achieves 72.10% execution accuracy, and 88.45% on Spider 1.0, with performance comparable to or superior to LLM-based methods while using only 1/2 to 1/30 of their parameters.

Research Background and Motivation

Problem Definition

The Text-to-SQL task aims to convert natural language questions into corresponding SQL queries, lowering the barrier for non-professional users to access structured databases. This task has significant practical value but faces challenges in cross-domain generalization and complex query generation.

Limitations of Existing Methods

  1. LLM Dependency Issue: Current mainstream methods rely on proprietary large models such as GPT-4 and Gemini, posing risks of data privacy leakage and incurring high deployment costs
  2. Computational Resource Consumption: Inputting complete schema information causes context length to surge, and the quadratic complexity of self-attention mechanisms results in massive memory consumption
  3. Multi-candidate Generation Overhead: Existing methods generate multiple candidate queries and select the optimal one, resulting in significant computational costs

Research Motivation

Addressing the above issues, this paper aims to develop a lightweight and efficient Text-to-SQL framework that maintains competitive performance while significantly reducing parameter count and computational cost, suitable for privacy-sensitive and resource-constrained scenarios.

Core Contributions

  1. Proposes LitE-SQL Framework: The first schema linking method fully leveraging vector database-driven approaches, combined with a lightweight SQL generator
  2. Innovative HN-SupCon Loss Function: Optimizes embedding space through supervised contrastive learning with hard negative sample filtering
  3. Two-stage Training Strategy: Supervised fine-tuning + execution-guided reinforcement learning enables efficient self-correction
  4. Significant Efficiency Gains: Achieves competitive performance on BIRD and Spider 1.0 datasets with parameters only 1/2 to 1/30 of existing methods

Methodology Details

Task Definition

Given a natural language question Q and database schema S, the Text-to-SQL task requires generating a SQL query such that its execution result on the target database is consistent with the gold-standard query.

Model Architecture

1. Schema Retriever

Core Design:

  • Encodes each column as a dense embedding containing column name, description, table name, and value descriptions
  • Pre-computes schema embeddings and stores them in a vector database
  • At inference time, only encodes the question and retrieves top-k relevant columns via cosine similarity

HN-SupCon Loss Function:

L_HN-SupCon = -1/B ∑(i=1 to B) log(e^(s(qi,pi)/τ) / Zi)

Zi = e^(s(qi,pi)/τ) + ∑(j=1 to Ni) mij * e^(s(qi,nij)/τ)

mij = {1 if qi⊙nij ≥ qi⊙pi - 0.1, 0 otherwise}

Where s(·,·) denotes cosine similarity, τ is the temperature parameter, and mij is a masking function that filters simple negative samples to focus on hard negatives—semantically similar but functionally irrelevant samples.

2. SQL Generator

Two-stage Training Strategy:

Stage 1: Supervised Fine-Tuning (SFT)

L_SFT(θ) = -log P(SQL | Q, S; θ)
  • Learns conditional mapping from natural language questions and schema information to SQL queries
  • Performs data augmentation by randomly sampling irrelevant schema information to ensure consistency between training and inference

Stage 2: Reinforcement Fine-Tuning (RFT) Uses Direct Preference Optimization (DPO):

L_RFT(πθ;π0) = L_DPO(y^w_i, y^l_i|xi) + αL_NLL(y^w_i|xi)
  • Constructs preference pairs based on execution results: successfully executed queries are preferred over failed ones
  • Combines error messages for self-correction training

Technical Innovations

  1. Vector Database-Driven Schema Linking: Unlike existing methods that re-encode schema at each step, this approach only encodes the question, significantly improving efficiency
  2. Hard Negative Sample Filtering Mechanism: The HN-SupCon loss focuses on distinguishing semantically similar but functionally irrelevant columns, improving retrieval quality
  3. Execution-Guided Self-Correction: Leverages SQL execution feedback for reinforcement learning, avoiding computational overhead of multi-candidate generation

Experimental Setup

Datasets

  • BIRD: 95 large-scale databases, 37 professional domains, 9,376 training samples, 1,534 validation samples
  • Spider 1.0: 200 databases, 138 domains, 8,659 training samples, 1,034 validation samples, 2,147 test samples

Evaluation Metrics

  1. Execution Accuracy (EX): Consistency between predicted SQL and gold-standard SQL execution results
  2. True Positive Rate (TPR): Ratio of retrieved relevant columns to gold-standard relevant columns
  3. False Positive Rate (FPR): Ratio of retrieved irrelevant columns to total retrieved columns
  4. Schema Linking Recall (SLR): Proportion of queries where all relevant columns are completely retrieved

Baseline Methods

  • In-context Learning Methods: ChatGPT+CoT, DIN-SQL, DAIL-SQL, CHESS, CHASE-SQL, etc.
  • Fine-tuning Methods: CodeS, OmniSQL, DTS-SQL, Reasoning-SQL, etc.

Implementation Details

  • Embedding Model: Qwen3-0.6B-Embedding
  • SQL Generator: Qwen2.5-Coder (1.5B, 3B, 7B)
  • Vector Database: ChromaDB
  • Training Setup: 4 A100 GPUs, AdamW optimizer, LoRA adaptation

Experimental Results

Main Results

Method CategoryModelParametersBIRD(Dev) EXSpider 1.0(Test) EX
In-context Learning
CHASE-SQLGemini 1.5200B73.0187.60
MCS-SQLGPT-4175B63.3689.60
Fine-tuning Methods
Reasoning-SQLQwen2.5-Coder-14B14B72.2981.43
LitE-SQLQwen2.5-Coder-7B7B72.1088.45

Key Findings

  1. Parameter Efficiency: 7B model outperforms most LLM methods with 175B-200B parameters
  2. Cross-domain Generalization: Surpasses MCS-SQL by 8.74% on BIRD, trailing by only 1.15% on Spider
  3. Consistent Performance: Achieves average improvements of 10.87% (BIRD) and 7.21% (Spider) compared to same-scale fine-tuning methods

Ablation Study

Component ConfigurationBIRD EXSpider EXImprovement
Baseline (no retriever + generator)39.3161.61-
+Schema Retriever43.1664.28+3.85/+2.67
+SFT58.2183.56+18.90/+21.95
+RFT60.5684.35+21.25/+22.74

Schema Linking Performance Analysis

Comparison with baseline methods (subsampled BIRD dataset):

  • LitE-SQL: TPR=95.23%, FPR=80.28%, SLR=82.31%, EX=56.46%
  • CHESS: TPR=87.15%, FPR=8.27%, SLR=61.9%, EX=57.14%
  • CodeS: TPR=89.64%, FPR=74.16%, SLR=65.31%, EX=51.70%

Despite higher FPR, the SLR advantage compensates for the impact of false positives, achieving performance comparable to 200B models using only 0.6B parameters.

Self-Correction Effect Analysis

  • Diminishing Iteration Returns: First self-correction provides maximum improvement, with subsequent iterations showing diminishing gains
  • Error Type Improvement: Syntax errors, non-existent columns, non-existent tables, and other error types all decrease significantly
  • Scale Effect: Larger models benefit more in semantic alignment

Schema Linking Research

  1. Early Methods: Classifier-based column ranking
  2. LLM Methods: Multi-step prompting, multi-agent frameworks (CHESS)
  3. This Paper's Innovation: First schema linking method fully based on vector databases

SQL Generation Research

  1. In-context Learning: Structured prompting, few-shot learning, self-consistency
  2. Fine-tuning Methods: Domain adaptation, data augmentation, task decomposition
  3. This Paper's Contribution: Execution-guided reinforcement learning self-correction mechanism

Conclusions and Discussion

Main Conclusions

  1. Lightweight Feasibility: Demonstrates that high-quality Text-to-SQL generation is achievable with lightweight models
  2. Efficiency-Performance Balance: Maintains competitive performance while significantly reducing parameter count
  3. Practical Value: Provides practical solutions for privacy-sensitive and resource-constrained scenarios

Limitations

  1. Fixed k-value Issue: Retrieving a fixed number of columns inevitably introduces false positives
  2. Semantic Error Detection: Current self-correction mechanisms primarily handle syntax errors, with limited effectiveness on semantically correct but logically flawed queries

Future Directions

  1. Dynamic Retrieval Strategy: Adaptively adjust the number of retrieved columns based on question complexity
  2. Semantic Error Detection: Develop mechanisms to capture semantic errors
  3. Multimodal Extension: Incorporate table content and schema information

In-depth Evaluation

Strengths

  1. Strong Innovation: First systematic application of vector databases to schema linking in Text-to-SQL
  2. High Practical Value: Addresses privacy and deployment issues of LLM-based methods
  3. Comprehensive Experiments: Thorough ablation studies and error analysis
  4. Solid Technical Foundation: Well-designed HN-SupCon loss and two-stage training strategy

Weaknesses

  1. Simple Retrieval Strategy: Fixed k-value retrieval may not be optimal
  2. Limited Error Types: Self-correction primarily targets errors detectable through execution
  3. Dataset Limitations: Validation primarily on English datasets; multilingual generalization capability unknown

Impact

  1. Academic Value: Provides new perspectives for lightweight Text-to-SQL research
  2. Practical Value: Applicable to edge computing and privacy protection scenarios
  3. Reproducibility: Based on open-source models, easy to reproduce and extend

Applicable Scenarios

  1. Resource-Constrained Environments: Edge devices, mobile applications
  2. Privacy-Sensitive Scenarios: Enterprise databases, medical and financial domains
  3. Real-time Applications: Interactive query systems requiring rapid response

References

The paper cites important works in the Text-to-SQL domain, including:

  • Original papers of Spider and BIRD benchmark datasets
  • Representative LLM-based methods (DIN-SQL, CHESS, CHASE-SQL, etc.)
  • Representative fine-tuning method works (CodeS, OmniSQL, etc.)
  • Related technical foundations (DPO, LoRA, contrastive learning, etc.)