"Text2SQL" how LLM's enable this and why this is transformative for Businesses
Today I am going to write about "Text2SQL", what exactly is text2sql is, why this is needed and how this is hugely transformative for businesses.
So, let's dive in.
What is text2sql?
Text2SQL in one line is a feature to generate SQL Queries from the questions asked in natural language. These SQL queries can then be fired against the Database to get the desired results.
text2sql has the ability to generate pretty complex queries, this based on the questions asked. Joining multiple tables from various schema and what not. text2sql is quite powerful from a business results generation perspective.
But how does text2sql works and where does LLM comes into picture? Isn't this question coming to your mind?
text2sql can be achieved through basically two ways.
Let me explain text2sql by an example of fine tuning a llm ontext2sql task (option 1 as above)
Three steps are required for this.
For fine tuning llm on text2sql data there are lot of data bases available that can be used. We will use "Spider"(Spider: Yale Semantic Parsing and Text-to-SQL Challenge).
Spider DB has lot of DB schemas on various topics and also has corresponding data for the schema tables. We can use anyone schema.
Step 1 - Create Database (using DDL) using this schema.
create table Student (
StuID INTEGER PRIMARY KEY,
LName VARCHAR(12),
Fname VARCHAR(12),
Age INTEGER,
Sex VARCHAR(1),
Major INTEGER,
FacID INTEGER,
);
create table Faculty (
FacID INTEGER PRIMARY KEY,
Lname VARCHAR(15),
Fname VARCHAR(15),
Rank VARCHAR(15),
Sex VARCHAR(1),
);
insert into Faculty values ( 1082, 'Giuliano', 'Mark', 'Instructor', 'M'');
insert into Student values ( 1001, 'Smith', 'Linda', 18, 'F', 600,1082);
Step 2 - Generate text2sql samples from the above DB. This to be used to fine tune llm.
For this we can use any llm such as llama3 anyone
Sample format:
{
"instruction": "Act as SQL terminal in front of an example database. You need only to return the SQL command to me. Below is an instruction that describes a task. Write a response that appropriately completes the request.
## Instruction:
[Database description]",
"input": "### Input: [Natural language question]
### Response:",
"output": "[Corresponding SQL query]"
}
Sample Data for above instruction (for llm to generate text2sql data samples):
{
"instruction": "You are a powerful text-to-SQL model. Your task is to generate SQL queries based on the following schema for a faculty and student database:
\n\nCREATE TABLE Faculty ().
\n\nCREATE TABLE Student (FOREIGN KEY (customer_key) REFERENCES customer(customer_key)\n);
"input": "List the full names of students who have multiple faculties assigned.",
"output": "SELECT DISTINCT LName , Fname FROM Student s \n JOIN Faculty f ON s.FacID = f.FacID AND s.FacID IS NOT NULL;"
}
The database contains two tables: student and faculty.
Please generate 100 samples by taking tables student and faculty. Each sample should be in the format of:
"instruction"
"input": as natural language question on the data
"output": as query answering above language question on the data
=============================================================
If you note carefully, we have instructions and we have instruction samples data for LLM to generate text2sql synthetic samples for later fine tuning of llm.
Using the above prompt, LLM generates following samples for fine tuning the llm.
[
{ "instruction": "You are a powerful text-to-SQL model. Your task is to generate SQL queries based on the following schema for a faculty and student database:"
"input": "List the full names of all students born after January 1, 1990.",
"output": "SELECT full_name FROM students WHERE Age> '1990-01-01';"},
]
=============================================================
The above is example of one such data sample generated. We can have fairly complex sample query dataset generated using above prompt. Just for the sake of the article I am restricting this to one sample
Step 3 - Fine tune llm using above sample dataset.
I am not covering fine tuning in many details here in this article as this is pretty straightforward process and not focus of this article.
Step 4 - Integrate this fine-tuned llm with Langchain or LlamaIndex basically. (This is a KEY step towards business transformation.)
{schema}:
Question: {question}
SQL Query: {query}
SQL Response: {response}"""
Use the above prompt for generating the response prompt_response = ChatPromptTemplate.from_template(template)
Step 5 above generates a business response for the query. This is where llms shine.
Let's now move to final section to cover how text2sql is hugely transformative for businesses.
3 key points here.
In summary text2sql is massive from a business transformation aspect.
Thanks All. Hope you had a good read.
Disclaimer: Opinion / Views expressed above are the author's personal and has no bearing or affiliation to the authors current employer or any earlier/past employers.
Credit:
Image Credit: