"Text2SQL" how LLM's enable this and why this is transformative for Businesses

"Text2SQL" how LLM's enable this and why this is transformative for Businesses

Today I am going to write about "Text2SQL", what exactly is text2sql is, why this is needed and how this is hugely transformative for businesses.

So, let's dive in.

What is text2sql?

Text2SQL in one line is a feature to generate SQL Queries from the questions asked in natural language. These SQL queries can then be fired against the Database to get the desired results.

text2sql has the ability to generate pretty complex queries, this based on the questions asked. Joining multiple tables from various schema and what not. text2sql is quite powerful from a business results generation perspective.

But how does text2sql works and where does LLM comes into picture? Isn't this question coming to your mind?

text2sql can be achieved through basically two ways.

  • Fine tuning/Instruction Tuning (Supervised fine tuning) some existing llm on the textsql data and then using this llm to generate SQL queries.
  • Prompt Engineering llms to generate SQL Queries. Remember llm's are zero shot learners.

Let me explain text2sql by an example of fine tuning a llm ontext2sql task (option 1 as above)

Three steps are required for this.

  • Get the Database.
  • Generate Instructions Data for LLM for fine tuning using the database
  • Fine tune llm
  • Generate the queries using the fine-tuned LLM and execute these generated queries on your DB. (this step can be through LangChain or LLamaIndex Framework)

For fine tuning llm on text2sql data there are lot of data bases available that can be used. We will use "Spider"(Spider: Yale Semantic Parsing and Text-to-SQL Challenge).

Spider DB has lot of DB schemas on various topics and also has corresponding data for the schema tables. We can use anyone schema.

Step 1 - Create Database (using DDL) using this schema.

create table Student (
        StuID       INTEGER PRIMARY KEY,
        LName    VARCHAR(12),
        Fname     VARCHAR(12),
        Age          INTEGER,
        Sex          VARCHAR(1),
        Major      INTEGER,
        FacID      INTEGER,
  );

create table Faculty (
       FacID 	       INTEGER PRIMARY KEY,
       Lname		VARCHAR(15),
       Fname		VARCHAR(15),
       Rank		        VARCHAR(15),
       Sex		        VARCHAR(1),
);
insert into Faculty  values ( 1082, 'Giuliano', 'Mark', 'Instructor', 'M'');
insert into Student values ( 1001, 'Smith', 'Linda', 18, 'F', 600,1082);        

Step 2 - Generate text2sql samples from the above DB. This to be used to fine tune llm.

For this we can use any llm such as llama3 anyone

Sample format:

{

"instruction": "Act as SQL terminal in front of an example database. You need only to return the SQL command to me. Below is an instruction that describes a task. Write a response that appropriately completes the request.

## Instruction:

[Database description]",

"input": "### Input: [Natural language question]

### Response:",

"output": "[Corresponding SQL query]"

}

Sample Data for above instruction (for llm to generate text2sql data samples):

{

"instruction": "You are a powerful text-to-SQL model. Your task is to generate SQL queries based on the following schema for a faculty and student database:

\n\nCREATE TABLE Faculty ().

\n\nCREATE TABLE Student (FOREIGN KEY (customer_key) REFERENCES customer(customer_key)\n);

"input": "List the full names of students who have multiple faculties assigned.",

"output": "SELECT DISTINCT LName , Fname FROM Student s \n JOIN Faculty f ON s.FacID = f.FacID AND s.FacID IS NOT NULL;"

}

The database contains two tables: student and faculty.

Please generate 100 samples by taking tables student and faculty. Each sample should be in the format of:

"instruction"

"input": as natural language question on the data

"output": as query answering above language question on the data

=============================================================

If you note carefully, we have instructions and we have instruction samples data for LLM to generate text2sql synthetic samples for later fine tuning of llm.

Using the above prompt, LLM generates following samples for fine tuning the llm.

[

{ "instruction": "You are a powerful text-to-SQL model. Your task is to generate SQL queries based on the following schema for a faculty and student database:"

"input": "List the full names of all students born after January 1, 1990.",

"output": "SELECT full_name FROM students WHERE Age> '1990-01-01';"},

]

=============================================================

The above is example of one such data sample generated. We can have fairly complex sample query dataset generated using above prompt. Just for the sake of the article I am restricting this to one sample

Step 3 - Fine tune llm using above sample dataset.

  • Do a Supervised fine tuning of any LLM (such as llama3, Mistral etc.) on the above dataset
  • Leverage Hugging Face SFT trainer. Leverage QLora

I am not covering fine tuning in many details here in this article as this is pretty straightforward process and not focus of this article.

Step 4 - Integrate this fine-tuned llm with Langchain or LlamaIndex basically. (This is a KEY step towards business transformation.)

  • Create Tables Student and Faculty in DB in Langchain
  • Create the SQL query generation chain using the language model and DB as Langchain create_sql_query_chain(llm, DB).

  • Create a template for final response for the query generated by LLM and final Business response to be generated by LLM. - template = """Based on the table schema, question, sql query, and sql response, write a natural language response:}

{schema}:

Question: {question}

SQL Query: {query}

SQL Response: {response}"""

Use the above prompt for generating the response prompt_response = ChatPromptTemplate.from_template(template)

Step 5 above generates a business response for the query. This is where llms shine.

Let's now move to final section to cover how text2sql is hugely transformative for businesses.

3 key points here.

  • Reduction in effort for sql queries generation. Assume you are an enterprise /b2c platform business how much effort will this save. llms generating queries for you. Massive saving in hours and hence cost
  • Better Quality Insights generation as llm learns the quality of insight generation will see massive improvement
  • Enabling Real time Complex insights for businesses as complex queries generation would be in real time by llm for complex business questions and not waiting on data teams to first write queries (time taking) and then respond.

In summary text2sql is massive from a business transformation aspect.

Thanks All. Hope you had a good read.

Disclaimer: Opinion / Views expressed above are the author's personal and has no bearing or affiliation to the authors current employer or any earlier/past employers.

Credit:

https://yia333.medium.com/enhancing-text-to-sql-with-a-fine-tuned-7b-llm-for-database-interactions-fa754dc2e992

Image Credit:

https://yia333.medium.com/enhancing-text-to-sql-with-a-fine-tuned-7b-llm-for-database-interactions-fa754dc2e992



要查看或添加评论,请登录

Nikhil Goel的更多文章

社区洞察

其他会员也浏览了