How to query your SQL data with AI

How quickly can you answer a question such as, “How many customers signed up last week?” For people who don’t like querying a database with SQL, there have typically been two ways to answer this question. The first is to have a set of pre-built reports that anticipate these types of questions. Once you find the right report, you apply some filtering and arrive at your answer. If your question is not covered by the existing reports, you need to request a new one. The other method is to use a query design tool that lets you link tables and create reports on the fly, assuming the data is available and you know how to use the tool. If you ask questions infrequently, you may forget how to use the tool and end up asking for help anyway.

In this article, I show how AI can solve these problems and provide insights that are beyond the reach of even the best reports.

SQL vs. RAG

Before jumping into the SQL discussion, it is important to mention another important way in which LLMs can be supplemented with external data. Retrieval Augmented Generation (RAG) is a two-step process that first converts data into vectors that can be searched for relevance, and then passes the selected content to the LLM along with the prompt. This method works well for unstructured data such as PDFs because the system looks for any information that may be related to the prompt and sends it to the LLM, which then decides which data to use in formulating the response. You can read more about this topic in my previous article on the difference between RAG and fine-tuning.

If you were to simply dump your SQL data into a vector database, a large number of records could match your prompt, and passing all of the matching records to the LLM would increase both response time and cost. Additionally, some questions can only be answered with SQL operations such as joins, grouping, and summary statistics. Luckily, there are two methods you can use to access SQL data using LLMs.

SQL Generation

The first method uses AI to generate SQL queries and then runs those queries against the data. A Large Language Model (LLM) interprets your prompt and creates a query based on knowledge of the database schema, and the query is then run against your data warehouse, as shown below:

Generating SQL queries using AI


In a typical setup, a data lake would store all of your raw data. An Extract, Transform, and Load (ETL) process would pull the critical data and load it into a SQL data warehouse. The data warehouse would hold user-friendly tables containing the most relevant information for business users. A prompt template would be set up that includes the schema of the data warehouse as well as a brief description of each table. This lets the LLM know what information is stored in each table and the proper syntax to use when creating the query.
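As an illustration, here is a minimal sketch of what such a prompt template might look like in Python. The table names, columns, and descriptions are hypothetical placeholders, not a schema from a real warehouse.

```python
# A hypothetical prompt template that embeds the warehouse schema and
# table descriptions so the LLM can generate valid SQL. The tables and
# columns below are illustrative placeholders.
PROMPT_TEMPLATE = """You are an assistant that writes read-only SQL queries.

Database schema:
  customers(customer_id, name, region, signup_date)
    -- One row per customer account.
  orders(order_id, customer_id, order_date, amount)
    -- One row per completed order.

Rules:
- Use only SELECT statements.
- Use standard ANSI SQL syntax.

Question: {question}
SQL:"""

def build_prompt(question: str) -> str:
    """Append the user's question to the schema-aware template."""
    return PROMPT_TEMPLATE.format(question=question)

if __name__ == "__main__":
    print(build_prompt("How many customers signed up last week?"))
```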

When the user asks a question, the question would be appended to the prompt template and sent to the LLM, which would then create a SQL query. The query would be automatically sent to the data warehouse and an answer provided. Safeguards would be needed, such as only allowing read-only commands and checking the data privileges of the user. Read more about prompt templates in my introduction article to chatbots and agents.
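As a sketch of the read-only safeguard mentioned above, one simple approach is to reject any generated statement that is not a single SELECT before it ever reaches the warehouse. The checks below are illustrative, not exhaustive, and a real deployment would also rely on database-level privileges.

```python
import re

# Statements that should never be sent to the warehouse from a chat interface.
FORBIDDEN = ("insert", "update", "delete", "drop", "alter", "truncate", "grant")

def is_read_only(sql: str) -> bool:
    """Very rough guard: allow a single SELECT statement and nothing else."""
    statements = [s.strip() for s in sql.strip().rstrip(";").split(";") if s.strip()]
    if len(statements) != 1:
        return False
    stmt = statements[0].lower()
    if not stmt.startswith("select"):
        return False
    return not any(re.search(rf"\b{kw}\b", stmt) for kw in FORBIDDEN)

# Example: the first query passes, the second is rejected.
print(is_read_only("SELECT COUNT(*) FROM customers WHERE signup_date > '2023-06-01'"))  # True
print(is_read_only("DROP TABLE customers"))  # False
```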

The benefit of this system is that no training is required; the user just has to ask a question. If users do not get the answer they are looking for, they can simply refine the question and press enter. To improve the efficiency of the system, guidelines could be provided that describe what kind of information is in the system as well as any limitations, such as what data the user has access to. Answers could also be output as CSV if the user would like to do further analysis, or fed into a smart report-generating tool.

This approach works well if you are looking for a simple summary statistic like average sales for a region or user churn over time, but what if you need to get more insight from your data, especially text data? To address that problem, you would add an extra LLM to the process described above.

Processing SQL Responses

Now we will go over the second way you can access SQL data with LLMs. Suppose you just launched a product and want to know what users are saying based on the comments in your community forum. This data may be available in your data warehouse, but it would be in the form of unstructured text. To do this by hand, you would have to search for all responses after the launch date, read through them, and try to identify patterns. Luckily, AI can help.

Instead of simply outputting the data from a SQL query and reading it by hand as we did in the first example, you can send the output to an LLM and have that LLM perform further processing of the data. Here is what the request would look like:

“Find all community comments written after June 1st, 2023. Search through the comments and summarize any reactions related to the new launch.”

Below is an example of how the system would work:

Using LLMs to create and interpret SQL queries

A question would be submitted to the LLM, which would output a SQL query. The results of the query from the database would then be passed to another LLM, which would return a summary of the SQL results. Since you are calling LLMs twice, you may want to choose a model that is optimized for each task: instruct-type models are better at writing code, while chat models are better at summarizing information.

Note that the same LLM could be used for both steps, but in practice some models are better at creating queries while others are better at writing summaries.
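As a rough sketch of this two-step flow, the code below chains a query-writing call and a summarizing call around a local SQLite database. The `sql_llm` and `summary_llm` callables stand in for whichever LLM API you use; the stub lambdas and the `forum_comments` table are placeholders so the sketch runs on its own.

```python
import sqlite3
from typing import Callable

def answer_question(question: str,
                    conn: sqlite3.Connection,
                    sql_llm: Callable[[str], str],
                    summary_llm: Callable[[str], str]) -> str:
    """Two-step flow: one LLM writes the query, a second LLM summarizes the rows."""
    # Step 1: ask the query-writing model for SQL (the schema would be in the prompt).
    sql = sql_llm(f"Write a read-only SQL query for: {question}")
    # Step 2: run the query against the warehouse (SQLite here for the sketch).
    rows = conn.execute(sql).fetchall()
    # Step 3: pass the raw rows plus the original question to the summarizing model.
    return summary_llm(f"Question: {question}\nRows: {rows}\nSummarize the reactions.")

# Stub "models" so the sketch runs without an API key; swap in real LLM calls.
fake_sql_llm = lambda prompt: "SELECT comment FROM forum_comments WHERE posted_at > '2023-06-01'"
fake_summary_llm = lambda prompt: f"(summary of) {prompt[:80]}..."

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE forum_comments (comment TEXT, posted_at TEXT)")
conn.execute("INSERT INTO forum_comments VALUES ('Love the new launch!', '2023-06-05')")
print(answer_question("What are users saying about the new launch?", conn,
                      fake_sql_llm, fake_summary_llm))
```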

Caution

When using either of these approaches to search your SQL data, there are several problems to keep in mind.

Query Size - When SQL queries are generated, there is a potential to create a query that takes a long time to complete, impacting other users or increasing compute costs; a simple mitigation is sketched after this list.

Incorrect Response - Just because you receive a response from the database does not mean that it is appropriate for your use case. For example, if you ask for total sales in a region, there could be additional line items in your data that do not qualify as revenue and that throw off your numbers. Adding more specifics to the question should minimize errors caused by extra or incorrectly interpreted data.

A classic problem arises when a result will be promoted as an official value, such as customer count. With such a query, it is easy to double count some customers because of how account names are set up. In these situations, it is best to confirm the answer with someone who is familiar with the data and the rules around customer count.
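Returning to the query-size concern above, one simple mitigation is to cap the result size of whatever query the LLM generates before it runs. The sketch below uses an arbitrary row limit; a production setup would also set a database-side statement timeout.

```python
def cap_query(sql: str, max_rows: int = 1000) -> str:
    """Wrap a generated query so it can never return more than max_rows rows."""
    return f"SELECT * FROM ({sql.rstrip(';')}) AS limited_result LIMIT {max_rows}"

print(cap_query("SELECT customer_id, amount FROM orders"))
# SELECT * FROM (SELECT customer_id, amount FROM orders) AS limited_result LIMIT 1000
```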

Implementation

As usual, there are several ways to automate SQL calls using AI. If you have already developed pipelines that make queries to LLMs and to SQL databases, then you can build on that architecture. If you want to create a simple proof of concept for experimenting, then I would recommend taking a look at the LangChain agent for SQL databases. The open-source library comes pre-configured with connections to both OpenAI and SQL databases. With just a few lines of code, you can be up and running.
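As a minimal sketch of that proof of concept, the snippet below follows the classic LangChain SQL agent pattern; exact import paths and class names vary between LangChain versions, so treat it as illustrative rather than exact. The SQLite connection string and the question are placeholders.

```python
from langchain.agents import create_sql_agent
from langchain.agents.agent_toolkits import SQLDatabaseToolkit
from langchain.llms import OpenAI
from langchain.sql_database import SQLDatabase

# Connect to an example database; replace the URI with your warehouse connection.
db = SQLDatabase.from_uri("sqlite:///example.db")

# The same LLM is used here to write the query and summarize the results.
llm = OpenAI(temperature=0)
toolkit = SQLDatabaseToolkit(db=db, llm=llm)

# verbose=True prints each step the agent takes, which helps with troubleshooting.
agent = create_sql_agent(llm=llm, toolkit=toolkit, verbose=True)

agent.run("How many customers signed up last week?")
```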

One of the benefits of using LangChain is that it has verbose output that you can use to troubleshoot problems. You can see how the agent first asks for the table definitions as well as three sample rows from each table. The agent then creates a SQL query, which could include joins and grouping, and outputs the SQL statement. The statement is then run, and the results are passed to the LLM, along with the original question, for summarization.

Conclusion

LLMs are continuing to become more powerful, but there will always be a need to connect to external sources for real-time data. By configuring AI agents to query SQL data, you will be able to gain deep insights from your data without the vector conversion required by Retrieval Augmented Generation (RAG). In addition, AI-powered query tools enable non-technical users to quickly query data without the need for pre-built reports or cumbersome query design tools.
