How to query your SQL data with AI

How quickly can you answer a question such as, “How many customers signed up last week?” For people who don’t like querying a database with SQL, there have typically been two ways to answer this question. The first is to have a set of pre-built reports that anticipate these types of questions. Once you find the right report, you apply some filtering and arrive at your answer. If your question is not covered by the existing reports, you need to request a new one. The other method is to use a query design tool that lets you link tables and create reports on the fly, assuming the data is available and you know how to use the tool. If you ask questions infrequently, you may forget how to use the tool and end up asking for help anyway.

In this article, I show how AI can solve these problems and provide insights that are beyond the reach of even the best reports.

SQL vs. RAG

Before jumping into the SQL discussion, it is important to mention another important way in which LLMs can be supplemented with external data. Retrieval Augmented Generation (RAG) is a two-step process that first converts data into vectors that can be searched for relevance, and then passes the selected content to the LLM along with the prompt. This method works well for unstructured data such as PDFs because the system looks for any information that may be related to the prompt and sends it to the LLM, which then decides which data to use in formulating the response. You can read more about this topic in my previous article on the difference between RAG and fine-tuning.

If you were to simply dump your SQL data into a vector database, a large number of records could match your prompt, and passing all of the matching records to the LLM would increase both response time and cost. Additionally, some questions can only be answered with SQL operations such as joins, grouping, and summary statistics. Luckily, there are two methods you can use to access SQL data using LLMs.

SQL Generation

The first method uses AI to generate SQL queries and then runs those queries against the data. A Large Language Model (LLM) interprets your prompt and creates a query based on knowledge of the database schema, and the query is then run against your data warehouse, as shown below:

Generating SQL queries using AI


In a typical setup, a data lake would store all of your raw data. An Extract, Transform, and Load (ETL) process would pull the critical data and load it into a SQL data warehouse. The data warehouse would hold user-friendly tables containing the most relevant information for business users. A prompt template would be set up that includes the schema of the data warehouse as well as a brief description of each table. This lets the LLM know what information is stored in each table and the proper syntax to use when creating the query.
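As an illustration, here is a minimal sketch of what such a prompt template might look like in Python. The table names, columns, and descriptions are hypothetical placeholders, not a schema from a real warehouse.

```python
# A hypothetical prompt template that embeds the warehouse schema and
# table descriptions so the LLM can generate valid SQL. The tables and
# columns below are illustrative placeholders.
PROMPT_TEMPLATE = """You are an assistant that writes read-only SQL queries.

Database schema:
  customers(customer_id, name, region, signup_date)
    -- One row per customer account.
  orders(order_id, customer_id, order_date, amount)
    -- One row per completed order.

Rules:
- Use only SELECT statements.
- Use standard ANSI SQL syntax.

Question: {question}
SQL:"""

def build_prompt(question: str) -> str:
    """Append the user's question to the schema-aware template."""
    return PROMPT_TEMPLATE.format(question=question)

if __name__ == "__main__":
    print(build_prompt("How many customers signed up last week?"))
```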

When the user asks a question, the question would be appended to the prompt template and sent to the LLM, which would then create a SQL query. The query would be automatically sent to the data warehouse and an answer provided. Safeguards would be needed, such as only allowing read-only commands and checking the data privileges of the user. Read more about prompt templates in my introduction article to chatbots and agents.
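As a sketch of the read-only safeguard mentioned above, one simple approach is to reject any generated statement that is not a single SELECT before it ever reaches the warehouse. The checks below are illustrative, not exhaustive, and a real deployment would also rely on database-level privileges.

```python
import re

# Statements that should never be sent to the warehouse from a chat interface.
FORBIDDEN = ("insert", "update", "delete", "drop", "alter", "truncate", "grant")

def is_read_only(sql: str) -> bool:
    """Very rough guard: allow a single SELECT statement and nothing else."""
    statements = [s.strip() for s in sql.strip().rstrip(";").split(";") if s.strip()]
    if len(statements) != 1:
        return False
    stmt = statements[0].lower()
    if not stmt.startswith("select"):
        return False
    return not any(re.search(rf"\b{kw}\b", stmt) for kw in FORBIDDEN)

# Example: the first query passes, the second is rejected.
print(is_read_only("SELECT COUNT(*) FROM customers WHERE signup_date > '2023-06-01'"))  # True
print(is_read_only("DROP TABLE customers"))  # False
```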

The benefit of this system is that no training is required; the user just has to ask a question. If users do not get the answer they are looking for, they can simply refine the question and press enter. To improve the efficiency of the system, guidelines could be provided that describe what kind of information is in the system as well as any limitations, such as what data the user has access to. Answers could also be output as CSV if the user would like to do further analysis, or fed into a smart report-generating tool.

This approach works well if you are looking for a simple summary statistic like average sales for a region or user churn over time, but what if you need to get more insight from your data, especially text data? To address that problem, you would add an extra LLM to the process described above.

Processing SQL Responses

Now we will go over the second way you can access SQL data with LLMs. Suppose you just launched a product and want to know what users are saying based on the comments in your community forum. This data may be available in your data warehouse, but it would be in the form of unstructured text. To do this by hand, you would have to search for all responses after the launch date, read through them, and try to identify patterns. Luckily, AI can help.

Instead of simply outputting the data from a SQL query and reading it by hand as we did in the first example, you can send the output to an LLM and have that LLM perform further processing of the data. Here is what the request would look like:

“Find all community comments written after June 1st, 2023. Search through the comments and summarize any reactions related to the new launch.”

Below is an example of how the system would work:

Using LLMs to create and interpret SQL queries

A question would be submitted to the LLM, which would output a SQL query. The results of the query from the database would then be passed to another LLM, which would return a summary of the SQL results. Since you are calling LLMs twice, you may want to choose a model that is optimized for each task: instruct-type models are better at writing code, while chat models are better at summarizing information.

Note that the same LLM could be used for both steps, but in practice some models are better at creating queries while others are better at writing summaries.
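As a rough sketch of this two-step flow, the code below chains a query-writing call and a summarizing call around a local SQLite database. The `sql_llm` and `summary_llm` callables stand in for whichever LLM API you use; the stub lambdas and the `forum_comments` table are placeholders so the sketch runs on its own.

```python
import sqlite3
from typing import Callable

def answer_question(question: str,
                    conn: sqlite3.Connection,
                    sql_llm: Callable[[str], str],
                    summary_llm: Callable[[str], str]) -> str:
    """Two-step flow: one LLM writes the query, a second LLM summarizes the rows."""
    # Step 1: ask the query-writing model for SQL (the schema would be in the prompt).
    sql = sql_llm(f"Write a read-only SQL query for: {question}")
    # Step 2: run the query against the warehouse (SQLite here for the sketch).
    rows = conn.execute(sql).fetchall()
    # Step 3: pass the raw rows plus the original question to the summarizing model.
    return summary_llm(f"Question: {question}\nRows: {rows}\nSummarize the reactions.")

# Stub "models" so the sketch runs without an API key; swap in real LLM calls.
fake_sql_llm = lambda prompt: "SELECT comment FROM forum_comments WHERE posted_at > '2023-06-01'"
fake_summary_llm = lambda prompt: f"(summary of) {prompt[:80]}..."

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE forum_comments (comment TEXT, posted_at TEXT)")
conn.execute("INSERT INTO forum_comments VALUES ('Love the new launch!', '2023-06-05')")
print(answer_question("What are users saying about the new launch?", conn,
                      fake_sql_llm, fake_summary_llm))
```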

Caution

When using either of these approaches to search your SQL data, there are several problems to keep in mind.

Query Size - When SQL queries are generated, there is a potential to create a query that takes a long time to complete, impacting other users or increasing compute costs; a simple mitigation is sketched after this list.

Incorrect Response - Just because you receive a response from the database does not mean that it is appropriate for your use case. For example, if you ask for total sales in a region, there could be additional line items in your data that do not qualify as revenue and that throw off your numbers. Adding more specifics to the question should minimize errors caused by extra or incorrectly interpreted data.

A classic problem arises when a result will be promoted as an official value, such as customer count. With such a query, it is easy to double count some customers because of how account names are set up. In these situations, it is best to confirm the answer with someone who is familiar with the data and the rules around customer count.
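Returning to the query-size concern above, one simple mitigation is to cap the result size of whatever query the LLM generates before it runs. The sketch below uses an arbitrary row limit; a production setup would also set a database-side statement timeout.

```python
def cap_query(sql: str, max_rows: int = 1000) -> str:
    """Wrap a generated query so it can never return more than max_rows rows."""
    return f"SELECT * FROM ({sql.rstrip(';')}) AS limited_result LIMIT {max_rows}"

print(cap_query("SELECT customer_id, amount FROM orders"))
# SELECT * FROM (SELECT customer_id, amount FROM orders) AS limited_result LIMIT 1000
```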

Implementation

As usual, there are several ways to automate SQL calls using AI. If you have already developed pipelines that make queries to LLMs and to SQL databases, then you can build on that architecture. If you want to create a simple proof of concept for experimenting, then I would recommend taking a look at the LangChain agent for SQL databases. The open-source library comes pre-configured with connections to both OpenAI and SQL databases. With just a few lines of code, you can be up and running.
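As a minimal sketch of that proof of concept, the snippet below follows the classic LangChain SQL agent pattern; exact import paths and class names vary between LangChain versions, so treat it as illustrative rather than exact. The SQLite connection string and the question are placeholders.

```python
from langchain.agents import create_sql_agent
from langchain.agents.agent_toolkits import SQLDatabaseToolkit
from langchain.llms import OpenAI
from langchain.sql_database import SQLDatabase

# Connect to an example database; replace the URI with your warehouse connection.
db = SQLDatabase.from_uri("sqlite:///example.db")

# The same LLM is used here to write the query and summarize the results.
llm = OpenAI(temperature=0)
toolkit = SQLDatabaseToolkit(db=db, llm=llm)

# verbose=True prints each step the agent takes, which helps with troubleshooting.
agent = create_sql_agent(llm=llm, toolkit=toolkit, verbose=True)

agent.run("How many customers signed up last week?")
```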

One of the benefits of using LangChain is that it has verbose output that you can use to troubleshoot problems. You can see how the agent first asks for the table definitions as well as three sample rows from each table. The agent then creates a SQL query, which could include joins and grouping, and outputs the SQL statement. The statement is then run, and the results are passed to the LLM, along with the original question, for summarization.

Conclusion

LLMs are continuing to become more powerful, but there will always be a need to connect to external sources for real-time data. By configuring AI agents to query SQL data, you will be able to gain deep insights from your data without the vector conversion required by Retrieval Augmented Generation (RAG). In addition, AI-powered query tools enable non-technical users to quickly query data without the need for pre-built reports or cumbersome query design tools.
