RAG Prototype #3 - SQL Data
Mark Gerow
Impactful Application Development | Process Automation | Artificial Intelligence | Agile Project Management | Technology Leadership
In my previous RAG article, I explored how to query a vector database containing multiple types of documents (“unstructured data”). Specifically, the second prototype showed how to query customer, product, and sales rep data using a single query, letting vector search and GPT figure out which content was the most relevant for the given question.? But data stored in documents is only half the picture when it comes to business applications; a significant amount of data is stored in SQL databases (“structured data”). In order to apply generative AI to the full range of business needs it must be able to answer questions from both unstructured and structured data.
The data
To keep this first attempt at incorporating structured data simple, I created a small Azure SQL database with one table containing data for Acme AI sales. The table structure looks like this:
In the process of testing this prototype, I found that GPT could not reliably determine how to calculate total amount (price * quantity), so I created a view with that calculated column. The lesson learned being that it is a good idea to let SQL Server rather than GPT handle as many of the calculations as possible, to avoid hallucinations and strange results.
The process
The process implemented by the 3rd prototype is depicted in the following diagram:
1. Prompt seeker to ask a question
Keep asking the seeker for another question, if they just press the [Enter] key, then go to step 10 and exit the program.
2. Did the seeker ask a question?
If a question has been asked, go to step 3.
3. Call GPT to convert the question to a SQL SELECT statement
This is a bit of cool GPT magic! By giving GPT the structure of the data, along with the question, we can prompt it to return valid SQL Server SQL to query the database. Given the following seeker question:
How many orders were placed last January?
The prompt sent to GPT will look something like the following:
Be sure to use correct Microsoft SQL Server syntax. You are a Microsoft SQL Server database developer.You have a Microsoft SQL Server database view named vwORDERS with the following structure:Date datetime, Quantity int, Product varchar(50), Price money, Customer varchar(50), SalesRep varchar(50), Amount money. Provide just the valid Microsoft SQL Server TSQL SELECT statement to answer the following question: How many orders were placed last January?.
And the response from GPT will then be:
SELECT COUNT(*)?FROM vwORDERS?WHERE Date >= '2023-01-01' AND Date < '2023-02-01'
Notice that GPT was able to understand from the prompt the view to use, and the name of the date column. You may have noticed that “Microsoft SQL Server” appears repeatedly in the prompt. This is to help prevent GPT using a SQL syntax (e.g. MySql) that might be incompatible with SQL Server.
Here’s a more complex example, given the following question:
How many orders were placed in January of 2024? Summarize by product. Give me a grand total. Sort by total orders in descending order. show results in a table.
The GPT-generated SQL might be:
SELECT Product, COUNT(*) AS TotalOrders FROM vwORDERS WHERE Date >= '2024-01-01' AND Date < '2024-02-01' GROUP BY Product WITH ROLLUP ORDER BY TotalOrders DESC;
4. Query SQL database for results
Using the SQL SELECT command provided by GPT in step 3, we query the database to obtain the data needed to answer the seeker’s question.
领英推荐
5. Construct a GPT prompt using the original question and SQL data
As in the earlier prototypes which used unstructured data, we append the retrieved data to the end of our prompt, instructing GPT to use it to answer the seeker’s question.
6. Call GPT using the prompt
Once we’ve got our data, we can construct a prompt to tell GPT to generate a nicely formatted reply. The prompt might look like:
You are an AI chatbot. Answer the question: "How many orders were placed in January of 2024? Summarize by product. Give me a grand total. Sort by total orders in descending order. show results in a table." using the following data: Product: ,TotalOrders: 337, Product: AI Chatbot Platform,TotalOrders: 57, Product: AI Fraud Detection System,TotalOrders: 54, Product: AI Virtual Assistant,TotalOrders: 47, Product: AI Predictive Maintenance Software,TotalOrders: 33, Product: AI Sentiment Analysis Tool,TotalOrders: 30, Product: AI Marketing Automation Platform,TotalOrders: 30, Product: AI Recommendation Engine,TotalOrders: 28, Product: AI Data Analytics Suit,TotalOrders: 28, Product: AI Image Recognition Software,TotalOrders: 20, Product: AI Speech Recognition System,TotalOrders: 10,
Here, as in the previous prototype, we told GPT to answer the users question as an AI chatbot, and we tacked on the data returned by SQL server to the prompt. GPT will handle the rest for us!
7. Display the answer
Simply display the response from GPT and prompt seeker for another question.
8. No question entered, so exit
If the seeker pressed the [Enter] key without typing a question in step 1, then exit the program.
Observations
In the process of creating this prototype the greatest challenge was to get GPT to generate SQL using the correct syntax, as there are vendor-specific variations across SQL servers. I needed to make sure that only valid Microsoft SQL Server syntax was provided. Once I solved this problem, there was another more subtle one. It would sometimes get confused as to the date range to use, or which column to sum or count. These queries would be executed without problem but would return unintended results. A couple examples of this include a variation of a question shown above:
How many orders were placed last January? Summarize by product. Give me a grand total. Sort by total orders in descending order.
Which generated the following SQL:
SELECT Product, SUM(Quantity) AS TotalOrders FROM vwORDERS WHERE Date >= '2022-01-01' AND Date < '2022-02-01' GROUP BY Product WITH ROLLUP ORDER BY TotalOrders DESC
Notice that GPT interpreted “last January” to mean January 2022, whereas I was assuming January 2024. I could correct that by being more explicit in my question, but a casual seeker might not notice the discrepancy. The other issue with the above SQL is that GPT is summing the quantity ordered, rather than counting the number of orders. While this might be valid SQL, and it does provide an answer, it’s not the answer I was seeking. If I reword the question as follows, I can reduce the chance for confusion:
Count the orders placed last in the month of January of 2024 and summarize by product. Give me a grand total.
Which generates the following SQL:
SELECT Product, COUNT(*) AS TotalOrders, SUM(Amount) AS TotalAmount FROM vwORDERS WHERE MONTH(Date) = 1 AND YEAR(Date) = 2024 GROUP BY Product WITH ROLLUP HAVING GROUPING(Product) = 0
Which answers the question correctly. All of this is to say that there are still risks of hallucination, and a need for controls within the application to validate any answers coming back from GPT.
Sample chatbot session and output
Here is a sample session:
Conclusion and next steps
This prototype demonstrates the power of applying generative AI to structured data, but also some of the weaknesses. Without proper guardrails, GPT can easily generate SQL and final answers that appear correct but that don’t answer the seeker’s question. Part of the solution could be to provide feedback to the seeker in the form of the generated SQL so that they can inspect the logic - but doing so places a burden on them to review GPT’s work. By refining the prompts used to generate the SQL in the first place it may also be possible to reduce the probability of GPT-induced hallucinations but is unlikely to eliminate it entirely.
In the next RAG prototype, I’ll investigate combining both structured and unstructured data.
Onward!