Turning business language into data insights within your enterprise

Abstract

Imagine the potential of agent-based analytics, where authorized users can simply say what they want and easily find accurate information from diverse corporate databases without writing SQL. This article proposes a method to accelerate this self-service goal by using generative AI to create teaching examples that enable chat agents to anticipate user queries and understand their underlying intentions. While there are many approaches to this task, this article presents one method to consider and build upon.

Motivation

Two days ago, while speaking at an AWS event, a customer approached me during a break with a question. They wanted to know how to fine-tune a large language model (LLM) that could understand natural language business questions in a chat application and convert them into SQL, ultimately providing actionable datasets.

Our conversation lasted only five minutes, but the person seemed very excited to try out the idea upon returning to their office. I'm writing this article to summarize our discussion, hoping that at least one person will find the idea useful and be able to bring value to their organization.

The context here is a generic, enterprise-wide solution that translates human intention into data with low latency, independent of any specific database storage, data schema, or application context.

Fine tuning?

The customer specifically mentioned LLM fine-tuning. I asked if their chat application needed to provide factual and correct responses, and they agreed. Given that requirement, I suggested that even with fine-tuning, correct context is still crucial to ensure accurate answers free of hallucination and to produce correct SQL. I also noted that fine-tuning can only produce new models periodically. I therefore recommended that before fine-tuning, the customer first try to succeed with a retrieval augmented generation (RAG) approach; even after fine-tuning, there is no guarantee that RAG can be eliminated. If you're unfamiliar with RAG, I recommend reading this AWS blog for more context and an illustrative architecture diagram. As a general rule, I advise succeeding with RAG, or some variety of contextual setup, before attempting fine-tuning.

What are correct responses?

Given any natural language question, a correct answer means the intention is converted into an accurate SQL statement with these minimum properties:

1. The workflow can understand which database to connect to.

2. The SQL is constructed using appropriate syntax matching the database engine.

3. The response recordset is accurate and reflects the truth the user was seeking, with correct levels of filtering and aggregation.

To build a bespoke natural language to SQL converter, we must teach the chat agent how to convert a set of business questions into customized SQL. Each enterprise will have its own business language and diverse, highly customized database schemas. To teach a chat agent how to do this, we may need thousands of examples of "what users can say" matched with corresponding SQL. What the user asks is referred to as a prompt; the resulting SQL is the response.

Teaching AI 'your' business language

Step 1: Build a catalog of valid SQL responses

To generate these prompt-response pairs as teaching examples, I recommend that companies use the audit trail of previously issued SQL to satisfy the response portion of the training data. Some databases can capture the queries sent to them in the form of an audit trail or SQL cache. If the database cannot, applications can be tooled to capture who submitted what query and when. Consider capturing a month or quarter of SQL if your business has monthly or quarterly seasonality; if each week is more or less the same, less history may be needed. In this dataset, take care to remove literal values and then deduplicate the SQL. Removing literal values is important to avoid exposing PII or sensitive business facts in your training data.
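A lightweight programmatic pass can handle the common cases of literal removal and deduplication before any model is involved. This is a minimal sketch: the regular expressions and the single `:b` placeholder are simplifications (a production version would number the bind variables and handle dialect-specific quoting).

```python
import re

def normalize_sql(sql: str) -> str:
    """Replace literal values with bind variables so no PII or business
    facts leak into the training data, then normalize whitespace and case
    so logically identical statements deduplicate cleanly."""
    # Replace quoted string literals first, then bare numeric literals.
    sql = re.sub(r"'(?:[^']|'')*'", ":b", sql)
    sql = re.sub(r"\b\d+(?:\.\d+)?\b", ":b", sql)
    return " ".join(sql.split()).rstrip(";").lower()

def dedupe(audit_trail: list[str]) -> list[str]:
    """Deduplicate an audit trail of raw SQL after normalization."""
    return sorted({normalize_sql(stmt) for stmt in audit_trail})
```

Two statements that differ only in their literal values collapse to a single training example after this pass.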

If you don't have a programmatic way of removing literal values, you can use Generative AI. Here's an example of a prompt that works well in Amazon Bedrock using Anthropic's Claude 3 Haiku model:

remove all literal values from this SQL and replace with bind variables.
return only cleaned SQL as a response. 
<SQL>
select col1, col2, col3, col4 
from mytable 
where col5='CUSTOMER'
</SQL>        
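To run this prompt over a whole audit trail, the Bedrock Converse API can be wrapped in a small helper. This is a sketch, not the article's implementation: the model ID, inference settings, and prompt wiring are reasonable assumptions you may tune, and AWS credentials with Bedrock model access are presumed to be configured.

```python
CLEANING_PROMPT = (
    "remove all literal values from this SQL and replace with bind variables.\n"
    "return only cleaned SQL as a response.\n"
    "<SQL>\n{sql}\n</SQL>"
)

def clean_sql(raw_sql: str) -> str:
    """Send one audit-trail statement to Claude 3 Haiku on Amazon Bedrock
    and return the cleaned SQL text."""
    import boto3  # deferred import so the prompt constant is usable without AWS

    client = boto3.client("bedrock-runtime")
    response = client.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        messages=[{"role": "user",
                   "content": [{"text": CLEANING_PROMPT.format(sql=raw_sql)}]}],
        inferenceConfig={"maxTokens": 1024, "temperature": 0.0},
    )
    return response["output"]["message"]["content"][0]["text"].strip()
```

Temperature is set to zero because this is a mechanical rewriting task, not a creative one.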

Step 2: Build a set of questions for each SQL statement from Step 1

In this step, I demonstrate how to use another Bedrock model to convert the SQL into a proposed set of questions that would have generated it. Here's an example prompt which works well with the Anthropic Claude 3 Sonnet model:

First, evaluate the SQL statement provided in <SQL> tags. Next, provide 6 variations of natural language questions a person could ask that would cause this SQL to be generated.  Take care to use synonyms and diverse ways of asking the natural language questions. Your response requires a very specific format in JSON form. A general template for how you should answer is provided inside <RESPONSE> tag. Provide only the JSON response without comments.

<SQL>
WITH sales_summary AS (
    SELECT 
        product_id,
        SUM(quantity) as total_quantity,
        SUM(quantity * unit_price) as total_revenue
    FROM X29
    GROUP BY product_id
),
product_ranking AS (
    SELECT 
        p.product_id,
        p.product_name,
        s.total_quantity,
        s.total_revenue,
        ROW_NUMBER() OVER (ORDER BY s.total_revenue DESC) as revenue_rank,
        PERCENT_RANK() OVER (ORDER BY s.total_quantity) as quantity_percentile
    FROM B27 p
    JOIN sales_summary s ON p.product_id = s.product_id
)
SELECT 
    pr.product_id,
    pr.product_name,
    pr.total_quantity,
    pr.total_revenue,
    pr.revenue_rank,
    pr.quantity_percentile,
    CASE 
        WHEN pr.quantity_percentile >= :high_volume_threshold THEN 'High Volume'
        WHEN pr.quantity_percentile >= :medium_volume_threshold THEN 'Medium Volume'
        ELSE 'Low Volume'
    END AS volume_category
FROM product_ranking pr
WHERE pr.revenue_rank <= :top_n_products
ORDER BY pr.total_revenue DESC;
</SQL>

<RESPONSE>
{
  "natural_language_query": [
    "Which products are our best performers in terms of revenue and quantity sold?",
    "Give me a rundown of our top products by revenue, including their quantity percentiles and volume categories.",
    "What does the performance breakdown look like for our highest-earning products, considering both revenue and quantity?"
  ]
}
</RESPONSE>        

Besides Anthropic Claude 3 Sonnet, consider exploring other models by discovering the list of available foundation models. The response from Sonnet looks like this, exactly as planned and with the 6 examples requested:

{
  "natural_language_query": [
    "Which products are our top revenue generators, and how do they rank in terms of sales volume?",
    "Can you provide a list of our best-selling items, including their revenue, quantity sold, and volume classification?",
    "What are our most profitable products, and how do they compare in terms of units moved?",
    "Show me a breakdown of our highest-earning merchandise, including their sales ranking and quantity percentiles.",
    "Which items are driving our revenue, and how does their sales volume compare to other products?",
    "Give me an analysis of our top-performing products, considering both monetary value and units sold."
  ]
}        
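Because models occasionally wrap JSON in extra commentary despite instructions, it is worth parsing this response defensively before using it. A minimal sketch, assuming the response format shown above:

```python
import json

def parse_question_variants(model_text: str) -> list[str]:
    """Extract the question list from the model's JSON response.
    Slicing to the outermost braces tolerates any stray text the
    model wraps around the JSON."""
    start = model_text.find("{")
    end = model_text.rfind("}") + 1
    payload = json.loads(model_text[start:end])
    return payload["natural_language_query"]
```

If the parse fails, the statement can simply be re-submitted; with thousands of statements, a retry-and-skip policy is cheaper than hand-repair.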

Step 3: Use an embedding model and store data in a vector database

At this point, you have captured your original SQL and N plain language questions for the SQL. You should also have a feature that describes which database the SQL was extracted from. There may also be other "tags" or labels for this SQL that reflect a department, cost center, or topic to help make information searchable. This blog is one of many that describe how to vectorize data and store in a vector database.

First, you will "embed" the proposed natural language statements - all six from the response above. Next, you will store each of the six embeddings in a vector database, placing the vector in a feature used for semantic matching.

Amazon OpenSearch Service is advantageous because you can store your vector along with "the SQL" and any other desired features such as databases, tags, etc., in a single record. Later, when searching OpenSearch, you can perform multiple tasks with a single API call:

1. Find top-N matching candidates on the "natural language ask"

2. Perform filtering - such as only checking databaseX or tagY

3. Retrieve the proposed model SQL

'Your' business language in action

In the prior section, I described, at a high level, how to generate prompt-response pairs and store them in a vector database along with other metadata.

Now it's time to see the chat agent in action. Here's a list of macro-level steps that should take place in order:

1. The user types in (or says) what they're looking for, which is converted into an embedding vector. Prior to embedding, care should be taken to remove exact literals, replacing them with bind variables to help keep vector distances normalized.

2. The embedding vector is used to search OpenSearch (or another vector database) for the top-N matching records. The database returns the SQL to the application. Consider retrieving the first three responses in case the top response fails to work.

3. The application should submit the original user request to an LLM prompt along with the SQL returned from the prior step, asking the LLM to adapt the model SQL accordingly.

Evaluate the SQL statement provided within the <SQL> tag. The SQL provides a template for answering the question or directive located in the <REQUEST> tag. Next, evaluate the REQUEST offered.  Third, adapt the SQL to meet the specification in the REQUEST.  The OUTPUT should be a modified SQL statement only. Provide the executable SQL as the response without commentary.

<SQL>
WITH sales_summary AS (
    SELECT 
        product_id,
        SUM(quantity) as total_quantity,
        SUM(quantity * unit_price) as total_revenue
    FROM X29
    GROUP BY product_id
),
product_ranking AS (
    SELECT 
        p.product_id,
        p.product_name,
        s.total_quantity,
        s.total_revenue,
        ROW_NUMBER() OVER (ORDER BY s.total_revenue DESC) as revenue_rank,
        PERCENT_RANK() OVER (ORDER BY s.total_quantity) as quantity_percentile
    FROM B27 p
    JOIN sales_summary s ON p.product_id = s.product_id
)
SELECT 
    pr.product_id,
    pr.product_name,
    pr.total_quantity,
    pr.total_revenue,
    pr.revenue_rank,
    pr.quantity_percentile,
    CASE 
        WHEN pr.quantity_percentile >= :high_volume_threshold THEN 'High Volume'
        WHEN pr.quantity_percentile >= :medium_volume_threshold THEN 'Medium Volume'
        ELSE 'Low Volume'
    END AS volume_category
FROM product_ranking pr
WHERE pr.revenue_rank <= :top_n_products
ORDER BY pr.total_revenue DESC;
</SQL>
<REQUEST>
What are 5 top profitable products?
</REQUEST>        

4. The LLM returns actual SQL to be executed, responding to "What are 5 top profitable products?" The answer below looks reasonable. At this point, the application should execute the SQL. If it fails to execute for database syntax reasons, try the second SQL statement as a backup.

SELECT 
    pr.product_id,
    pr.product_name,
    pr.total_revenue
FROM (
    SELECT 
        p.product_id,
        p.product_name,
        SUM(x.quantity * x.unit_price) as total_revenue,
        ROW_NUMBER() OVER (ORDER BY SUM(x.quantity * x.unit_price) DESC) as revenue_rank
    FROM B27 p
    JOIN X29 x ON p.product_id = x.product_id
    GROUP BY p.product_id, p.product_name
) pr
WHERE pr.revenue_rank <= 5
ORDER BY pr.total_revenue DESC;        
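The fallback behavior described in steps 2 through 4 can be sketched as a small driver function. The `adapt` and `execute` callables are hypothetical stand-ins for the LLM adaptation call and the database client, both of which are application-specific.

```python
def answer_request(question: str, candidates: list[str], adapt, execute):
    """Walk the top retrieved model SQL statements in order. If the adapted
    SQL fails to execute (for example, a syntax mismatch with the target
    engine), fall back to the next candidate rather than failing the user."""
    errors = []
    for model_sql in candidates[:3]:  # keep the top three, per step 2
        sql = adapt(model_sql, question)  # LLM call from step 3
        try:
            return execute(sql)  # recordset on success
        except Exception as exc:
            errors.append(exc)  # remember why, for the feedback loop
    raise LookupError(f"no retrieved SQL produced a runnable statement: {errors}")
```

The captured errors are useful input for the feedback loop described next: a request that exhausts all candidates is exactly the kind that deserves a ticket.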

Build a feedback loop

In your application, design a mechanism that allows users to signal if their input generated the correct SQL or not. If the SQL was not retrievable, a ticket can be opened for an expert to translate the natural language ask into SQL and store the record in the vector database, closing the gap.

If your users feel the answer was wrong, they can downvote the experience. This should create a ticket for someone to inspect the question and SQL. The expert should determine the correct request and SQL response pair and have the ability to logically delete or update the record in the vector database.

If your users upvote the response because they're happy with the retrieved dataset, their question and SQL can be stored in the vector database to extend the original training data. As a caution, ensure the embedding of this question is not a near-duplicate of an existing record; every stored question should be unique.
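The duplicate check can be as simple as a cosine-similarity threshold against existing embeddings (or a k-NN search in the vector database itself). A minimal sketch; the 0.98 threshold is an assumption to tune for your embedding model.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Standard cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def is_near_duplicate(new_vector: list[float],
                      stored_vectors: list[list[float]],
                      threshold: float = 0.98) -> bool:
    """Reject an upvoted question whose embedding is nearly identical
    to one already in the catalog."""
    return any(cosine_similarity(new_vector, v) >= threshold
               for v in stored_vectors)
```

In practice you would run this check against the top-k neighbors returned by the vector database rather than the full catalog.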

Conclusion

This article described a method to turn business language into SQL statements that produce data users can act on when making decisions. Enabling users to self-serve, even without knowing SQL, is a longstanding goal for many organizations.

The article proposed a way to jumpstart a database of request and response pairs using a combination of an existing SQL audit trail with an LLM to help generate labels - a potentially time-consuming task if done manually. Next, the article advocated for using RAG, embedding models, and a vector database as key enablers. While high-level, the article provided some example blogs to read for detailed examples and architectures.

Finally, this article suggested a method to build in a feedback loop as a crucial part of the design. This feedback loop enables novel questions to be cataloged and bad responses to be removed. Over time, this design should combat entropy, adapt as the business does, and provide users with more friction-free experiences.

There are many ways to accomplish any goal. This article does not attempt to be complete or perfect. Its purpose was to offer an idea, the result of a 5-minute conversation, now written down and shared for further consideration, development and improvement.
