Road to Lakehouse - Part 3: Data Analytics with Generative AI
This is the third part of the Data Lakehouse series, focusing on data analysis with Generative AI. The source code GitHub repository is linked in a section below. For the pipeline architecture and implementation, please take a look at the previous parts.
Road to Lakehouse - Part 1: Delta Lake data pipeline overview
Road to Lakehouse - Part 2: Ingest and process data from Kafka with CDC
Introduction
After curating the data in the Lakehouse, we are now ready to analyze it and gain insights. With advancements in Generative AI, we can leverage LLMs to generate SQL queries and answer analytical questions.
This article uses LangChain and LangGraph to build an agent that analyzes data in Delta Lake format with Spark SQL, demonstrated in LangGraph Studio. The article also shares some techniques to boost the agent's accuracy, and further improvements beyond the demo.
Quick Start
Data
Data Source and Processing
I use fake Salesforce data from this forum. The raw data, in CSV format, is somewhat messy, with no explicit foreign keys. I want to simulate a real situation that we usually face in the Lakehouse Bronze area, so I don't use well-structured sample data from OLTP systems.
Then I use Spark to apply some light transformations and store the data in Delta Lake format. We can consider this a Silver zone with 4 refined tables:
Ideally, we should query data in the Gold layer or data marts, where wide tables with many columns are used. But I want to see how the LLM joins the above tables without explicit references, so we will use the Silver area for the demo.
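To make the flow concrete, here is a minimal sketch of that refinement step. The CSV path, column cleanup, and table name are assumptions for illustration; the actual job is in the GitHub repository.

```python
from pyspark.sql import SparkSession

# Spark session with Delta Lake support enabled.
spark = (
    SparkSession.builder.appName("silver-refinement")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Light transformations: normalize column names and drop fully empty rows.
raw = spark.read.option("header", True).csv("/data/bronze/accounts.csv")
refined = (
    raw.toDF(*[c.strip().lower().replace(" ", "_") for c in raw.columns])
       .dropna(how="all")
)

# Store the result as a Delta table in the Silver zone.
refined.write.format("delta").mode("overwrite").saveAsTable("silver.accounts")
```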
In production, we should consider adding more metadata, such as tables and columns, to the LLM context. This can be achieved using tools like a data catalog or dbt for data documentation and lineage.
Query Engine
In this demo, the agent uses Spark SQL to query the above tables. A small standalone Spark cluster runs in the container.
In production, computational resources are critical for query performance, so we should consider a separate Spark/Trino/Presto cluster, or cloud services such as Amazon Athena, Azure Synapse Serverless SQL pool, Databricks SQL Serverless, etc., to query large amounts of data in parallel.
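Querying the Silver tables is then plain Spark SQL. In this sketch the master URL, table, and column names are assumptions for illustration:

```python
from pyspark.sql import SparkSession

# Connect to the standalone cluster (URL assumed); Delta configs as in
# the earlier sketch are still required for Delta tables.
spark = (
    SparkSession.builder.master("spark://spark-master:7077")
    .appName("lakehouse-queries")
    .getOrCreate()
)

# An analytical query against a Silver table (names assumed).
spark.sql("""
    SELECT stage, SUM(amount) AS total_amount
    FROM silver.opportunities
    GROUP BY stage
""").show()
```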
LangGraph Agent
The agent is built with LangChain and LangGraph. It first has an instruction as a system prompt. Please find the full prompt in this spark_prompt.py file.
Agent Tools
There are 2 groups of tools: the agent can freely use the safe tools, whereas it needs human approval to execute the sensitive tools.
Safe Tools
Sensitive Tools
I made some small modifications to these tools compared with LangChain's SparkSQLToolkit, to make them compatible with the Delta Lake data format.
Most user requests require a list of tables as the first step, so I put the table names in the system prompt; the agent then rarely needs to use the list_tables_sql_db tool.
I placed the query_sql_db tool in the sensitive group so that the agent requires human approval before executing any SQL queries. This prevents excessively large queries from consuming all computing cluster resources or escalating cloud costs.
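As a rough sketch, assuming the tool objects are the modified SparkSQLToolkit tools from the repository, the split and the routing logic could look like this:

```python
# Hypothetical split; the tool objects themselves are defined elsewhere.
safe_tools = [list_tables_sql_db, schema_sql_db, query_checker_sql_db,
              search_proper_company_name]
sensitive_tools = [query_sql_db]
safe_tool_names = {t.name for t in safe_tools}

def route_tools(state):
    """Conditional edge: send sensitive tool calls to the approval path."""
    last_message = state["messages"][-1]
    if not getattr(last_message, "tool_calls", None):
        return "end"
    if all(call["name"] in safe_tool_names for call in last_message.tool_calls):
        return "safe_tools"
    return "sensitive_tools"
```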
In production, we should create a SQL user whose permissions are scoped as narrowly as possible for the agent's needs (e.g. read-only). We can also configure user-level quotas to limit resource usage.
Agent Workflow
After defining all agent tools, nodes and conditional edges, we can simply build the agent workflow.
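Here is a minimal sketch of that wiring, reusing the tool lists and router from the sketch above; the call_model node that invokes the LLM is assumed to be defined elsewhere.

```python
from typing import Annotated, TypedDict

from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langgraph.prebuilt import ToolNode

class State(TypedDict):
    messages: Annotated[list, add_messages]

builder = StateGraph(State)
builder.add_node("agent", call_model)  # LLM node, defined elsewhere
builder.add_node("safe_tools", ToolNode(safe_tools))
builder.add_node("sensitive_tools", ToolNode(sensitive_tools))

builder.add_edge(START, "agent")
builder.add_conditional_edges(
    "agent",
    route_tools,
    {"safe_tools": "safe_tools", "sensitive_tools": "sensitive_tools", "end": END},
)
builder.add_edge("safe_tools", "agent")
builder.add_edge("sensitive_tools", "agent")
```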
Agent Techniques
Human-in-the-Loop
When building LangGraph agents, including a human-in-the-loop component is often beneficial, especially when granting access to tools. In such cases, you may prefer to manually approve actions before they are executed.
This can be done in various ways, but the main method involves adding an "interrupt" before a node's execution. This pauses the process at that point, allowing you to resume from where it left off once approved.
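With the graph builder from the earlier sketch, this pattern looks roughly like the following; the thread id is an assumption for illustration.

```python
from langgraph.checkpoint.memory import MemorySaver

# A checkpointer is required so the paused run can be resumed later.
graph = builder.compile(
    checkpointer=MemorySaver(),
    interrupt_before=["sensitive_tools"],
)

config = {"configurable": {"thread_id": "demo-thread"}}
graph.invoke({"messages": [("user", "show me total money of opportunities")]},
             config)   # pauses before the sensitive node
# ... human reviews (and optionally edits) the pending tool call ...
graph.invoke(None, config)  # resume from the checkpoint once approved
```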
High-Diversity Fields
To filter columns containing proper nouns like company or product names, it's essential to first verify the spelling to ensure accurate filtering in the WHERE clause.
This can be done by creating a vector store of all the distinct proper nouns in the database. The agent can then query this vector store whenever a user mentions a proper noun, ensuring the correct spelling is used. This ensures that the agent correctly identifies the intended entity before constructing the target query.
In this demo, I only embed company names in FAISS.
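A sketch of that retriever tool, assuming an existing spark session and a name column on the accounts table:

```python
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain.tools.retriever import create_retriever_tool

# Collect the distinct company names (table/column names assumed).
company_names = [row["name"] for row in
                 spark.sql("SELECT DISTINCT name FROM silver.accounts").collect()]

# Embed them into an in-memory FAISS index.
vector_store = FAISS.from_texts(company_names, OpenAIEmbeddings())
retriever = vector_store.as_retriever(search_kwargs={"k": 3})

search_proper_company_name = create_retriever_tool(
    retriever,
    name="search_proper_company_name",
    description="Look up the exact spelling of a company name before "
                "using it in a WHERE clause.",
)
```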
In production, we can embed as many nouns as possible in a persistent vector DB (e.g. PostgreSQL pgvector, Qdrant, etc.) and divide them into different collections, or create type tags in the metadata and then use self-querying to retrieve the correct noun type (e.g. company, product, etc.).
Dynamic few-shot examples
To enhance agent performance, we can supply a custom prompt tailored with domain-specific knowledge. In this scenario, we'll design a few-shot prompt using an example selector, which will dynamically construct the prompt based on user input. This approach allows the model to generate more accurate queries by incorporating relevant examples in the prompt for reference, because a few semantically similar examples are usually better than many static examples.
In production, we can build a more comprehensive dataset based on users' and data analysts' actual questions and queries, then retrieve only a few relevant ones to include in the system prompt or a Human/AI sample conversation.
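Here is a sketch of such a selector, with invented example pairs and FAISS as the backing store:

```python
from langchain_community.vectorstores import FAISS
from langchain_core.example_selectors import SemanticSimilarityExampleSelector
from langchain_core.prompts import FewShotPromptTemplate, PromptTemplate
from langchain_openai import OpenAIEmbeddings

# Hypothetical curated question/SQL pairs.
examples = [
    {"input": "total money of opportunities by stage",
     "query": "SELECT stage, SUM(amount) FROM opportunities GROUP BY stage;"},
    {"input": "how many leads per account",
     "query": "SELECT account_name, COUNT(*) FROM leads GROUP BY account_name;"},
]

# Retrieve the k most semantically similar examples for each user input.
example_selector = SemanticSimilarityExampleSelector.from_examples(
    examples, OpenAIEmbeddings(), FAISS, k=2, input_keys=["input"],
)

few_shot_prompt = FewShotPromptTemplate(
    example_selector=example_selector,
    example_prompt=PromptTemplate.from_template(
        "User input: {input}\nSQL query: {query}"),
    prefix="Here are examples of user inputs and their corresponding SQL queries:",
    suffix="User input: {input}\nSQL query:",
    input_variables=["input"],
)
```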
Agent Conversation
Let's try some questions on the tables using LangGraph Studio with gpt-4o.
Here is the full LangSmith trace of the agent conversation covering the questions below.
show me total money of opportunities group by stage
In the first question, I don't specify the result order, so I update the generated query to add an ORDER BY clause before executing the query_sql_db tool.
Alternatively, we can continue the conversation and ask the agent to order by a field with ASC or DESC.
how much can I earn if I acquire all leads from my accounts list?
In the second question, the agent first looks at the leads and accounts table schemas using the schema_sql_db tool, then generates the query.
But the query_checker_sql_db tool determines that the generated query is not correct (a kind of self-reflection), so the agent regenerates the query.
how many new leads existing in the accounts illinoi state?
In the third question, I made a typo in illinoi; however, the LLM has knowledge of this US state, so it can simply correct it in the WHERE clause.
In this demo, the LLM knows how to join those tables by using column names.
In production, we should mention the references between tables (e.g. fact and dimension tables) in the table or column comments, so the LLM has useful context to generate accurate queries.
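For example, a sketch using Spark SQL DDL, with assumed table and column names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Document the join key so schema_sql_db surfaces it to the LLM.
spark.sql("""
    ALTER TABLE silver.leads
    ALTER COLUMN account_name
    COMMENT 'Company name; joins to silver.accounts.name'
""")

# Table-level comment describing the relationship.
spark.sql("""
    COMMENT ON TABLE silver.leads IS
    'Sales leads; one account in silver.accounts can have many leads'
""")
```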
is there any working lead for photo bug company?
In the last question, I ask about a "photo bug" company. The LLM doesn't know the exact company name, so it uses the search_proper_company_name tool to find the proper company name to feed into the filter.
Further Improvements
Relevant Subset of Tables
In production, there will be many databases/schemas and tables. We shouldn't put all the table names in the system prompt; instead, we should feed in only a relevant subset of tables.
We can do this by adding one more step to the list_tables_sql_db tool, asking the LLM to return only the tables needed for the user question; again, rich schema and table descriptions are important at this step.
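A sketch of such a table-pruning step; the descriptions, prompt wording, and model choice are all assumptions:

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Hypothetical rich descriptions, e.g. pulled from a data catalog.
table_descriptions = """\
accounts: customer companies, one row per account
leads: potential customers, linked to accounts by company name
opportunities: deals in progress, with stage and amount
"""

prune_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "Given these table descriptions, return a comma-separated list of "
     "only the tables needed to answer the question:\n\n{table_descriptions}"),
    ("human", "{question}"),
])

prune_chain = prune_prompt | ChatOpenAI(model="gpt-4o", temperature=0)
answer = prune_chain.invoke({
    "table_descriptions": table_descriptions,
    "question": "how much can I earn if I acquire all leads from my accounts?",
})
relevant_tables = [t.strip() for t in answer.content.split(",")]
```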
Table Partitioning, Clustering and Sorting
In the above demonstration, we only put table schemas with sample rows into the LLM context via the schema_sql_db tool. We should also share more information about table partitioning, clustering, and sort keys, so the LLM has more to work with when generating well-performing queries.
We can store this information, as well as table lineage, in a data catalog or data discovery tool.
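For Delta tables, one way to pull this layout metadata at runtime is DESCRIBE DETAIL; the table name here is an assumption:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# DESCRIBE DETAIL is a Delta Lake command returning physical-layout info.
detail = spark.sql("DESCRIBE DETAIL silver.opportunities").collect()[0]
layout_hint = (
    f"Table silver.opportunities is partitioned by "
    f"{detail['partitionColumns']}"
)
# layout_hint can be appended to the schema_sql_db output for the LLM.
```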
Chain-of-Table
There is research on evolving tables in the reasoning chain for table understanding, with the following steps:
I will dive deeper into this method later to see if we can implement any ideas from this paper with LangGraph.
PlanRAG
Another line of research is plan-then-retrieval augmented generation for LLMs as decision makers, which requires (structured) data in databases and business rules.
In summary, there are 3 main steps in PlanRAG (iterative plan-then-RAG):
In comparison with iterative RAG, it adds the ‘Plan’ and ‘Re-plan’ steps (using ReAct-style instructions).
I will also dig into it later and hopefully we can leverage some ideas from this approach.
Conclusion
Data analysis is one of the use cases of Generative AI. This project serves as a solid starting point for leveraging LLMs to gain insights from data. In a real-world product, several aspects should be considered, including data governance, LLM evaluation, and operations. Additionally, we should explore how to combine structured data with semi-structured and unstructured data in the LLM context.
Here is my article about 9 Methods to Enhance the Performance of an LLM RAG Application; check it out and connect with me to discuss more about AI and LLMs.