Road to Lakehouse - Part 3: Data Analytics with Generative AI
This is the third part of the Data Lakehouse series, focusing on data analysis with Generative AI. The source code GitHub repository is linked in a section below. For the pipeline architecture and implementation, please take a look at the previous parts.
Road to Lakehouse - Part 1: Delta Lake data pipeline overview
Road to Lakehouse - Part 2: Ingest and process data from Kafka with CDC
Introduction
After curating the data in the Lakehouse, we are now ready to analyze it and gain insights. With advancements in Generative AI, we can leverage LLMs to generate SQL queries and answer analytical questions.
This article uses LangChain and LangGraph to build an agent that analyzes data in Delta Lake format with Spark SQL, demonstrated in LangGraph Studio. The article also shares some techniques to boost the agent's accuracy, and further improvements beyond the demo.
Quick Start
Data
Data Source and Processing
I use fake Salesforce data from this forum. The raw data, in CSV format, is somewhat messy, with no explicit foreign keys. I want to simulate a real situation that we usually face in the Lakehouse Bronze area, so I don't use well-structured sample data from OLTP systems.
Then I use Spark to apply some light transformations and store the data in Delta Lake format. We can consider this a Silver zone with 4 refined tables:
Ideally, we should query data in the Gold layer or data marts, where wide tables with many columns are used. But I want to see how the LLM joins the above tables without explicit references, so we will use the Silver area for the demo.
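To make the flow concrete, here is a minimal sketch of that refinement step. The CSV path, column cleanup, and table name are assumptions for illustration; the actual job is in the GitHub repository.

```python
from pyspark.sql import SparkSession

# Spark session with Delta Lake support enabled.
spark = (
    SparkSession.builder.appName("silver-refinement")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Light transformations: normalize column names and drop fully empty rows.
raw = spark.read.option("header", True).csv("/data/bronze/accounts.csv")
refined = (
    raw.toDF(*[c.strip().lower().replace(" ", "_") for c in raw.columns])
       .dropna(how="all")
)

# Store the result as a Delta table in the Silver zone.
refined.write.format("delta").mode("overwrite").saveAsTable("silver.accounts")
```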
In production, we should consider adding more metadata, such as tables and columns, to the LLM context. This can be achieved using tools like a data catalog or dbt for data documentation and lineage.
Query Engine
In this demo, the agent uses Spark SQL to query the above tables. A small standalone Spark cluster runs in the container.
In production, computational resources are critical for query performance, so we should consider a separate Spark/Trino/Presto cluster, or cloud services such as Amazon Athena, Azure Synapse Serverless SQL pool, Databricks SQL Serverless, etc., to query large amounts of data in parallel.
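Querying the Silver tables is then plain Spark SQL. In this sketch the master URL, table, and column names are assumptions for illustration:

```python
from pyspark.sql import SparkSession

# Connect to the standalone cluster (URL assumed); Delta configs as in
# the earlier sketch are still required for Delta tables.
spark = (
    SparkSession.builder.master("spark://spark-master:7077")
    .appName("lakehouse-queries")
    .getOrCreate()
)

# An analytical query against a Silver table (names assumed).
spark.sql("""
    SELECT stage, SUM(amount) AS total_amount
    FROM silver.opportunities
    GROUP BY stage
""").show()
```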
LangGraph Agent
The agent is built with LangChain and LangGraph. It first has an instruction as a system prompt. Please find the full prompt in this spark_prompt.py file.
Agent Tools
There are 2 groups of tools: the agent can freely use the safe tools, whereas it needs human approval to execute the sensitive tools.
Safe Tools
Sensitive Tools
I made some small modifications to these tools compared with LangChain's SparkSQLToolkit, to make them compatible with the Delta Lake data format.
Most user requests require a list of tables as the first step, so I put the table names in the system prompt; the agent then rarely needs to use the list_tables_sql_db tool.
I placed the query_sql_db tool in the sensitive group so that the agent requires human approval before executing any SQL queries. This prevents excessively large queries from consuming all computing cluster resources or escalating cloud costs.
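As a rough sketch, assuming the tool objects are the modified SparkSQLToolkit tools from the repository, the split and the routing logic could look like this:

```python
# Hypothetical split; the tool objects themselves are defined elsewhere.
safe_tools = [list_tables_sql_db, schema_sql_db, query_checker_sql_db,
              search_proper_company_name]
sensitive_tools = [query_sql_db]
safe_tool_names = {t.name for t in safe_tools}

def route_tools(state):
    """Conditional edge: send sensitive tool calls to the approval path."""
    last_message = state["messages"][-1]
    if not getattr(last_message, "tool_calls", None):
        return "end"
    if all(call["name"] in safe_tool_names for call in last_message.tool_calls):
        return "safe_tools"
    return "sensitive_tools"
```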
In production, we should create a SQL user whose permissions are scoped as narrowly as possible for the agent's needs (e.g. read-only). We can also configure user-level quotas to limit resource usage.
Agent Workflow
After defining all agent tools, nodes and conditional edges, we can simply build the agent workflow.
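Here is a minimal sketch of that wiring, reusing the tool lists and router from the sketch above; the call_model node that invokes the LLM is assumed to be defined elsewhere.

```python
from typing import Annotated, TypedDict

from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langgraph.prebuilt import ToolNode

class State(TypedDict):
    messages: Annotated[list, add_messages]

builder = StateGraph(State)
builder.add_node("agent", call_model)  # LLM node, defined elsewhere
builder.add_node("safe_tools", ToolNode(safe_tools))
builder.add_node("sensitive_tools", ToolNode(sensitive_tools))

builder.add_edge(START, "agent")
builder.add_conditional_edges(
    "agent",
    route_tools,
    {"safe_tools": "safe_tools", "sensitive_tools": "sensitive_tools", "end": END},
)
builder.add_edge("safe_tools", "agent")
builder.add_edge("sensitive_tools", "agent")
```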
Agent Techniques
Human-in-the-Loop
When building LangGraph agents, including a human-in-the-loop component is often beneficial, especially when granting access to tools. In such cases, you may prefer to manually approve actions before they are executed.
This can be done in various ways, but the main method involves adding an "interrupt" before a node's execution. This pauses the process at that point, allowing you to resume from where it left off once approved.
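With the graph builder from the earlier sketch, this pattern looks roughly like the following; the thread id is an assumption for illustration.

```python
from langgraph.checkpoint.memory import MemorySaver

# A checkpointer is required so the paused run can be resumed later.
graph = builder.compile(
    checkpointer=MemorySaver(),
    interrupt_before=["sensitive_tools"],
)

config = {"configurable": {"thread_id": "demo-thread"}}
graph.invoke({"messages": [("user", "show me total money of opportunities")]},
             config)   # pauses before the sensitive node
# ... human reviews (and optionally edits) the pending tool call ...
graph.invoke(None, config)  # resume from the checkpoint once approved
```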
High-Diversity Fields
To filter columns containing proper nouns like company or product names, it's essential to first verify the spelling to ensure accurate filtering in the WHERE clause.
This can be done by creating a vector store of all the distinct proper nouns in the database. The agent can then query this vector store whenever a user mentions a proper noun, ensuring the correct spelling is used. This ensures that the agent correctly identifies the intended entity before constructing the target query.
In this demo, I only embed company names in FAISS.
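A sketch of that retriever tool, assuming an existing spark session and a name column on the accounts table:

```python
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain.tools.retriever import create_retriever_tool

# Collect the distinct company names (table/column names assumed).
company_names = [row["name"] for row in
                 spark.sql("SELECT DISTINCT name FROM silver.accounts").collect()]

# Embed them into an in-memory FAISS index.
vector_store = FAISS.from_texts(company_names, OpenAIEmbeddings())
retriever = vector_store.as_retriever(search_kwargs={"k": 3})

search_proper_company_name = create_retriever_tool(
    retriever,
    name="search_proper_company_name",
    description="Look up the exact spelling of a company name before "
                "using it in a WHERE clause.",
)
```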
In production, we can embed as many nouns as possible in a persistent vector DB (e.g. PostgreSQL pgvector, Qdrant, etc.) and divide them into different collections, or create type tags in the metadata and then use self-querying to retrieve the correct noun type (e.g. company, product, etc.).
Dynamic few-shot examples
To enhance agent performance, we can supply a custom prompt tailored with domain-specific knowledge. In this scenario, we'll design a few-shot prompt using an example selector, which will dynamically construct the prompt based on user input. This approach allows the model to generate more accurate queries by incorporating relevant examples in the prompt for reference, because a few semantically similar examples are usually better than many static examples.
In production, we can build a more comprehensive dataset based on users' and data analysts' actual questions and queries, then retrieve only a few relevant ones to include in the system prompt or a Human/AI sample conversation.
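Here is a sketch of such a selector, with invented example pairs and FAISS as the backing store:

```python
from langchain_community.vectorstores import FAISS
from langchain_core.example_selectors import SemanticSimilarityExampleSelector
from langchain_core.prompts import FewShotPromptTemplate, PromptTemplate
from langchain_openai import OpenAIEmbeddings

# Hypothetical curated question/SQL pairs.
examples = [
    {"input": "total money of opportunities by stage",
     "query": "SELECT stage, SUM(amount) FROM opportunities GROUP BY stage;"},
    {"input": "how many leads per account",
     "query": "SELECT account_name, COUNT(*) FROM leads GROUP BY account_name;"},
]

# Retrieve the k most semantically similar examples for each user input.
example_selector = SemanticSimilarityExampleSelector.from_examples(
    examples, OpenAIEmbeddings(), FAISS, k=2, input_keys=["input"],
)

few_shot_prompt = FewShotPromptTemplate(
    example_selector=example_selector,
    example_prompt=PromptTemplate.from_template(
        "User input: {input}\nSQL query: {query}"),
    prefix="Here are examples of user inputs and their corresponding SQL queries:",
    suffix="User input: {input}\nSQL query:",
    input_variables=["input"],
)
```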
Agent Conversation
Let's try some questions on the tables using LangGraph Studio with gpt-4o.
Here is the full LangSmith trace of the agent conversation covering the questions below.
show me total money of opportunities group by stage
In the first question, I don't specify the result order, so I update the generated query to add an ORDER BY clause before executing the query_sql_db tool.
Alternatively, we can continue the conversation and ask the agent to order by a field with ASC or DESC.
how much can I earn if I acquire all leads from my accounts list?
In the second question, the agent first looks at the leads and accounts table schemas using the schema_sql_db tool, then generates the query.
But the query_checker_sql_db tool determines that the generated query is not correct (a kind of self-reflection), so the agent regenerates the query.
how many new leads existing in the accounts illinoi state?
In the third question, I made a typo in illinoi; however, the LLM has knowledge of this US state, so it can simply correct it in the WHERE clause.
In this demo, the LLM knows how to join those tables by using column names.
In production, we should mention the references between tables (e.g. fact and dimension tables) in the table or column comments, so the LLM has useful context to generate accurate queries.
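For example, a sketch using Spark SQL DDL, with assumed table and column names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Document the join key so schema_sql_db surfaces it to the LLM.
spark.sql("""
    ALTER TABLE silver.leads
    ALTER COLUMN account_name
    COMMENT 'Company name; joins to silver.accounts.name'
""")

# Table-level comment describing the relationship.
spark.sql("""
    COMMENT ON TABLE silver.leads IS
    'Sales leads; one account in silver.accounts can have many leads'
""")
```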
is there any working lead for photo bug company?
In the last question, I ask about a "photo bug" company. The LLM doesn't know the exact company name, so it uses the search_proper_company_name tool to find the proper company name to feed into the filter.
Further Improvements
Relevant Subset of Tables
In production, there will be many databases/schemas and tables. We shouldn't put all the table names in the system prompt; instead, we should feed in only a relevant subset of tables.
We can do this by adding one more step to the list_tables_sql_db tool, asking the LLM to return only the tables needed for the user question; again, rich schema and table descriptions are important at this step.
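A sketch of such a table-pruning step; the descriptions, prompt wording, and model choice are all assumptions:

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Hypothetical rich descriptions, e.g. pulled from a data catalog.
table_descriptions = """\
accounts: customer companies, one row per account
leads: potential customers, linked to accounts by company name
opportunities: deals in progress, with stage and amount
"""

prune_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "Given these table descriptions, return a comma-separated list of "
     "only the tables needed to answer the question:\n\n{table_descriptions}"),
    ("human", "{question}"),
])

prune_chain = prune_prompt | ChatOpenAI(model="gpt-4o", temperature=0)
answer = prune_chain.invoke({
    "table_descriptions": table_descriptions,
    "question": "how much can I earn if I acquire all leads from my accounts?",
})
relevant_tables = [t.strip() for t in answer.content.split(",")]
```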
Table Partitioning, Clustering and Sorting
In the above demonstration, we only put table schemas with sample rows into the LLM context via the schema_sql_db tool. We should also share more information about table partitioning, clustering, and sort keys, so the LLM has more to work with when generating well-performing queries.
We can store this information, as well as table lineage, in a data catalog or data discovery tool.
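For Delta tables, one way to pull this layout metadata at runtime is DESCRIBE DETAIL; the table name here is an assumption:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# DESCRIBE DETAIL is a Delta Lake command returning physical-layout info.
detail = spark.sql("DESCRIBE DETAIL silver.opportunities").collect()[0]
layout_hint = (
    f"Table silver.opportunities is partitioned by "
    f"{detail['partitionColumns']}"
)
# layout_hint can be appended to the schema_sql_db output for the LLM.
```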
Chain-of-Table
There is research on evolving tables in the reasoning chain for table understanding, with the following steps:
I will dive deeper into this method later to see if we can implement any ideas from this paper with LangGraph.
PlanRAG
Another line of research is plan-then-retrieval augmented generation for LLMs as decision makers, which requires (structured) data in databases and business rules.
In summary, there are 3 main steps in PlanRAG (iterative plan-then-RAG):
In comparison with iterative RAG, it adds the ‘Plan’ and ‘Re-plan’ steps (using ReAct-style instructions).
I will also dig into it later and hopefully we can leverage some ideas from this approach.
Conclusion
Data analysis is one of the use cases of Generative AI. This project serves as a solid starting point for leveraging LLMs to gain insights from data. In a real-world product, several aspects should be considered, including data governance, LLM evaluation, and operations. Additionally, we should explore how to combine structured data with semi-structured and unstructured data in the LLM context.
Here is my article about 9 Methods to Enhance the Performance of an LLM RAG Application; check it out and connect with me to discuss more about AI and LLMs.