Using Generative AI to Automate Data Engineering Workloads
Part 1: A Brief Overview of Large Language Models
One of the key technological paradigms that has recently exploded, and has the potential to create a powerful business differentiator, is Generative Artificial Intelligence (Gen AI), which is based on pre-trained neural networks popularly known as Large Language Models, or LLMs. This technological force is spawning a separate branch of software engineering, much as mobile platforms, RDBMS, and internet search did in the past, and has the potential to solve and/or automate key business problems.
It is not surprising that many different companies have jumped into LLM development and are trying to capture the market while it is still growing. A few notable examples:
Key Features of LLMs
Brief & Fascinating History of LLMs
Dec 2015: OpenAI is founded as a non-profit, with $1B in pledges
Late 2016: An OpenAI researcher achieves a successful result by training neural networks on a corpus of Amazon reviews. The researchers are stunned to find that, behind the scenes, the neural network is doing sentiment analysis on the reviews without being explicitly programmed for it. They want to train on an internet-scale dataset, but the technology doesn't exist yet
2017: Google Brain releases a paper called 'Attention Is All You Need', which outlines a new neural network architecture called the Transformer. Its attention mechanism uses soft weights and lets all tokens in a sequence be processed in parallel, allowing much faster training of neural nets on massive datasets
2018: OpenAI releases the Generative Pre-trained Transformer (GPT) model, trained on 7,000+ books. Google releases BERT. The race is on
2019: OpenAI releases GPT-2, which is trained on 8M+ webpages (filtered based on Reddit references) and has 1.5B parameters. Again, researchers are astonished that the model has an emergent ability to translate without being trained for it.
2020s: OpenAI releases GPT-3 in June 2020, trained on a full internet crawl plus books & Wikipedia. This is followed by the GPT-4 release in March 2023. Here's a quick comparison between GPT-4 & GPT-3:
How Are LLMs Used?
Broadly speaking, there are three high-level ways to use LLMs, in increasing order of complexity & accuracy:
1. Use Base LLMs Directly: Interact with an off-the-shelf model through well-crafted prompts (prompt engineering), without adding any proprietary data (a minimal sketch of this approach follows the list)
2. Augment LLMs with Proprietary Data: This uses a technique called vectorization, where the proprietary data is converted into multi-dimensional numeric vectors
3. Fine-Tuning LLMs: You can fine-tune a base LLM on your domain by training it natively on your domain datasets. This requires a lot of compute resources and is therefore an expensive proposition; it should be used only when there is a specific business need. Here's a list of open-source fine-tuned models on Hugging Face
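As a minimal sketch of the first approach, here is a direct call to a base model, assuming the OpenAI Python client; the model name and prompts are illustrative placeholders:

```python
# Minimal sketch of approach 1: prompting a base LLM directly.
# Assumes the `openai` package (v1+) and an OPENAI_API_KEY in the environment;
# the model name and message contents are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful data engineering assistant."},
        {"role": "user", "content": "Explain what a slowly changing dimension is."},
    ],
)
print(response.choices[0].message.content)
```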
A Brief Primer on Vectorization
In enterprise applications of LLMs, vectorization is an important technique to understand & leverage, for the following reasons:
So how does vectorization solve the above? In three easy steps:
a. It converts the proprietary enterprise data into vector embeddings, which are multi-dimensional numeric representations of the textual data. These embeddings are typically stored in a Vector Database. Here is a great primer for a deep dive into vector embeddings
b. The user question is also converted into vector embeddings, and using vector search, the question is compared with the data stored in the Vector DB.
c. The closest data match is then passed as input to the LLM, which now has enough context to generate an answer to your proprietary question
This process is also known as RAG (Retrieval-Augmented Generation). Here is a simplified example of the process:
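The toy sketch below walks through the three steps end to end; the bag-of-words embedding and the in-memory list are stand-ins for a real embedding model and a real vector database:

```python
# Toy end-to-end RAG sketch. The embedding function is a stand-in for a real
# embedding model (e.g. OpenAI or Hugging Face embeddings), and the in-memory
# list stands in for a real vector database such as Pinecone.
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Hypothetical toy embedding: a bag-of-words vector. Real systems use
    # dense neural embeddings with hundreds or thousands of dimensions.
    return Counter(text.lower().split())

def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Step a: embed proprietary documents and store them (vector DB stand-in).
documents = [
    "Order table refresh runs nightly at 2 AM UTC.",
    "Customer PII columns are masked in the reporting layer.",
]
vector_store = [(doc, embed(doc)) for doc in documents]

# Step b: embed the user question and vector-search for the closest match.
question = "When does the order table refresh?"
q_vec = embed(question)
best_doc, _ = max(vector_store, key=lambda item: cosine_similarity(q_vec, item[1]))

# Step c: pass the retrieved context plus the question to the LLM.
prompt = f"Context: {best_doc}\n\nQuestion: {question}\nAnswer using only the context."
print(prompt)  # in a real system, this prompt is sent to the LLM
```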
Part 2: Data Engineering Automation Use Case
A typical data ecosystem has databases, data ingestion pipelines, ETL processes, data extraction for downstream systems, data quality checks and various other data touch points. In a data-heavy environment, building data pipelines and maintaining this system becomes a full-time and very time-consuming task
We propose to solve some of these pain points using a combination of Generative AI and data engineering design frameworks, and to automate pain points #1 & #3
High Level LLM Architecture
Design Approach: Provide context to the LLM using a zero-shot prompting technique, which comprises detailed instructional prompts outlining the data pipeline requirements
Instructions: Instruct the LLM only about the schema & metadata of the tables being used
Usage: By developers & architects to automate data pipeline creation as a one-time build for any customer
Goal: Understand the natural-language requirement and generate a SQL query with 80–90% accuracy
Data Pipeline Design: Most data pipeline code can be expressed as SQL, with a wrapper framework to manage & execute the SQL output (this is the data engineering part); a minimal sketch of such a wrapper follows
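Here is a minimal sketch of such a wrapper framework, using sqlite3 and made-up pipeline steps for illustration; a real deployment would point the runner at your warehouse and add logging, retries, and metadata-driven step ordering:

```python
# Minimal sketch of a wrapper framework that manages & executes SQL steps.
# Uses sqlite3 for illustration; the steps below are made up. In production
# the runner would target your warehouse (Postgres, Snowflake, etc.).
import sqlite3

PIPELINE_STEPS = [
    ("create_staging", "CREATE TABLE IF NOT EXISTS stg_orders (id INTEGER, amount REAL)"),
    ("load_staging", "INSERT INTO stg_orders VALUES (1, 9.99), (2, 19.99)"),
    ("build_target", "CREATE TABLE tgt_orders AS SELECT id, amount FROM stg_orders WHERE amount > 10"),
]

def run_pipeline(conn: sqlite3.Connection) -> None:
    # Execute each managed SQL step in order; generated SQL slots in here.
    for name, sql in PIPELINE_STEPS:
        print(f"running step: {name}")
        conn.execute(sql)
    conn.commit()

if __name__ == "__main__":
    run_pipeline(sqlite3.connect(":memory:"))
```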
Execution Strategy
Prompt Engineering
A Python-based framework reads the mapping document and database metadata tables to programmatically produce a prompt for the LLM, along the lines of the sketch below
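For illustration, here is a sketch of how such a framework might assemble the prompt; the table metadata and mapping rules below are hypothetical:

```python
# Hypothetical sketch of assembling an LLM prompt from a mapping document
# and table metadata; the schemas and rules are made up for illustration.
TABLE_METADATA = {
    "stg_orders": ["order_id INT", "customer_id INT", "amount DECIMAL(10,2)", "order_ts TIMESTAMP"],
    "dim_customer": ["customer_id INT", "customer_name VARCHAR(100)", "region VARCHAR(50)"],
}

MAPPING_RULES = [
    "Join stg_orders to dim_customer on customer_id.",
    "Aggregate amount by region at a daily grain.",
]

def build_prompt(metadata: dict, rules: list[str]) -> str:
    # Flatten schema metadata into one line per table, then append the
    # mapping rules as requirements, per the zero-shot design approach.
    schema_lines = [
        f"Table {table}: {', '.join(columns)}" for table, columns in metadata.items()
    ]
    return (
        "You are a SQL generator. Using only the tables below, "
        "write an ANSI SQL query that satisfies the requirements.\n\n"
        "Schema:\n" + "\n".join(schema_lines) + "\n\n"
        "Requirements:\n" + "\n".join(f"- {r}" for r in rules) + "\n\n"
        "Return only the SQL, no explanation."
    )

print(build_prompt(TABLE_METADATA, MAPPING_RULES))
```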
The LLM processes the above prompt and generates the desired SQL, which can then be plugged into your data pipeline framework to execute your pipeline. A few key pointers:
Part 3: Productionalizing an LLM-Based Automation Framework
Congratulations on building your first prototype. You have come this far; now let's talk about how to run it efficiently in production.
Key Challenges in Deploying LLM-Based Frameworks in Production
Token Limit — Every base LLM has a token limit, and almost every application will run into this problem at some point. It is a good idea to devise workaround strategies
Cost — The pricing model for LLM usage is based on the number of input/output tokens. Like any cloud-based service, usage costs can spiral out of control very fast, so a good strategy to manage cost is critical to a successful deployment
Performance — The average LLM query rate is about 2 QPS (queries per second). This is very low compared to a RESTful service, whose QPS can be in the thousands. Understanding this performance limitation and designing the application around it is also a critical part of a good deployment.
Security — If you have sensitive data, gatekeeping and protecting it via masking or encryption is always a good idea. The LLM process should also be idempotent, i.e. repeated runs with the same input should produce the same output (in practice, pinning the model version and setting temperature to 0 helps here)
Ways to Manage LLM Cost in Production
Monitoring: Invest in solutions (such as LangSmith) that expose the number of tokens used per query and other similar usage metrics. Only once you know the usage trend can you take steps to manage it
Prompt Engineering: Use prompt engineering techniques to control & manage the number of tokens passed to the model
Caching: Caching involves saving frequently used prompts and their corresponding outputs in a vector store so that we don't hit the LLMs redundantly (see the sketch after this list)
Self-Hosted Model: This has a high initial cost, but fine-tuning an open-source LLM on your own proprietary data, tailoring it to specific tasks, will improve cost & performance in the long run
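Here is a minimal caching sketch. This version keys on a hash of the normalized prompt (an exact-match cache); a semantic cache backed by a vector store, as described above, generalizes the same idea to near-duplicate prompts:

```python
# Minimal exact-match prompt cache sketch. A production semantic cache would
# instead embed the prompt and look up near matches in a vector store; this
# simpler version already eliminates repeat charges for identical inputs.
import hashlib

_cache: dict[str, str] = {}

def cached_completion(prompt: str, call_llm) -> str:
    key = hashlib.sha256(prompt.strip().lower().encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_llm(prompt)  # tokens are only paid for on a cache miss
    return _cache[key]

# Usage with a stand-in LLM call:
def fake_llm(prompt: str) -> str:
    return f"answer to: {prompt}"

print(cached_completion("What is RAG?", fake_llm))
print(cached_completion("what is rag?  ", fake_llm))  # served from cache
```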
Ways to Get Around an LLM's Token Limit
Chunking: Split long inputs into smaller pieces that each fit within the token limit, and process them piece by piece (see the sketch after this list)
Summarizing/Adaptation: Condense intermediate results, for example by summarizing each chunk, so the combined context stays under the limit
Vectorization: Store the full corpus as embeddings and retrieve only the most relevant snippets per query, as described in the RAG primer above
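As a minimal sketch of chunking: word counts stand in for token counts here; a real implementation would count tokens with the model's tokenizer (e.g. tiktoken for OpenAI models):

```python
# Minimal chunking sketch: split a long document into overlapping pieces
# that each fit under the model's context limit. Word counts are a rough
# proxy; real systems count tokens with the model's own tokenizer.
def chunk_text(text: str, max_words: int = 300, overlap: int = 30) -> list[str]:
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + max_words]))
        start += max_words - overlap  # overlap preserves context across boundaries
    return chunks

long_doc = "word " * 1000
for i, chunk in enumerate(chunk_text(long_doc)):
    print(f"chunk {i}: {len(chunk.split())} words")
```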
Best Practices
Ensure Privacy: Use stateless APIs, which by default do not remember or store the inputs
LLM Wrappers: It is always a good idea to leverage wrappers like LangChain, which provide lots of value-added services such as prompt templates, data connectors, multiple-model support, etc. It does not make sense to reinvent a wheel that has already been invented
Measure Model Accuracy: Create a benchmark by comparing against a known dataset (see the sketch after this list)
Vector Stores: There are many options for storing your vector embedding data, depending on your needs — Vector Databases (such as Pinecone), RDBMS (such as Postgres, which has the pgvector extension), NoSQL DBs (such as MongoDB and Elasticsearch, which also have vector extensions), and libraries/APIs (such as those from Facebook or Hugging Face)
Design Is Still King: Gen AI can help you reduce work by automating large parts of your data pipelines. However, you still need design frameworks for prompt engineering, execution platforms, metadata-driven execution, schema modeling, etc. Without a good data engineering architecture supporting the LLMs, the productivity gains will be modest at best.
Data Quality: It is a good idea to add automated data quality checks around the LLM output, which should not only check the SQL for semantic correctness but also test its performance and data accuracy. This should be part of your design framework.
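Here is a minimal sketch of such a benchmark, comparing the result set of LLM-generated SQL against a known-good reference query on a small test database; the schema and queries are illustrative:

```python
# Minimal accuracy-benchmark sketch: run LLM-generated SQL and a known-good
# reference query against a small test database and compare result sets.
# The schema, data, and queries below are illustrative placeholders.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "east", 10.0), (2, "west", 20.0), (3, "east", 5.0)])

reference_sql = "SELECT region, SUM(amount) FROM orders GROUP BY region"
generated_sql = "SELECT region, SUM(amount) AS total FROM orders GROUP BY region"  # stand-in for LLM output

def result_set(sql: str) -> set:
    # Compare as sets so row ordering differences don't cause false failures.
    return set(conn.execute(sql).fetchall())

match = result_set(generated_sql) == result_set(reference_sql)
print("generated SQL matches reference:", match)
```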
Hope this article helps you understand LLMs and automate the boring, repetitive parts of your data workloads. Drop me a note at ashishmrig at yahoo dot com for any feedback or questions.