Using Generative AI to Automate Data Engineering Workloads

Part 1: A Brief Overview of Large Language Models

One of the key technological paradigms that has recently exploded, and that has the potential to create a powerful business differentiator, is Generative Artificial Intelligence (Gen AI), which is built on pre-trained neural networks popularly known as Large Language Models, or LLMs. This technology is spawning a separate branch of software engineering, much as mobile platforms, RDBMS, and internet search did in the past, and it has the potential to solve and/or automate key business problems.

It is not surprising that many companies have jumped into LLM development and are trying to capture the market while it is still growing; a few notable examples:

Key Features of LLMs

  • Can leverage vector similarity search to relate user questions to external data
  • Take text and/or media as input, which is called a Prompt, e.g. "Who won the last soccer world cup?"
  • Generate responses in text and/or media format, which are called Completions, e.g. "France"
  • The chunks of text passed in a Prompt are known as Tokens (see the token-counting sketch after this list)
  • At a high level, LLMs fall into three broad categories
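As a quick illustration of tokens, here is a minimal sketch that counts the tokens in a prompt. It assumes the tiktoken library; any tokenizer shipped with your model provider works the same way.

```python
# A minimal token-counting sketch, assuming the tiktoken library is installed.
# Any tokenizer shipped with your model provider can be substituted.
import tiktoken

prompt = "Who won the last soccer world cup?"

# Load the tokenizer used by a GPT-4-class model.
encoding = tiktoken.encoding_for_model("gpt-4")

# Encode the prompt into token ids and count them.
token_ids = encoding.encode(prompt)
print(f"{len(token_ids)} tokens: {token_ids}")
```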

Brief & Fascinating History of LLMs

Dec 2015: OpenAI is founded as a non-profit, with $1B in pledges

Late 2016: An OpenAI researcher achieves a successful result by training a neural network on a corpus of Amazon reviews. The researchers are stunned to find that, behind the scenes, the network is performing sentiment analysis on the reviews without being explicitly programmed for it. They want to train on internet-scale datasets, but the technology doesn't exist yet

2017: Google Brain publishes a paper called 'Attention Is All You Need', which outlines a new neural network architecture called the Transformer, built on attention ("soft weights") and able to process tokens in parallel. This architecture allows much faster training of neural nets on massive datasets

2018: OpenAI releases the Generative Pre-trained Transformer (GPT) model, trained on 7,000+ books. Google releases BERT. The race is on

2019: OpenAI releases GPT-2, trained on 8M+ webpages (filtered based on Reddit references) with 1.5B parameters. Again, researchers are astonished that the model shows an emergent ability to translate without being trained for it.

2020s: OpenAI releases GPT-3 in June 2020, trained on a full internet crawl, books & Wikipedia. This is followed by the GPT-4 release in Mar 2023. Here's a quick comparison between GPT-3 & GPT-4:


How Are LLMs Used?

Broadly speaking, there are three high-level ways to use LLMs, in increasing order of complexity & accuracy:

  1. Direct Prompts to LLMs: Send context to the LLM in the form of questions or instructions, either via an API or via a chat interface. This is the most popular way of using LLMs. For example:
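A minimal sketch of a direct prompt sent through an API, assuming the OpenAI Python SDK (v1+) and an OPENAI_API_KEY environment variable; any hosted LLM API follows the same request/response pattern.

```python
# A minimal direct-prompt sketch, assuming the OpenAI Python SDK (>= 1.0)
# and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",  # assumed model name; use whichever model you have access to
    messages=[{"role": "user", "content": "Who won the last soccer world cup?"}],
)

# The completion text comes back in the first choice.
print(response.choices[0].message.content)
```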


2. Augment LLMs with Proprietary Data: This uses a technique called vectorization, where proprietary data is converted into multi-dimensional numeric vectors (see the primer on vectorization below)


3. Fine-Tuning LLMs: You can fine-tune the base LLM on your domain by training it further on your own datasets. This requires a lot of compute resources and as a result is an expensive proposition; it should be used only when there is a specific business need. Here's a list of open-source fine-tuned models on Hugging Face

A Brief Primer on Vectorization

In enterprise applications of LLMs, vectorization is an important technique to understand & leverage, for the following reasons:

  • Base LLMs are trained on publicly available data; they don't have access to proprietary or private enterprise data
  • Most companies have a lot of data complexity and silos, and struggle to untangle that complexity and create meaning from their data
  • Most businesses want easy answers from their data, but don't know how to get them

So how does vectorization solve these problems? In three easy steps:

a. It converts the proprietary enterprise data into vector embeddings, which are multi-dimensional numeric representations of the textual data. These embeddings are typically stored in a Vector Database. Here is a great primer for a deep dive into vector embeddings

b. The user question is also converted into a vector embedding, and using vector search, the question is compared with the data stored in the Vector DB.

c. The closest data match is then passed as input to the LLM, which now has enough context to generate an answer to your proprietary question

This process is also known as RAG (Retrieval-Augmented Generation). Here is a simplified example of this process:
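The sketch below is a minimal version of this flow, tying steps a–c together. It assumes the OpenAI SDK for both embeddings and completions, and an in-memory list stands in for the vector database; a real deployment would use a dedicated vector store such as Pinecone or pgvector.

```python
# A minimal RAG sketch: embed documents, embed the question, find the closest
# match with cosine similarity, and pass it to the LLM as context.
# Assumes the OpenAI Python SDK; the list below stands in for a vector database.
import numpy as np
from openai import OpenAI

client = OpenAI()

documents = [
    "Order 1042 shipped on 2023-05-14 from the Dallas warehouse.",
    "Order 1042 was returned by the customer on 2023-06-02.",
    "The Dallas warehouse handles all orders for the south region.",
]

def embed(text: str) -> np.ndarray:
    # Steps a/b: convert text into a multi-dimensional numeric vector.
    result = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(result.data[0].embedding)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

doc_vectors = [embed(d) for d in documents]

question = "When did order 1042 ship?"
q_vector = embed(question)

# Step b: vector search for the document closest to the question.
best_doc = max(zip(documents, doc_vectors), key=lambda pair: cosine(q_vector, pair[1]))[0]

# Step c: pass the closest match to the LLM as context.
answer = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": f"Answer using only this context:\n{best_doc}\n\nQuestion: {question}",
    }],
)
print(answer.choices[0].message.content)
```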


Part 2: Data Engineering Automation Use Case

A typical data ecosystem has databases, data ingestion pipelines, ETL processes, data extraction for downstream systems, data quality checks and various other data touch points. In a data-heavy environment, building data pipelines and maintaining this system becomes a full-time and very time-consuming task.


We propose to solve some of these pain points using a combination of Generative AI and data engineering design frameworks, and to automate pain points #1 & #3.

High Level LLM Architecture

Design Approach: Provide context to the LLM using the zero-shot prompting technique, which consists of detailed instructional prompts outlining the data pipeline requirements

Instructions: Instruct the LLM only about the schema & metadata of the tables being used

Usage: By developers & architects to automate data pipeline creation as a one-time build for any customer

Goal: Understand natural language requirements and generate a SQL query with 80–90% accuracy

Data Pipeline Design: Most data pipeline code can be expressed in the form of SQL with a wrapper framework to manage & execute SQL output (this is the data engineering part)

Execution Strategy

  1. The LLM is used as the code (SQL) generation engine.
  2. Automated and human validation of the SQL output
  3. Requirements are written in a way that they can be easily translated into prompts
  4. A Python-based framework is deployed to translate & optimize user requirements into prompts
  5. An execution framework wraps the SQL output for production execution of the code (a minimal sketch follows below)
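A minimal sketch of such an execution wrapper, assuming a standard DB-API connection (sqlite3, psycopg2, etc.); the statement whitelist and EXPLAIN-based dry run are illustrative choices, not a prescribed design.

```python
# A minimal execution-wrapper sketch around LLM-generated SQL (step 5).
# `connection` can be any DB-API connection; EXPLAIN syntax varies by database.
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def run_generated_sql(connection, sql: str, dry_run: bool = True):
    """Validate and execute SQL produced by the LLM."""
    log.info("Generated SQL:\n%s", sql)

    # Cheap structural guardrail before automated/human review (step 2).
    if not sql.lstrip().lower().startswith(("select", "insert", "merge")):
        raise ValueError("Unexpected statement type; route to human review")

    cursor = connection.cursor()
    if dry_run:
        # Ask the database for a plan without touching any data.
        cursor.execute("EXPLAIN " + sql)
        return cursor.fetchall()

    cursor.execute(sql)
    connection.commit()
    return cursor.rowcount
```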


Prompt Engineering

A Python-based framework reads the mapping document and database metadata tables to programmatically produce a prompt for the LLM with the following elements (a sketch of the prompt assembly follows the list):

  • List of tables used — this helps form the 'from' clause of the SQL
  • Table DDL and sample data — helps the LLM understand the table structure for formatting and translation
  • Referential integrity constraints — this helps form the 'join' clause of the SQL
  • Table-level rules — such as only reading active records based on a valid flag. This helps form the 'where' clause of the SQL
  • Column-level rules — these include core data transformations, business rules and syntax. This helps form the 'select' clause of the SQL
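A minimal sketch of how the prompt could be assembled from these elements. The table metadata and rules below are hypothetical, hard-coded stand-ins; in practice the framework reads them from the mapping document and metadata tables.

```python
# A minimal prompt-builder sketch. The metadata below is hypothetical; in
# practice it is read from the mapping document and database metadata tables.
tables = {
    "orders": "CREATE TABLE orders (order_id INT, customer_id INT, amount DECIMAL, valid_flag CHAR(1))",
    "customers": "CREATE TABLE customers (customer_id INT, region VARCHAR(20))",
}
joins = ["orders.customer_id = customers.customer_id"]
table_rules = ["Only read active records: orders.valid_flag = 'Y'"]
column_rules = ["total_amount = SUM(orders.amount), grouped by customers.region"]

def build_prompt() -> str:
    parts = [
        "You are a SQL generator. Produce a single ANSI SQL query.",
        "Tables used:\n" + "\n".join(tables),               # drives the 'from' clause
        "Table DDL:\n" + "\n".join(tables.values()),        # table structure
        "Join conditions:\n" + "\n".join(joins),            # drives the 'join' clause
        "Table level rules:\n" + "\n".join(table_rules),    # drives the 'where' clause
        "Column level rules:\n" + "\n".join(column_rules),  # drives the 'select' clause
    ]
    return "\n\n".join(parts)

print(build_prompt())
```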

The LLM processes the above prompt and generates the desired SQL, which can then be plugged into your data pipeline framework for execution. A few key pointers:

  • Even an 80% solution gives a 5–10X improvement through automation
  • Being wrong a few times doesn't erase being right most of the time
  • Do not release the solution into production without validation
  • The solution augments human workflows rather than replacing them (at least at first)
  • The quality of data in the prompt is more important than the sheer amount of data
  • Irrelevant data can have a negative impact on the result
  • Clarifying the prompt typically improves performance over standard prompting

Part 3: Productionizing the LLM-Based Automation Framework

Congratulations on building your first prototype. You have come this far; now let's talk about how to run it efficiently in production.

Key Challenges in Deploying LLM-Based Frameworks in Production

Token Limit — Every base LLM has a token limit, and almost every application will run into this problem at some point. It is a good idea to devise workaround strategies in advance.


Cost — LLM usage is priced by the number of input/output tokens. Like any cloud-based service, the usage cost can spiral out of control very fast, so having a good strategy to manage cost is critical to a successful deployment.


Performance — A typical LLM endpoint handles on the order of a couple of queries per second (QPS). This is very low compared to a RESTful service, whose QPS can be in the thousands. Understanding this performance limitation and designing the application around it is also a critical part of a good deployment.

Security — If you have sensitive data, gatekeeping and protecting it via masking or encryption is always a good idea. The LLM process should also be idempotent, i.e. repeated runs with the same input should produce the same output.

Ways to Manage LLM Cost in Production

Monitoring: Invest in solutions (such as LangSmith) that expose the number of tokens used per query and similar usage metrics. Only once you know the usage trend can you take steps to control it.

Prompt Engineering: Use prompt engineering techniques to control & manage the number of tokens passed to the model.

Caching: Caching involves saving frequently used prompts and their corresponding outputs in a vector store so that we don't hit the LLM redundantly (a minimal sketch follows).
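As a minimal illustration, an exact-match cache keyed on a hash of the prompt; a production setup would store embeddings in a vector store so that semantically similar prompts also hit the cache.

```python
# A minimal exact-match prompt cache. A production setup would use a vector
# store so that semantically similar prompts also hit the cache.
import hashlib

_cache: dict[str, str] = {}

def cached_completion(prompt: str, call_llm) -> str:
    """call_llm is whatever function actually invokes the model."""
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key in _cache:
        return _cache[key]        # cache hit: no tokens spent
    answer = call_llm(prompt)     # cache miss: tokens spent once
    _cache[key] = answer
    return answer
```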

Self-Hosted Model: This has a high initial cost, but fine-tuning an open-source LLM on your own proprietary data and tailoring it to specific tasks will improve cost & performance in the long run.

Ways to Get Around the LLM's Token Limit

Chunking

  • The data that needs to be fed into the model is divided into chunks
  • When a user asks a question, each of these chunks (< token limit) is reviewed
  • When a section of a chunk is relevant, that section is combined with the user question
  • This combined text is fed as the prompt, and the model is able to answer the user's question (see the sketch below)
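A minimal sketch of the chunking flow. A word-count budget and word-overlap scoring stand in for the real token limit and a proper relevance search (which would normally use embeddings, as described under Vectorization below); the input file name is hypothetical.

```python
# A minimal chunking sketch. A word-count budget stands in for the real token
# limit; word overlap stands in for a proper (embedding-based) relevance search.
def chunk_text(text: str, max_words: int = 300) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def pick_relevant_chunks(chunks: list[str], question: str, top_n: int = 2) -> list[str]:
    # Score each chunk by how many question words it contains.
    q_words = set(question.lower().split())
    scored = sorted(chunks, key=lambda c: len(q_words & set(c.lower().split())), reverse=True)
    return scored[:top_n]

document = open("large_document.txt").read()   # hypothetical input file
question = "What is the refund policy?"
context = "\n\n".join(pick_relevant_chunks(chunk_text(document), question))

# The combined text fits under the token limit and is sent as the prompt.
prompt = f"Context:\n{context}\n\nQuestion: {question}"
```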

Summarizing/Adaptation

  • For a large corpus of documents, it is not advisable to feed raw chunks for Q&A.
  • Instead, summarize the documents and run user search and queries over the summaries; the summarized data can answer most questions
  • The benefit is that you send less data for embedding and prompting

Vectorization

  • Convert data stored in documents or DBs into vectors
  • Convert the user question into a vector and perform a vector search
  • Pass the closest match to the model as context

Best Practices

Ensure Privacy: Use stateless APIs, which by default do not remember or store the inputs.

LLM Wrappers: It is always a good idea to leverage wrappers like LangChain, which provide lots of value-added services such as prompt templates, data connectors, multi-model support, etc. It does not make sense to reinvent a wheel that has already been invented.

Measure Model Accuracy: Create a benchmark by comparing the model's output against a known dataset.

Vector Stores: There are many options for storing your vector embedding data, depending on your needs — vector databases (such as Pinecone), RDBMS (such as Postgres, which has a vector extension), NoSQL DBs (such as MongoDB and Elasticsearch, which also have vector support), and libraries/APIs (such as those from Facebook or Hugging Face).

Design Is Still King: Gen AI can reduce the work by automating large parts of your data pipelines. However, you still need a design framework for prompt engineering, execution platforms, metadata-driven execution, schema modeling, etc. Without a good data engineering architecture supporting the LLMs, the productivity gains will be modest at best.

Data Quality: It is a good idea to add automated data quality checks around the LLM output, which should not only verify the generated SQL semantically but also test its performance and data accuracy. This would be part of a good design framework.
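A minimal sketch of such checks, assuming a DB-API connection and a query-shaped output; the thresholds are hypothetical and should come from your own design framework.

```python
# A minimal data quality harness around LLM-generated SQL. The thresholds are
# hypothetical; adapt them (and the SQL dialect) to your own pipeline rules.
import time

def check_generated_sql(connection, sql: str,
                        expected_min_rows: int = 1, max_seconds: float = 60.0):
    cursor = connection.cursor()

    start = time.monotonic()
    cursor.execute(f"SELECT * FROM ({sql}) AS generated")  # also catches syntax errors
    rows = cursor.fetchall()
    elapsed = time.monotonic() - start

    # Volume check: an empty or tiny result usually means a bad join or filter.
    if len(rows) < expected_min_rows:
        raise AssertionError(f"Only {len(rows)} rows returned; expected >= {expected_min_rows}")

    # Performance check: the generated SQL must stay within the pipeline SLA.
    if elapsed > max_seconds:
        raise AssertionError(f"Query took {elapsed:.1f}s, over the {max_seconds}s budget")

    return len(rows), elapsed
```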

Hope this article helps you understand LLMs and automate the boring and repetitive parts of your data workloads. Drop me a note at ashishmrig at yahoo dot com with any feedback or questions.
