Generative Model Pipeline | Create Your Own Generative AI


AGI Episode 2: End-to-End Pipeline for Advanced Software Development

Building a generative AI system requires a structured approach to handle its complexity.

In this episode, we’re learning about the steps needed to build end-to-end Gen-AI software. This is going to be your pipeline for creating your own Generative AI as part of our AGI series.

Let me tell you my approach to learning this stuff:

  • Break the problem down into several sub-problems, then
  • solve them step by step: that is the PIPELINE at work,
  • and list all forms of text processing needed at each step!

End-to-End Pipeline

A Generative AI pipeline is the set of steps to follow to build end-to-end Gen-AI software.

Well, now that you know how we’re going to proceed, let’s look at the steps in our End-To-End Pipeline:

  • Data Acquisition
  • Data Preparation
  • Feature Engineering
  • Modelling
  • Evaluation
  • Deployment
  • Monitoring and Model Updating

1. Data Acquisition

Data acquisition means getting hold of the data. We need data, one way or another, before we can start working on our own Generative Model. So:

  • Own Data: First of all, check whether you already have data available. It can be in any form: XLSX, CSV, TXT, PDF, DOCS, etc.
  • Other’s Data: If you find that you don’t have any data, it’s plan B time! You have to look for places you can acquire data from, i.e. other people’s data. Surf the internet, check databases and public datasets, use a third-party API that serves data, or, at the end, do web scraping (see the scraping sketch after this list)!
  • NO DATA: The last case is ending up with no data at all. Here you have to generate data yourself, or, more smartly, use an LLM to generate it (e.g. OpenAI GPT). Even then, you may only end up with a small amount of data.
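
If web scraping turns out to be your route, here is a minimal, hedged sketch in Python. It assumes the third-party requests and beautifulsoup4 packages are installed, and the URL is just a placeholder for a page you are actually allowed to scrape.

```python
# A hedged sketch of acquiring data via web scraping, using the
# third-party requests and beautifulsoup4 packages
# (pip install requests beautifulsoup4).
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/articles", timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# Collect the text of every paragraph on the page as raw training data
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
print(paragraphs[:5])
```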

Is having less data really a problem? A big NO! This is where DATA AUGMENTATION comes in.

What is Data Augmentation?

It is a technique where you mold your existing data to create a new set of data. There are various well-known techniques for doing this…

— Technique 1: Replace with Synonyms

  • Say we have the text “I am an AI developer”. Replacing words with synonyms gives variants like “I am an AI engineer” or “I am an AI programmer”. So from a single sentence we can generate a lot of extra data just by swapping in synonyms.
  • This isn’t only for text! You can do the same with images, augmenting them by adjusting properties such as brightness, rotation, and cropping. Similar substitution techniques work with audio and video data as well.
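
To make Technique 1 concrete, here is a minimal sketch of synonym replacement using NLTK’s WordNet. It assumes nltk is installed with the wordnet corpus downloaded, and the helper name synonym_replace is just illustrative.

```python
# A minimal sketch of synonym replacement with NLTK's WordNet.
# Assumes nltk is installed and the corpus is available via
# nltk.download('wordnet').
import random
from nltk.corpus import wordnet

def synonym_replace(sentence: str, n_replacements: int = 1) -> str:
    words = sentence.split()
    # Indices of words that have at least one WordNet synset
    candidates = [i for i, w in enumerate(words) if wordnet.synsets(w)]
    random.shuffle(candidates)
    for i in candidates[:n_replacements]:
        synonyms = {lemma.name().replace("_", " ")
                    for synset in wordnet.synsets(words[i])
                    for lemma in synset.lemmas()}
        synonyms.discard(words[i])
        if synonyms:
            words[i] = random.choice(sorted(synonyms))
    return " ".join(words)

print(synonym_replace("I am an AI developer"))
```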

— Technique 2: Bigram Flip

This is a bit like the active/passive voice switch: you flip the order of word pairs (bigrams) to rephrase the sentence.

e.g. I am Vritra → Vritra is my name → My name is Vritra

The whole idea is to grow the dataset meaningfully!
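
A literal bigram flip is easy to sketch: pick one adjacent word pair and swap it. This is only a toy illustration; real augmentation pipelines usually apply grammar-aware rephrasing instead.

```python
# A toy sketch of a bigram flip: swap one random pair of adjacent words
# to produce a rephrased variant of the sentence.
import random

def bigram_flip(sentence: str) -> str:
    words = sentence.split()
    if len(words) < 2:
        return sentence  # nothing to flip
    i = random.randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

print(bigram_flip("My name is Vritra"))
```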

— Technique 3: Back Translate

Well, it sounds tough, but it is not [I will tell you later on…]

In this technique we translate text through a chain of languages: 1 → 2 → 3 → 1

Say we have a paragraph in Hindi. We convert it to English, which changes the form slightly from how it was written in Hindi. Next we convert that English text to, say, Spanish, again with slight differences from the original phrasing. When we finally convert it back to Hindi, there is a slight difference between the original text and the back-translated text, and that difference is how you get more augmented data!

[I will tell you later on…] continued: yes, it is easy, because Python has lots of libraries that help with translation, so you can just use one of those!

It works especially well with long texts and paragraphs.
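
As a sketch of the 1 → 2 → 3 → 1 chain above, here is one way it could look using the third-party deep-translator package (one of the Python translation libraries just mentioned). The package choice and language codes are assumptions, not the only option.

```python
# A hedged sketch of back-translation using the third-party
# deep-translator package (pip install deep-translator).
# Chain: Hindi -> English -> Spanish -> Hindi, as described above.
from deep_translator import GoogleTranslator

def back_translate(text: str) -> str:
    english = GoogleTranslator(source="hi", target="en").translate(text)
    spanish = GoogleTranslator(source="en", target="es").translate(english)
    return GoogleTranslator(source="es", target="hi").translate(spanish)
```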

— Technique 4: Add Additional Data/Noise

Last but not least, you can append additional data (or inject light noise) into the text to augment it.

e.g. I am a developer → I am a developer, I love writing code
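
A minimal sketch, assuming simple character-level noise; appending extra context (as in the example above) can be done the same way by concatenating related sentences.

```python
# A minimal sketch of noise injection: randomly duplicate characters
# to create slightly perturbed copies of a sentence.
import random

def add_noise(sentence: str, p: float = 0.1) -> str:
    noisy = []
    for ch in sentence:
        noisy.append(ch)
        if ch.isalpha() and random.random() < p:
            noisy.append(ch)  # duplicate the character as light noise
    return "".join(noisy)

print(add_noise("I am a developer, I love writing code"))
```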

So, that’s it! Phew…

With these steps you can perform Data Acquisition.

2. Data Pre-Processing

After getting your data, we need to clean it up. This isn’t just about removing HTML or unnecessary words — we need to tokenize it too. Here’s what I do:

Step 1: Cleaning Up first

Removing HTML and emojis (if not needed), plus spell checking and correction.
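
A minimal sketch of this cleanup step with plain regexes; the emoji range below is a rough assumption, and real spell correction would need an extra library.

```python
# A minimal sketch of cleanup: strip HTML tags and common emojis.
# The emoji range below is a rough assumption, not exhaustive.
import re

def clean_text(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)                 # drop HTML tags
    text = re.sub(r"[\U0001F300-\U0001FAFF]", "", text)  # drop common emojis
    return re.sub(r"\s+", " ", text).strip()             # normalize whitespace

print(clean_text("<p>Hello 😀 world</p>"))  # -> "Hello world"
```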

Step 2: Basic Pre Processing

Here we use tokenization: we split the text into tokens at the sentence level and at the word level.

Step 3: Optional Preprocessing

Includes:

  • Stop Word Removal (a small sketch follows this list)
  • Stemming → less used nowadays
  • Lemmatization → more used
  • Punctuation Removal (. , ; $ ! ) etc.
  • Lower Casing
  • Language Detection
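
Here is the stop word removal and lowercasing from the list above as a minimal NLTK sketch; it assumes the stopwords corpus has been downloaded.

```python
# A minimal sketch of stop word removal plus lowercasing with NLTK.
# Assumes the corpus is available via nltk.download('stopwords').
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
words = "My name is Vritra and I am a developer".lower().split()
print([w for w in words if w not in stop_words])
# -> ['name', 'vritra', 'developer']
```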

All of this is about a real-world piece of software that is actually going to be used by many end users, so if you follow these steps it will make sense and be easier for you to create a real application.

— Tokenization

Let’s understand this with an example.

Original text : “My name is Vritra”

In this process we split the text into tokens, which will later be mapped to numbers (vectors):

[ My, name, is, Vritra ] (word-level tokenization)

[ “My name is Vritra”, “I am on Dev Hunt” ] (sentence-level tokenization)
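
A minimal sketch of both levels with NLTK’s tokenizers; it assumes the punkt models have been downloaded.

```python
# A minimal sketch of word- and sentence-level tokenization with NLTK.
# Assumes the models are available via nltk.download('punkt')
# (or 'punkt_tab' on newer NLTK versions).
from nltk.tokenize import sent_tokenize, word_tokenize

text = "My name is Vritra. I am on Dev Hunt."
print(word_tokenize(text))  # ['My', 'name', 'is', 'Vritra', '.', ...]
print(sent_tokenize(text))  # ['My name is Vritra.', 'I am on Dev Hunt.']
```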

— Stemming [ Less Used ]

Say we have words like “play”, “played”, “playing”. All of these share the same meaning, so they reduce to the same root representation: play.

— It helps reduce dimensionality when converting text into a vector representation.

High dimensionality is a major issue in Generative AI, as it confuses the model [ Curse of Dimensionality ].

— Lemmatization

Same idea as stemming, but the root value is always a readable word, which is not guaranteed with stemming.
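
A minimal sketch comparing the two with NLTK; it assumes the wordnet corpus is downloaded, and pos="v" tells the lemmatizer to treat each word as a verb.

```python
# A minimal sketch comparing stemming and lemmatization with NLTK.
# Assumes the corpus is available via nltk.download('wordnet').
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["play", "played", "playing"]:
    # The stemmer may output non-words; the lemmatizer returns real words
    print(word, stemmer.stem(word), lemmatizer.lemmatize(word, pos="v"))
```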

— Lower Casing

Why do we need to lowercase words? Say we have:

  • Vritra works for a company
  • vritra is a developer

That’s why we lowercase: “Vritra” and “vritra” are the same word, but not the same ASCII values, so without lowercasing the model would treat them as different tokens.

Advanced Preprocessing

This is where the work is mostly not done by developers alone; language experts help them.

Here we mostly do the following (a small POS-tagging sketch follows the list):

  • Part-of-Speech (POS) Tagging
  • Parsing
  • Co-reference Resolution
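
A minimal sketch of the first item with NLTK; it assumes the punkt and averaged_perceptron_tagger resources are downloaded.

```python
# A minimal sketch of part-of-speech tagging with NLTK.
# Assumes nltk.download('punkt') and
# nltk.download('averaged_perceptron_tagger') have been run.
from nltk import pos_tag, word_tokenize

print(pos_tag(word_tokenize("Vritra works for a company")))
# -> [('Vritra', 'NNP'), ('works', 'VBZ'), ('for', 'IN'), ...]
```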

3. Feature Engineering

In this step, we perform text vectorization (text-to-vector).

→ Text vectorization techniques include:

  • TF-IDF
  • Bag of Words
  • Word2Vec
  • One-Hot Encoding
  • Transformer models

These are some of the older techniques used with deep learning models; for advanced models like encoder-decoders, generative models, and large language models we mostly use transformer-based vectorization.

So in this step we simply convert text to vectors. The same idea applies beyond text: say we have an image of a cat. Every image is made up of pixels, each pixel is a color block, and every color has a numerical value, so that is how an image is converted into a set of vector values.
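
To make one of the listed techniques concrete, here is a minimal TF-IDF sketch with scikit-learn; the toy corpus is just for illustration.

```python
# A minimal sketch of TF-IDF text vectorization with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["My name is Vritra", "I am an AI developer"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse matrix: documents x vocabulary

print(vectorizer.get_feature_names_out())
print(X.toarray())  # each row is the vector for one document
```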

4. Modelling

In this step you choose which model your LLM/generative system will be built and trained on.

Well, you can choose →

1. Open-source LLMs

2. Paid models, like OpenAI’s

Open-source LLMs usually need to be downloaded and trained locally, whereas paid models are trained on the provider’s own servers.

— Training locally is hard: it is expensive, and models with that many parameters are difficult to train.

— In the cloud case, the provider handles the whole training process on their servers, so you just need to set up your instance, though you will still face issues like load balancing.
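
A hedged sketch of the open-source route using Hugging Face transformers; "gpt2" is just a small illustrative model choice, not a recommendation.

```python
# A hedged sketch of loading and running an open-source LLM with
# Hugging Face transformers (pip install transformers torch).
# "gpt2" is only an illustrative small model.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("My name is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```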

5. Evaluation

Well, we have two techniques of evaluation → intrinsic and extrinsic evaluation.

— In the intrinsic case you have metrics to perform the evaluation (perplexity is a common one); this is performed by GenAI engineers.

— Extrinsic evaluation is performed after deployment, through things like feedback forms.
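
As a sketch of one intrinsic metric, here is perplexity computed for a causal LM with transformers and PyTorch; again, "gpt2" is just an illustrative choice.

```python
# A hedged sketch of an intrinsic metric: perplexity of a causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("My name is Vritra", return_tensors="pt")
with torch.no_grad():
    # Using the inputs as their own labels gives the average
    # negative log-likelihood; exp of that is the perplexity.
    loss = model(**inputs, labels=inputs["input_ids"]).loss
print("perplexity:", torch.exp(loss).item())
```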

6. Monitoring and Deployment

Here we keep monitoring whether something is going wrong; if it is, we revisit our evaluation and feature engineering steps so the system keeps solving the real issue.




This page has all articles for FREE - https://medium.com/ai-threads
