Generative Model Pipeline | Create Your Own Generative AI
AGI Episode 2: End-to-End Pipeline for Advanced Software Development
Building a generative AI system requires a structured approach to handle its complexity.
In this episode, we’re learning about the steps needed to build end-to-end Gen-AI software. This is going to be your pipeline to create your own Generative AI as part of our AGI series.
Let me tell you my approach to learning this stuff.
End to End Pipeline
A generative AI pipeline is a set of steps to follow to build end-to-end Gen-AI software.
Well, now that you know how we’re going to proceed, let’s look at the steps in our End-To-End Pipeline:
1. Data Acquisition
Data acquisition means collecting the data. We need to get data somehow before we can start working on our own generative model. But what if we don’t have much of it?
Is having less data really a problem? A big NO! This is where DATA AUGMENTATION comes in.
What is Data Augmentation?
It is a technique where you transform your existing data to create a new set of data. There are various well-known techniques for doing this…
— Technique 1: Replace with Synonyms
For text, you replace words with their synonyms to get a new sentence with the same meaning, e.g. “I am happy” → “I am glad”. The same idea carries over to other modalities: you augment images by adjusting various properties of them, and similar transformations can be applied to audio and video data.
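To make this concrete, here is a minimal sketch of synonym replacement in Python. The `SYNONYMS` table is a hand-made stand-in; real pipelines usually pull synonyms from a lexical database such as WordNet.

```python
import random

# A tiny hand-made synonym table; real pipelines usually look up
# synonyms in a lexical database such as WordNet instead.
SYNONYMS = {
    "quick": ["fast", "speedy"],
    "happy": ["glad", "joyful"],
    "big": ["large", "huge"],
}

def synonym_augment(sentence, seed=0):
    """Replace each word that has a known synonym with one of them."""
    rng = random.Random(seed)
    out = []
    for word in sentence.split():
        options = SYNONYMS.get(word.lower())
        out.append(rng.choice(options) if options else word)
    return " ".join(out)

print(synonym_augment("the quick dog is happy"))
```

Each call with a different seed gives you another variant of the same sentence, which is exactly how you multiply a small dataset.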
— Technique 2: Bigram Flip
This is a bit like the active/passive-voice trick, where you reorder parts of a sentence while keeping its meaning:
e.g. I am Vritra → Vritra is my name → My name is Vritra
The whole idea is to increase the data meaningfully!
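One very simple way to implement this is to literally flip adjacent word pairs (bigrams), a cruder cousin of the rephrasing shown above. A minimal sketch:

```python
def bigram_flip(sentence):
    """Swap each adjacent pair of words: (w1 w2)(w3 w4) -> (w2 w1)(w4 w3)."""
    words = sentence.split()
    for i in range(0, len(words) - 1, 2):
        words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

print(bigram_flip("I am Vritra"))  # -> "am I Vritra"
```

In practice you would only keep the flipped variants that still read as natural language, since blind swapping can produce ungrammatical output.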
— Technique 3: Back Translate
It sounds tough, but it is not [I will tell you later on…]
In this technique we translate language from 1 → 2 → 3 → 1
Say we have a paragraph in Hindi. We convert it to English, which slightly changes the form of what was written. Next we convert that English text to, say, Spanish, shifting the wording a little more. Finally, when we convert the Spanish back to Hindi, there is a slight difference between the original text and the back-translated text, and that difference gives you a new, meaningful sample. That’s how you augment more data!
[As promised earlier…] Yes, it is easy, because Python has lots of libraries that help with translation, so you can use those!
It works especially well with large texts and paragraphs.
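Here is a minimal sketch of the round trip. The translation tables below are mock dictionaries purely to illustrate the flow; in practice you would call a real translation library or API at each step.

```python
# Mock translation tables to illustrate the round trip.  In a real
# pipeline each dictionary lookup would be a call to a translation
# library or API instead.
HI_TO_EN = {"मेरा नाम वृत्रा है": "My name is Vritra"}
EN_TO_ES = {"My name is Vritra": "Mi nombre es Vritra"}
ES_TO_HI = {"Mi nombre es Vritra": "मुझे वृत्रा कहते हैं"}

def back_translate(text_hi):
    """Hindi -> English -> Spanish -> Hindi; the result is a paraphrase."""
    en = HI_TO_EN[text_hi]
    es = EN_TO_ES[en]
    return ES_TO_HI[es]

original = "मेरा नाम वृत्रा है"
augmented = back_translate(original)
print(augmented != original)  # the round trip produced a new variant
```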
— Technique 4: Add Additional Data/Noise
Last but not least! You can add extra data to the text to augment it:
e.g. I am a developer → I am a developer, I love doing code
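A minimal sketch of this idea; the extra clauses in `FILLERS` are made-up examples:

```python
import random

# Made-up filler clauses; in a real pipeline these could come from
# related sentences in your corpus.
FILLERS = ["I love doing code", "and I enjoy it", "every single day"]

def add_noise(sentence, seed=0):
    """Append a random extra clause to create a new training sample."""
    rng = random.Random(seed)
    return sentence + ", " + rng.choice(FILLERS)

print(add_noise("I am a developer"))
```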
So! That’s it, phew…
With these techniques you can perform data acquisition even when data is scarce.
2. Data Pre-Processing
After getting your data, we need to clean it up. This isn’t just about removing HTML or unnecessary words — we need to tokenize it too. Here’s what I do:
Step 1: Cleaning Up First
Removing HTML tags and emojis (if not needed), plus spell-checking and correction.
Step 2: Basic Pre Processing
Here we apply tokenization: splitting the text at the sentence level and at the word level.
Step 3: Optional Preprocessing
Includes:
Remember, all of this is about real-world software that is actually going to be used by many end users, so if you follow these steps it will be much easier for you to create a real application.
— Tokenization
let’s understand this with an example..
Original text : “My name is Vritra”
In this process we split the text into smaller units called tokens; it is these tokens that later get converted into a numerical (vector) representation.
[ My, Name, Is, Vritra ] ( Word level Tokenization )
[“My name is Vritra”, “I am on Dev Hunt”] — Sentence Level
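A minimal sketch of both levels using Python’s `re` module; real tokenizers (and especially the subword tokenizers used by LLMs) are far more sophisticated:

```python
import re

def word_tokenize(text):
    # Keep runs of word characters only; real tokenizers handle
    # punctuation, contractions and subwords much more carefully.
    return re.findall(r"\w+", text)

def sentence_tokenize(text):
    # Naive split after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(word_tokenize("My name is Vritra"))
# -> ['My', 'name', 'is', 'Vritra']

print(sentence_tokenize("My name is Vritra. I am on Dev Hunt."))
# -> ['My name is Vritra.', 'I am on Dev Hunt.']
```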
— Stemming [ Less Used ]
Say we have words like “play”, “played”, and “playing”. All of these share the same meaning, so stemming reduces each of them to the root representation [ play ].
— It helps reduce the number of dimensions when converting text to a vector representation.
High dimensionality is a major issue in generative AI, as it confuses the model. [ Curse of Dimensionality ]
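A toy suffix-stripping stemmer to show the idea; production code would use a proper algorithm such as the Porter stemmer (available in NLTK) rather than this three-rule sketch:

```python
def crude_stem(word):
    """Strip a few common suffixes, keeping at least a 3-letter stem.
    A real stemmer (e.g. Porter) applies a much larger rule set."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([crude_stem(w) for w in ["play", "played", "playing"]])
# -> ['play', 'play', 'play']
```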
— Lemmatization
Same idea as stemming, but the root it produces is always a readable dictionary word, which is not guaranteed with stemming.
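A toy dictionary-lookup lemmatizer to contrast with stemming; the `LEMMAS` table here is a made-up fragment, while real lemmatizers (e.g. WordNet-based ones) use a full vocabulary plus part-of-speech information:

```python
# A toy lemma table; a real lemmatizer uses a full vocabulary and
# part-of-speech tags to pick the right dictionary form.
LEMMAS = {"played": "play", "playing": "play", "studies": "study", "better": "good"}

def lemmatize(word):
    """Return the dictionary form of a word, or the word itself."""
    return LEMMAS.get(word, word)

# A crude stemmer might turn "studies" into the unreadable "studi";
# the lemma is always a real word.
print(lemmatize("studies"))  # -> 'study'
```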
— Lower Casing
Why do we need to lowercase words? Because “Vritra” and “vritra” mean the same thing, but their character codes are different, so without lowercasing the model treats them as two separate tokens. That’s why we lowercase everything.
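You can see the effect on vocabulary size directly:

```python
tokens = ["Vritra", "vritra", "VRITRA"]

# Without lowercasing, the model sees three distinct vocabulary entries.
print(len(set(tokens)))  # -> 3

# After lowercasing, they collapse into one.
print(len({t.lower() for t in tokens}))  # -> 1
```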
Advanced Preprocessing
This is where developers mostly don’t work alone; language experts help them. Here we typically do things such as part-of-speech (POS) tagging, parsing, and coreference resolution.
3. Feature Engineering
In this step, we will perform Text Vectorization (text-to-vector)
→ Text vectorization techniques include One-Hot Encoding, Bag of Words, TF-IDF, and Word2Vec.
These are some of the older techniques used with classic deep learning models; for advanced models like encoder-decoders, generative models, and large language models, we mostly use transformer-based representations.
In this step we simply convert text into vectors. For intuition, take an image of a cat: every image is made up of pixels, each pixel is a colour block, and every colour has a numerical value, and that is how an image becomes a set of vector values. We do the analogous thing for text.
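As a concrete example of one of the classic techniques, here is a minimal Bag-of-Words vectorizer in pure Python; libraries such as scikit-learn ship a production version of this:

```python
def bag_of_words(corpus):
    """Turn each sentence into a count vector over the corpus vocabulary.
    A classic (pre-transformer) technique; modern models instead learn
    dense embeddings."""
    vocab = sorted({w for sent in corpus for w in sent.lower().split()})
    vectors = []
    for sent in corpus:
        words = sent.lower().split()
        vectors.append([words.count(v) for v in vocab])
    return vocab, vectors

corpus = ["My name is Vritra", "My name is my name"]
vocab, vecs = bag_of_words(corpus)
print(vocab)  # -> ['is', 'my', 'name', 'vritra']
print(vecs)   # -> [[1, 1, 1, 1], [1, 2, 2, 0]]
```

Note how every sentence now lives in the same 4-dimensional space; with a large vocabulary these vectors get huge and sparse, which is exactly the dimensionality problem mentioned earlier.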
4. Modelling
In this step you choose the model your LLM/generative system will be trained on.
Well, you can use →
1. An open-source LLM
2. A paid model, like OpenAI’s
Open-source LLMs usually need to be downloaded and trained locally, while paid models are trained on the provider’s own servers.
— It is hard to train locally: it is expensive and difficult to train a model with that many parameters.
— In the cloud case, the provider handles the whole training process on their servers. You just need to set up your instance, though you will still face issues like load balancing.
5. Evaluation
Well, we have two types of evaluation → intrinsic and extrinsic evaluation.
— In the intrinsic case, you use metrics to evaluate the model directly. This is performed by Gen-AI engineers.
— Extrinsic evaluation is performed after deployment: feedback forms, user signals, etc.
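To make the intrinsic case concrete, here is a toy metric, unigram precision: the fraction of generated tokens that also appear in a reference text. It is a simplified stand-in for real intrinsic metrics such as BLEU or perplexity:

```python
def unigram_precision(candidate, reference):
    """Fraction of candidate tokens that appear in the reference.
    A toy stand-in for intrinsic metrics like BLEU or perplexity."""
    cand = candidate.lower().split()
    ref = set(reference.lower().split())
    if not cand:
        return 0.0
    return sum(1 for w in cand if w in ref) / len(cand)

print(unigram_precision("my name is Vritra", "Vritra is my name"))  # -> 1.0
```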
6. Monitoring and Deployment
Here we keep monitoring whether something is going wrong; if it is, we revisit our evaluation and feature-engineering steps until the system solves the real issue.
This page has all articles for FREE - https://medium.com/ai-threads