Generative Model Pipeline | Create Your Own Generative AI
AGI Episode 2: End-to-End Pipeline for Advanced Software Development
Building a generative AI system requires a structured approach to handle its complexity.
In this episode, we’re learning about the steps needed to build end-to-end Gen-AI software. This is going to be your pipeline to create your own Generative AI as part of our AGI series.
Let me tell you my approach to learning this stuff.
End to End Pipeline
A generative AI pipeline is a set of steps to follow to build end-to-end Gen-AI software.
Well, now that you know how we’re going to proceed, let’s look at the steps in our End-To-End Pipeline:
1. Data Acquisition
Data acquisition means collecting the data. We need to get data somehow before we can start working on our own generative model. But what if we don’t have much of it?
Is having less data really a problem? A big NO! This is where DATA AUGMENTATION comes in.
What is Data Augmentation?
It is a technique where you transform your existing data to create a new set of data. There are various well-known techniques for doing this…
— Technique 1: Replace with Synonyms
For text, you replace words with their synonyms to get a new sentence with the same meaning, e.g. “I am happy” → “I am glad”. The same idea carries over to other modalities: you augment images by adjusting various properties of them, and similar transformations can be applied to audio and video data.
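To make this concrete, here is a minimal sketch of synonym replacement in Python. The `SYNONYMS` table is a hand-made stand-in; real pipelines usually pull synonyms from a lexical database such as WordNet.

```python
import random

# A tiny hand-made synonym table; real pipelines usually look up
# synonyms in a lexical database such as WordNet instead.
SYNONYMS = {
    "quick": ["fast", "speedy"],
    "happy": ["glad", "joyful"],
    "big": ["large", "huge"],
}

def synonym_augment(sentence, seed=0):
    """Replace each word that has a known synonym with one of them."""
    rng = random.Random(seed)
    out = []
    for word in sentence.split():
        options = SYNONYMS.get(word.lower())
        out.append(rng.choice(options) if options else word)
    return " ".join(out)

print(synonym_augment("the quick dog is happy"))
```

Each call with a different seed gives you another variant of the same sentence, which is exactly how you multiply a small dataset.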
— Technique 2: Bigram Flip
This is a bit like the active/passive-voice trick, where you reorder parts of a sentence while keeping its meaning:
e.g. I am Vritra → Vritra is my name → My name is Vritra
The whole idea is to increase the data meaningfully!
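One very simple way to implement this is to literally flip adjacent word pairs (bigrams), a cruder cousin of the rephrasing shown above. A minimal sketch:

```python
def bigram_flip(sentence):
    """Swap each adjacent pair of words: (w1 w2)(w3 w4) -> (w2 w1)(w4 w3)."""
    words = sentence.split()
    for i in range(0, len(words) - 1, 2):
        words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

print(bigram_flip("I am Vritra"))  # -> "am I Vritra"
```

In practice you would only keep the flipped variants that still read as natural language, since blind swapping can produce ungrammatical output.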
— Technique 3: Back Translate
It sounds tough, but it is not [I will tell you later on…]
In this technique we translate language from 1 → 2 → 3 → 1
Say we have a paragraph in Hindi. We convert it to English, which slightly changes the form of what was written. Next we convert that English text to, say, Spanish, shifting the wording a little more. Finally, when we convert the Spanish back to Hindi, there is a slight difference between the original text and the back-translated text, and that difference gives you a new, meaningful sample. That’s how you augment more data!
[As promised earlier…] Yes, it is easy, because Python has lots of libraries that help with translation, so you can use those!
It works especially well with large texts and paragraphs.
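Here is a minimal sketch of the round trip. The translation tables below are mock dictionaries purely to illustrate the flow; in practice you would call a real translation library or API at each step.

```python
# Mock translation tables to illustrate the round trip.  In a real
# pipeline each dictionary lookup would be a call to a translation
# library or API instead.
HI_TO_EN = {"मेरा नाम वृत्रा है": "My name is Vritra"}
EN_TO_ES = {"My name is Vritra": "Mi nombre es Vritra"}
ES_TO_HI = {"Mi nombre es Vritra": "मुझे वृत्रा कहते हैं"}

def back_translate(text_hi):
    """Hindi -> English -> Spanish -> Hindi; the result is a paraphrase."""
    en = HI_TO_EN[text_hi]
    es = EN_TO_ES[en]
    return ES_TO_HI[es]

original = "मेरा नाम वृत्रा है"
augmented = back_translate(original)
print(augmented != original)  # the round trip produced a new variant
```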
— Technique 4: Add Additional Data/Noise
Last but not least! You can add extra data to the text to augment it:
e.g. I am a developer → I am a developer, I love doing code
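A minimal sketch of this idea; the extra clauses in `FILLERS` are made-up examples:

```python
import random

# Made-up filler clauses; in a real pipeline these could come from
# related sentences in your corpus.
FILLERS = ["I love doing code", "and I enjoy it", "every single day"]

def add_noise(sentence, seed=0):
    """Append a random extra clause to create a new training sample."""
    rng = random.Random(seed)
    return sentence + ", " + rng.choice(FILLERS)

print(add_noise("I am a developer"))
```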
So! That’s it, phew…
With these techniques you can perform data acquisition even when data is scarce.
2. Data Pre-Processing
After getting your data, we need to clean it up. This isn’t just about removing HTML or unnecessary words — we need to tokenize it too. Here’s what I do:
Step 1: Cleaning Up First
Removing HTML tags and emojis (if not needed), plus spell-checking and correction.
Step 2: Basic Pre Processing
Here we apply tokenization: splitting the text at the sentence level and at the word level.
Step 3: Optional Preprocessing
Includes:
Remember, all of this is about real-world software that is actually going to be used by many end users, so if you follow these steps it will be much easier for you to create a real application.
— Tokenization
let’s understand this with an example..
Original text : “My name is Vritra”
In this process we split the text into smaller units called tokens; it is these tokens that later get converted into a numerical (vector) representation.
[ My, Name, Is, Vritra ] ( Word level Tokenization )
[“My name is Vritra”, “I am on Dev Hunt”] — Sentence Level
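A minimal sketch of both levels using Python’s `re` module; real tokenizers (and especially the subword tokenizers used by LLMs) are far more sophisticated:

```python
import re

def word_tokenize(text):
    # Keep runs of word characters only; real tokenizers handle
    # punctuation, contractions and subwords much more carefully.
    return re.findall(r"\w+", text)

def sentence_tokenize(text):
    # Naive split after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(word_tokenize("My name is Vritra"))
# -> ['My', 'name', 'is', 'Vritra']

print(sentence_tokenize("My name is Vritra. I am on Dev Hunt."))
# -> ['My name is Vritra.', 'I am on Dev Hunt.']
```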
— Stemming [ Less Used ]
Say we have words like “play”, “played”, and “playing”. All of these share the same meaning, so stemming reduces each of them to the root representation [ play ].
— It helps reduce the number of dimensions when converting text to a vector representation.
High dimensionality is a major issue in generative AI, as it confuses the model. [ Curse of Dimensionality ]
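A toy suffix-stripping stemmer to show the idea; production code would use a proper algorithm such as the Porter stemmer (available in NLTK) rather than this three-rule sketch:

```python
def crude_stem(word):
    """Strip a few common suffixes, keeping at least a 3-letter stem.
    A real stemmer (e.g. Porter) applies a much larger rule set."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([crude_stem(w) for w in ["play", "played", "playing"]])
# -> ['play', 'play', 'play']
```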
— Lemmatization
Same idea as stemming, but the root it produces is always a readable dictionary word, which is not guaranteed with stemming.
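A toy dictionary-lookup lemmatizer to contrast with stemming; the `LEMMAS` table here is a made-up fragment, while real lemmatizers (e.g. WordNet-based ones) use a full vocabulary plus part-of-speech information:

```python
# A toy lemma table; a real lemmatizer uses a full vocabulary and
# part-of-speech tags to pick the right dictionary form.
LEMMAS = {"played": "play", "playing": "play", "studies": "study", "better": "good"}

def lemmatize(word):
    """Return the dictionary form of a word, or the word itself."""
    return LEMMAS.get(word, word)

# A crude stemmer might turn "studies" into the unreadable "studi";
# the lemma is always a real word.
print(lemmatize("studies"))  # -> 'study'
```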
— Lower Casing
Why do we need to lowercase words? Because “Vritra” and “vritra” mean the same thing, but their character codes are different, so without lowercasing the model treats them as two separate tokens. That’s why we lowercase everything.
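You can see the effect on vocabulary size directly:

```python
tokens = ["Vritra", "vritra", "VRITRA"]

# Without lowercasing, the model sees three distinct vocabulary entries.
print(len(set(tokens)))  # -> 3

# After lowercasing, they collapse into one.
print(len({t.lower() for t in tokens}))  # -> 1
```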
Advanced Preprocessing
This is where developers mostly don’t work alone; language experts help them. Here we typically do things such as part-of-speech (POS) tagging, parsing, and coreference resolution.
3. Feature Engineering
In this step, we will perform Text Vectorization (text-to-vector)
→ Text vectorization techniques include One-Hot Encoding, Bag of Words, TF-IDF, and Word2Vec.
These are some of the older techniques used with classic deep learning models; for advanced models like encoder-decoders, generative models, and large language models, we mostly use transformer-based representations.
In this step we simply convert text into vectors. For intuition, take an image of a cat: every image is made up of pixels, each pixel is a colour block, and every colour has a numerical value, and that is how an image becomes a set of vector values. We do the analogous thing for text.
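As a concrete example of one of the classic techniques, here is a minimal Bag-of-Words vectorizer in pure Python; libraries such as scikit-learn ship a production version of this:

```python
def bag_of_words(corpus):
    """Turn each sentence into a count vector over the corpus vocabulary.
    A classic (pre-transformer) technique; modern models instead learn
    dense embeddings."""
    vocab = sorted({w for sent in corpus for w in sent.lower().split()})
    vectors = []
    for sent in corpus:
        words = sent.lower().split()
        vectors.append([words.count(v) for v in vocab])
    return vocab, vectors

corpus = ["My name is Vritra", "My name is my name"]
vocab, vecs = bag_of_words(corpus)
print(vocab)  # -> ['is', 'my', 'name', 'vritra']
print(vecs)   # -> [[1, 1, 1, 1], [1, 2, 2, 0]]
```

Note how every sentence now lives in the same 4-dimensional space; with a large vocabulary these vectors get huge and sparse, which is exactly the dimensionality problem mentioned earlier.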
4. Modelling
In this step you choose the model your LLM/generative system will be trained on.
Well, you can use →
1. An open-source LLM
2. A paid model, like OpenAI’s
Open-source LLMs usually need to be downloaded and trained locally, while paid models are trained on the provider’s own servers.
— It is hard to train locally: it is expensive and difficult to train a model with that many parameters.
— In the cloud case, the provider handles the whole training process on their servers. You just need to set up your instance, though you will still face issues like load balancing.
5. Evaluation
Well, we have two types of evaluation → intrinsic and extrinsic evaluation.
— In the intrinsic case, you use metrics to evaluate the model directly. This is performed by Gen-AI engineers.
— Extrinsic evaluation is performed after deployment: feedback forms, user signals, etc.
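To make the intrinsic case concrete, here is a toy metric, unigram precision: the fraction of generated tokens that also appear in a reference text. It is a simplified stand-in for real intrinsic metrics such as BLEU or perplexity:

```python
def unigram_precision(candidate, reference):
    """Fraction of candidate tokens that appear in the reference.
    A toy stand-in for intrinsic metrics like BLEU or perplexity."""
    cand = candidate.lower().split()
    ref = set(reference.lower().split())
    if not cand:
        return 0.0
    return sum(1 for w in cand if w in ref) / len(cand)

print(unigram_precision("my name is Vritra", "Vritra is my name"))  # -> 1.0
```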
6. Monitoring and Deployment
Here we keep monitoring whether something is going wrong; if it is, we revisit our evaluation and feature-engineering steps until the system solves the real issue.
This page has all articles for FREE - https://medium.com/ai-threads