Why is Gen AI so Complex?
[Figure: LLM Fine-tuning Architecture]

"Please explain #Transformers to me in 2-minutes"

This is often how a conversation starts in today's hyped #GenAI world. Unfortunately, I find myself incapable of doing this. The Transformer architecture itself is such a complex topic that I learn something new every time I read the paper again. Plus, there are so many additional concepts to understand to even get a grasp of Gen AI/LLMs, e.g., Tokenizers, Byte-Pair Encoding (BPE), Supervised Fine-tuning (SFT), Vector DBs, Retrieval Augmented Generation (RAG), Reinforcement Learning from Human Feedback (RLHF), Prompting Strategies, Chain-of-Thought (CoT), and Diffusion models.

So I thought of trying to summarize the key points that come to mind when I think of the state of the art of Gen AI/#LLMs today, and see if I can somehow tie all the above concepts into a storyline. Fair warning: this is going to be a techno-jargon-heavy post.

(*Disclaimer: by the end of this post I realised, "Wow, this is so much to learn and I know so little - too much happening too fast!")

---

*There is not much difference between the original #Transformer neural network architecture (2017 paper: Attention is All You Need) and the one fuelling modern GPTs. The original 2017 paper focused on language translation, and hence used an encoder-decoder architecture. GPTs are decoder-only architectures, as their primary goal is sentence completion. Yes, there is (a lot) more data and there are more layers, but architecturally not much has changed.

---

*If you want to sound a bit more geeky: an encoder-decoder requires #cross-attention, as it is trying to capture the relationship between source and target language tokens. Self-attention is sufficient for decoder-only architectures.
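
To make the distinction concrete, here is a minimal numpy sketch of single-head, scaled dot-product self-attention with a causal mask - the decoder-only flavour, where each token attends only to earlier positions. The toy shapes and random weights are purely illustrative, not those of any real GPT.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(X, Wq, Wk, Wv):
    # Project the token vectors into queries, keys, and values.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Scaled dot-product scores between every pair of positions.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    # Causal mask: a token must not attend to future positions.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores[mask] = -1e9
    # Attention weights mix the value vectors.
    return softmax(scores) @ V

# Toy usage: a "sentence" of 4 tokens, each an 8-dimensional vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(causal_self_attention(X, Wq, Wk, Wv).shape)  # (4, 8)
```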

---

*Text needs to be converted into numerical #vectors to be processed by computers, and this is done using Tokenizers. #Tokenization plays a key role in LLM training; however, building (training) Tokenizers is a separate line of work. A token can be a byte, a character, a set of characters, a word, or even a full sentence. Byte-Pair Encoding (#BPE) is the most common Tokenizer today; it iteratively merges the most frequent pairs of adjacent bytes (or characters) into new tokens.

Selecting the right 'token' is key, as it impacts both the inter-token relationships that the neural network will be able to grasp, and the computational complexity of training that network.
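
As a rough illustration of the BPE idea, here is a toy merge loop in Python. Real tokenizers operate on raw bytes and add many refinements (special tokens, whitespace handling, etc.); this sketch only shows the core merge rule.

```python
from collections import Counter

def bpe_merges(corpus: str, num_merges: int):
    # Start with each word split into individual characters.
    words = [list(w) for w in corpus.split()]
    merges = []
    for _ in range(num_merges):
        # Count every pair of adjacent symbols across the corpus.
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]  # the most frequent adjacent pair
        merges.append(best)
        # Fuse that pair into a single new token everywhere it occurs.
        new_words = []
        for w in words:
            merged, i = [], 0
            while i < len(w):
                if i < len(w) - 1 and (w[i], w[i + 1]) == best:
                    merged.append(w[i] + w[i + 1])
                    i += 2
                else:
                    merged.append(w[i])
                    i += 1
            new_words.append(merged)
        words = new_words
    return merges, words

merges, tokens = bpe_merges("low lower lowest low low", num_merges=4)
print(merges)  # learned merge rules, e.g. ('l', 'o') then ('lo', 'w') ...
print(tokens)  # the corpus re-tokenized with the merged vocabulary
```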

---

*ChatGPT is not the same as GPT: ChatGPT is not a foundational LLM, nor has it been built in a fully unsupervised fashion. Supervised Fine-tuning (#SFT) has already been applied to build ChatGPT.

Reasoning: GPT-3, for example, is a foundational LLM pre-trained on unstructured/unlabeled data in an unsupervised fashion. However, it can only complete sentences.

SFT is then performed with labeled query-response pairs in the form of a chat, to build a task-specific Assistant model, such as ChatGPT.

Yes, it is possible to further fine-tune ChatGPT, but that does not make ChatGPT a foundational LLM.
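
To make the SFT step tangible, a single labeled training record might look like the sketch below, in chat form. The field names are illustrative, not any specific vendor's schema.

```python
# A hypothetical SFT training record: one labeled query-response pair in
# chat form. Field names are illustrative, not any specific vendor's schema.
sft_example = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize this NDA clause in one sentence: ..."},
        {"role": "assistant", "content": "Both parties must keep shared information "
                                         "confidential for five years."},
    ]
}
# Fine-tuning minimizes the usual next-token loss, but only on the
# assistant turns, over many thousands of such records.
```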

---

*Yes, data is needed to enable Gen AI! However, we need to differentiate between public and #proprietary enterprise data. Public data (a lot of it) is needed to train foundational LLMs, and you can use them directly to, e.g., chat, summarize, and generate images. For example, you can use them out of the box to process NDAs, which look more or less the same across enterprises. Your marketing team can use them to generate promotional material without any fine-tuning, in most cases.

*Enterprise data can be a differentiator adding strategic value; however, it is only needed when you want to fine-tune LLMs - when you need responses customized to your enterprise domain/context.

This enterprise context can be provided via RAG, which is usually much easier to implement than SFT, as it does not require any labeled data or neural network fine-tuning.

#RAFT (Retrieval-Augmented Fine-Tuning) is a new technique combining RAG with domain-adaptive fine-tuning.

You will also hear a lot of talk about #DataQuality (DQ), and justifiably so - "garbage in, garbage out". However, it is important to again note here the difference between Gen AI and Business Intelligence (BI)/Predictive Analytics. Traditionally, DQ tools and frameworks have focused on structured (SQL) data - good-quality enterprise data is absolutely needed for your financial reports and supply chain forecasts. However, for Gen AI, as explained above, some use cases might be possible out of the box, even if your enterprise data is messy.

Gen AI/LLMs mostly focus on unstructured data, so you will now need to enhance your ERP/Database/Data Warehouse-oriented DQ pipelines to also cover unstructured data. Start by establishing metadata tagging standards for documents :-)

---

*Retrieval Augmented Generation (#RAG) architectures are the craze these days. In simple terms, RAG says: in addition to the Prompt, I am also giving you (ChatGPT) the 5 pages where I believe the answer lies, as additional context - so generate your response accordingly. It is easy to see why RAG is a natural solution for limiting #hallucinations as well, as the responses are constrained by the user-provided context.

---

*You might be hearing a lot about #VectorDBs as well; we can explain them in a RAG context as follows (see the sketch after this list):

1. Pre-process documents into (numerical) vector embeddings, and store/index them in a Vector DB.

2. During prompting, search the Vector DB based on the prompt embedding (using vector similarity search) to retrieve the most relevant documents.

3. Prompt the underlying LLM with the retrieved passages as additional context - to generate contextualized responses.

Given their utility, you can use either a specialized Vector DB or a more traditional SQL DB (e.g., Postgres) extended with pgvector, which adds a vector column type, vector indexes, and vector similarity search.
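
Here is a minimal, self-contained Python sketch of the three steps above. The embed() function is a stand-in for a real embedding model, and the brute-force cosine-similarity search stands in for a Vector DB index (pgvector would handle steps 1 and 2 inside Postgres); the documents and query are made up for illustration.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Stand-in for a real embedding model: a pseudo-random but
    # per-text-consistent 128-dimensional vector.
    seed = abs(hash(text)) % (2**32)
    return np.random.default_rng(seed).normal(size=128)

# 1. Pre-process documents into vector embeddings and "index" them.
docs = [
    "Our standard NDA keeps shared information confidential for 5 years.",
    "Q3 supply chain report: lead times improved by 12%.",
    "Brand guide: use the primary palette in all promotional material.",
]
index = np.stack([embed(d) for d in docs])

# 2. At prompt time, retrieve the most similar documents (cosine similarity).
def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    sims = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q))
    return [docs[i] for i in np.argsort(sims)[::-1][:k]]

# 3. Prompt the underlying LLM with the retrieved passages as added context.
question = "How long does our NDA keep information confidential?"
context = "\n".join(retrieve(question))
print(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
```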

---

*Prompting strategy is important, and #PromptEngineering has emerged as a field for a reason. GPTs are still not fully autonomous, and it seems that they perform better if we tell them how to solve a complex problem by providing prompts in a structured manner. Chain-of-Thought (#CoT) has emerged as a powerful pattern here to get results for complex queries with few-shot prompting, by outlining intermediate reasoning steps.
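
For instance, a few-shot CoT prompt might look like the following hypothetical example, where the worked solution spells out the intermediate reasoning steps we want the model to imitate:

```python
# A hypothetical few-shot CoT prompt: the worked example demonstrates the
# step-by-step reasoning pattern before posing the actual question.
cot_prompt = """Q: A shop sells pens at 3 for $2. How much do 12 pens cost?
A: 12 pens is 12 / 3 = 4 groups of 3 pens. Each group costs $2,
so the total is 4 * $2 = $8. The answer is $8.

Q: A train travels 60 km in 45 minutes. What is its speed in km/h?
A:"""
```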

Given that most GPTs use prompt-based #pricing, with the context window playing a key role, it makes sense to try and optimize the (number of) prompts. Caching can be used to reduce the number of invocations, and the additional/relevant context provided by RAG can also help in reducing the number of prompts needed to reach the desired answer.
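
A minimal sketch of the caching idea, with a hypothetical call_llm() standing in for whatever (billed) client you use: identical prompts are served from a local cache instead of triggering another invocation.

```python
import functools

def call_llm(prompt: str) -> str:
    # Placeholder for a real (billed) LLM API call.
    print("LLM invoked")
    return f"response to: {prompt}"

@functools.lru_cache(maxsize=1024)
def cached_completion(prompt: str) -> str:
    # Repeated identical prompts are served from the cache,
    # saving an invocation (and its cost).
    return call_llm(prompt)

cached_completion("Summarize our NDA template")
cached_completion("Summarize our NDA template")  # cache hit: no second invocation
```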

---

*Let's now come to my favourite topic: #ReinforcementLearning, or Reinforcement Learning from Human Feedback (#RLHF) in the context of LLM training. In addition to SFT, a further step is added at the end, using human feedback to fine-tune the ChatGPT responses. The ChatGPT training pipeline then looks like:

Pre-trained foundational LLM (e.g., GPT) ---> apply SFT to build Task specific Assistant Model (e.g., ChatGPT) ---> Further fine-tune using RLHF to improve the quality of ChatGPT Responses.

Reinforcement Learning (RL) is a powerful technique that is able to achieve complex goals by maximizing a Reward function in real-time. RLHF deals with mapping human feedback - the ratings provided by human labellers to generated responses - to the RL Reward function. Unfortunately, this is easier said than done, as humans can be notoriously inconsistent in their ratings.

Proximal Policy Optimization (#PPO) is the thread connecting this (human-feedback-enabled) RL Reward model to LLM training. Without any guardrails, the language model can dramatically change its weights in an effort to "game" the Reward model. PPO provides a more stable means of updating the AI agent's policy by limiting how much the policy can change in each update iteration. #RLAIF is the new kid on the block, replacing the human 'H' in RLHF with 'AI': one LLM rating the responses generated by another LLM. While cost-effective, its effectiveness relative to RLHF needs to be studied further.
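
The guardrail PPO provides is its clipped surrogate objective: the probability ratio between the new and old policy is clipped to [1 - eps, 1 + eps], which bounds how far a single update can move the policy. A toy numpy sketch, with made-up numbers for illustration:

```python
import numpy as np

def ppo_clipped_objective(logp_new, logp_old, advantages, eps=0.2):
    # Probability ratio between the updated and the old policy.
    ratio = np.exp(logp_new - logp_old)
    # Clipping the ratio to [1 - eps, 1 + eps] bounds how far a single
    # update can move the policy (the "guardrail" described above).
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    return np.minimum(unclipped, clipped).mean()

# Toy numbers: log-probs of 4 sampled responses under the old/new policy,
# with advantages derived from the RLHF Reward model's scores.
logp_old = np.array([-1.2, -0.8, -2.0, -1.5])
logp_new = np.array([-0.9, -0.7, -2.4, -1.0])
adv = np.array([0.5, -0.2, 1.0, 0.3])
print(ppo_clipped_objective(logp_new, logp_old, adv))
```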

---

*Given that the whole Gen AI revolution started with image generation, let us take a few minutes to understand the key capabilities here. GPT-4 (and above, and the likes of it from Google, etc.) are #multimodal, meaning that the same model can process both text and image data.

However, while GPT-4 today can understand both text and image inputs, it relies on DALL-E to generate the output images.

Image generation started with Generative Adversarial Networks (#GANs), which consist of a pair of Generator and Classifier neural networks competing with each other to generate high-quality synthetic images. The Classifier is a discriminator network capable of distinguishing samples as either coming from the actual distribution or from the Generator. Every time the Classifier is able to spot a fake image, i.e., it notices a difference between the two distributions, the Generator adjusts its parameters accordingly. And so it continues, until the Generator gains a good understanding of the training image set.
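
The adversarial loop is easiest to see in code. The PyTorch sketch below trains a Generator to mimic a simple 1-D Gaussian instead of images, purely to illustrate the alternating Generator/Classifier updates; all sizes and hyperparameters are toy values.

```python
import torch
import torch.nn as nn

# Toy GAN: the Generator learns to mimic a 1-D Gaussian (mean 4, std 1.5)
# instead of images, purely to show the adversarial training loop.
G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    real = torch.randn(64, 1) * 1.5 + 4.0   # samples from the "true" distribution
    fake = G(torch.randn(64, 8))            # the Generator's attempt

    # 1. Train the Classifier (Discriminator) to tell real from fake.
    opt_d.zero_grad()
    loss_d = (bce(D(real), torch.ones(64, 1)) +
              bce(D(fake.detach()), torch.zeros(64, 1)))
    loss_d.backward()
    opt_d.step()

    # 2. Train the Generator to fool the Classifier.
    opt_g.zero_grad()
    loss_g = bce(D(fake), torch.ones(64, 1))  # G wants D to say "real"
    loss_g.backward()
    opt_g.step()

print(G(torch.randn(1000, 8)).mean().item())  # should drift towards 4.0
```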

GANs have since been upended by #Diffusion models (first proposed back in 2015), which are more peaceful (read: easier to train, as they do not use competing neural networks), though they are computationally more complex to train, requiring larger datasets. The idea is to progressively add Gaussian noise to the training images until they become essentially indistinguishable from random noise. For example, let us say that you take a picture of your house and blur parts of it. Now, if the #VisionTransformer network can denoise it based on its understanding of house image features, and recover the original, we can say that it has gained sufficient understanding of the training image set to be able to generate new images.
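
The forward (noising) half of this process has a convenient closed form, sketched below in numpy; the generative half trains a denoising network to predict the added noise from the noised sample and the timestep. The schedule and "image" here are toy values.

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    # Closed form of the forward (noising) process:
    #   x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
    # where alpha_bar_t is the cumulative product of (1 - beta).
    alpha_bar = np.cumprod(1.0 - betas)[t]
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

x0 = np.linspace(-1, 1, 16)            # stand-in for an image's pixel values
betas = np.linspace(1e-4, 0.02, 1000)  # a standard linear noise schedule
rng = np.random.default_rng(0)

# At small t the sample still resembles x0; at t=999 it is essentially
# pure noise. A denoising network learns to predict eps from (x_t, t).
print(forward_diffuse(x0, t=10, betas=betas, rng=rng))
print(forward_diffuse(x0, t=999, betas=betas, rng=rng))
```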

---

*Any Gen AI discussion today would not be complete without mentioning #LLMOps and #ResponsibleAI practices around explainability, bias/fairness, privacy & accountability. However, given that I have written about them in detail recently, let me leave you with pointers to the articles for now:

  1. D. Biswas. LLMOps - Generative AI Architectural Patterns. https://www.dhirubhai.net/pulse/generative-ai-architectural-patterns-debmalya-biswas-hlvye/
  2. D. Biswas. Responsible Generative AI Design Patterns. https://www.dhirubhai.net/pulse/responsible-generative-ai-design-patterns-debmalya-biswas-lzpce/


---

Comments

Julinda Gllavata

Head of Data Analytics & AI - Passionate about inclusive leadership and data science

7 months ago

Jakob Richi Alicja Kocieniewska fyi

Krishnan Sankarasubramanian

Principal Consultant @ Wipro Digital | Strategy Consulting, Corporate Strategy | Pre-Sales and Growth | Transformation delivery

7 months ago

Debmalya Biswas an excellent primer on #transformers. It took me back to my master's research work back in 2002, when my prof and I were building an NLP solution for Word Sense Disambiguation (#WSD). The model I used even back then was to #tokenise word collocations and train a #knowledgebase using this, initially for #kanji post-scripts. At the highest level, the principle is still the same, especially for the #decoder. Although I would like to think there was no generative AI until #matrix

Manish K.

Big Data Architect @ Accenture | Building Next-Gen Data Platforms

8 months ago

Tokenization, led by methods like Byte Pair Encoding (BPE), is crucial for numerical conversion of text. Models like ChatGPT leverage Supervised Finetuning (SFT) for task-specific enhancement. Retrieval Augmented Generation (RAG) architectures offer contextualized responses, countering hallucinations. In essence, Transformers remain foundational in modern AI advancements, shaping diverse applications in the GenAI field. Debmalya Biswas

Chuck R.

Founder, Collective Intelligence

8 months ago

It's not. "Keep on studying" is my advice to anyone who wants to understand GenAI. Give it six months - time is all you need.

Damien Lopez

Head of Technology and Innovation at Decision Lab UK

8 months ago

I think one thing people tend to forget is that ChatGPT is a prime example of RL with Human Feedback. And I love that my favorite version of ML is the backbone of everyone's favorite AI.
