LLMs: Model Size vs Corpus Size vs Corpus Quality
Most of us by now are familiar with OpenAI's ChatGPT, Google's Bard and the like. Large Language Models (LLMs) have shown promise in question answering, recommendation, summarization, elaboration, code, image and music generation, mathematical reasoning and more.
As we speak, LLMs are evolving at a rapid pace. In fact, they are evolving so quickly that it would be unfair to compare them by generations or even calendar years. A better reference on the time scale could be months (if not weeks).
One of the important aspects of model evolution is the model size, i.e., the number of parameters the LLM is trained with. For reference, BERT (2018) had 340 million parameters, GPT-2 (2019) had 1.5 billion and GPT-3 (2020) had 175 billion. Increasing the model size improves the breadth of capabilities and the effectiveness of the output; however, it comes at the cost of increased compute, memory and time on comparable systems and infrastructure. In a separate article, we will explore what these parameters stand for and how they improve the model.
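For intuition, the parameter count of a published model can be inspected directly. The snippet below is a minimal sketch, assuming the Hugging Face transformers library and the public gpt2 checkpoint (chosen here only because it is a small, freely available example):

from transformers import AutoModel

# Load a small public checkpoint and count its trainable parameters.
model = AutoModel.from_pretrained("gpt2")
n_params = sum(p.numel() for p in model.parameters())
print(f"gpt2 has roughly {n_params / 1e6:.0f}M parameters")  # ~124M for base GPT-2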
Another, similar consideration for improving LLMs is the size of the training corpus. The latest LLMs are trained on corpora of approximately 5 trillion tokens. Both common logic and empirical data suggest that a bigger corpus helps cover a wider variety of contexts as well as the nuances of different word associations within specific contexts.
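Corpus size here is measured in tokens rather than words or documents. As a rough illustration of how such counts are produced, the sketch below tokenizes a toy two-document corpus with the GPT-2 tokenizer (an assumption made only for this example; real pipelines shard and stream far larger datasets):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Toy "corpus" of two documents; production corpora are streamed from disk.
corpus = [
    "Large language models are trained on trillions of tokens.",
    "Corpus size is usually reported after tokenization, not in raw words.",
]
total_tokens = sum(len(tokenizer.encode(doc)) for doc in corpus)
print(f"Toy corpus size: {total_tokens} tokens")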
To begin with, model parameters (model size) were considered the dominant lever for improving LLMs. However, by 2022, in many scenarios (especially around code generation), the size of the training corpus also started gaining attention (Training Compute-Optimal Large Language Models, arXiv:2203.15556 [cs.CL]).
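The referenced paper's finding is often summarized as a rule of thumb of roughly 20 training tokens per parameter for compute-optimal training. The snippet below is only a back-of-the-envelope sketch of that approximation, not a formula quoted verbatim from the paper:

def compute_optimal_tokens(n_params, tokens_per_param=20):
    # Rough "Chinchilla-style" heuristic: ~20 tokens per parameter.
    return n_params * tokens_per_param

# A 70-billion-parameter model would want on the order of 1.4 trillion tokens.
print(f"{compute_optimal_tokens(70e9) / 1e12:.1f} trillion tokens")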
Interestingly, 2023 brought another very important consideration: the quality of the data in the training corpus. Researchers showed that models trained during the unsupervised pre-training phase on textbook-quality data can compete with, or even outperform, models with (relatively) more parameters and/or larger training corpora (Textbooks Are All You Need, arXiv:2306.11644 [cs.CL]).
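In practice, "textbook quality" is enforced by scoring and filtering documents before training. The sketch below is a toy illustration of that idea only; the quality_score heuristic is invented for this example and is not the classifier used in the cited paper (which relies on a learned rating of educational value):

def quality_score(doc: str) -> float:
    # Placeholder heuristic: reward longer, sentence-like text.
    # Real pipelines typically use a learned classifier of "educational value".
    words = doc.split()
    return min(len(words) / 50, 1.0) if doc.strip().endswith(".") else 0.0

raw_corpus = [
    "buy now!!! click here",
    "Recursion is a technique where a function calls itself on a smaller input until a base case is reached.",
]
filtered_corpus = [doc for doc in raw_corpus if quality_score(doc) > 0.3]
print(filtered_corpus)  # keeps only the explanatory sentence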
As enterprises plan to embrace Generative AI, it will be extremely important for technology leaders to follow this evolution and understand the trade-offs that different models bring with respect to their business use cases.
P.S.: This article represents my personal views, developed through my experience across various courses, engagements and assignments. It may or may not align with the views of my network or my present or past employer(s).