LLMs: Model Size vs. Corpus Size vs. Corpus Quality

Most of us are by now familiar with OpenAI's ChatGPT, Google's Bard, and the like. Large Language Models (LLMs) have proven promising for Q&A, recommendations, summarization, elaboration, code, image, and music generation, mathematical reasoning, and more.

As we speak, LLMs are evolving at a fast pace. In fact, they are evolving at such a rate that it would be unfair to compare them by generations or even calendar years. A better unit on the time scale might be months, if not weeks.

One of the important aspects of model evolution is model size, i.e., the number of parameters the LLM is trained with. For reference, BERT (2018) had 340 million parameters, GPT-2 (2019) had 1.5 billion, and GPT-3 (2020) had 175 billion. Increasing the model size improves the breadth of capabilities and the quality of output; however, it comes with a trade-off in compute, memory, and time on comparable infrastructure. In a separate article, we will explore what these parameters stand for and how they improve the model.
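As a rough illustration of the memory side of that trade-off, the Python sketch below estimates how much space the weights alone occupy at common numeric precisions (4 bytes per parameter at fp32, 2 at fp16). It is a back-of-the-envelope figure that ignores activations, optimizer state, and serving overhead:

    # Back-of-the-envelope estimate of the memory needed just to hold the
    # model weights, ignoring activations, optimizer state, and overhead.
    PARAM_COUNTS = {
        "BERT-large (2018)": 340e6,
        "GPT-2 (2019)": 1.5e9,
        "GPT-3 (2020)": 175e9,
    }

    BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}  # bytes per parameter

    for model, n_params in PARAM_COUNTS.items():
        for precision, n_bytes in BYTES_PER_PARAM.items():
            gib = n_params * n_bytes / 2**30  # bytes -> GiB
            print(f"{model}: ~{gib:,.1f} GiB of weights at {precision}")

Even at fp16, GPT-3's 175 billion parameters work out to roughly 326 GiB of weights alone, which is why models of that size must be sharded across many accelerators.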

Another, related lever for improving LLMs is the size of the training corpus. The latest LLMs are trained on corpora of approximately 5 trillion tokens. Both common sense and empirical data suggest that a bigger corpus covers a wider variety of contexts, as well as the nuances of word associations within specific contexts.

To begin with, model size (parameter count) was considered the most dominant lever for improving LLMs. By 2022, however, in many scenarios (especially around code generation), the size of the training corpus also started gaining attention (Training Compute-Optimal Large Language Models, arXiv:2203.15556v1 [cs.CL]).
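The headline result of that paper, often called the Chinchilla finding, is that for a fixed compute budget, parameter count and training tokens should grow together, roughly 20 tokens per parameter, with total training compute commonly approximated as C ≈ 6·N·D FLOPs. A quick worked example (the 70B figure is illustrative):

    # Worked example of the "Chinchilla" compute-optimal rule of thumb:
    # scale tokens with parameters (~20 tokens per parameter), with
    # training compute commonly approximated as C ≈ 6 * N * D FLOPs.
    def training_flops(n_params: float, n_tokens: float) -> float:
        """Commonly cited approximation of total training compute in FLOPs."""
        return 6 * n_params * n_tokens

    n_params = 70e9            # a 70-billion-parameter model
    n_tokens = 20 * n_params   # ~20 tokens per parameter
    print(f"Compute-optimal tokens: ~{n_tokens / 1e12:.1f} trillion")
    print(f"Training compute: ~{training_flops(n_params, n_tokens):.2e} FLOPs")

By this rule of thumb, a 70-billion-parameter model wants about 1.4 trillion training tokens, which is close to what the paper's own Chinchilla model actually used.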

Interestingly, 2023 surfaced another very important consideration: the quality of the data in the training corpus. Researchers showed that models trained during the unsupervised pre-training phase on textbook-quality data can match or even outperform models with (relatively) more parameters and/or larger training corpora (Textbooks Are All You Need, arXiv:2306.11644v1 [cs.CL]).
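To make the idea concrete, here is a minimal, hypothetical sketch of quality-based corpus filtering. The score_quality function is a stand-in for whatever quality model one builds (the paper used an LLM-assisted classifier to select its data); it is an assumption for illustration, not a real library call:

    # Minimal sketch of quality-based corpus filtering -- the general idea
    # behind "textbook quality" data curation. `score_quality` is a
    # hypothetical stand-in for a quality model, NOT a real library call.
    from typing import Callable, Iterable, Iterator

    def filter_corpus(
        documents: Iterable[str],
        score_quality: Callable[[str], float],  # hypothetical scorer, 0.0-1.0
        threshold: float = 0.8,
    ) -> Iterator[str]:
        """Yield only documents whose estimated quality clears the threshold."""
        for doc in documents:
            if score_quality(doc) >= threshold:
                yield doc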

As enterprises plan to embrace generative AI, it will be extremely important for tech leaders to follow this evolution and understand the trade-offs that different models bring with respect to their business use cases.

P.S.: This article represents my personal views, developed through my experience across various courses, engagements, and assignments. It may or may not align with the views of my network or of my present or past employers.


Gyanesh Shrivastava

Connected Vehicle Software | OTA | SDV | Telemetry

1y

Great start Binayak. Waiting for the next one!!!

Ramakrishna Vadla

IBM STSM, Global Product Owner, Lead Architect, IBM Storage Management, IBM ISDL

1y

Very nicely articulated on LLMs.

Debasish Dash, PRINCE2®

Product Dreamer (Conceived and Monetized 3 Enterprise Grade Products) | Digital Transformation Adviser | Clean Energy & Sustainability | Entrepreneur | E&U Thought Leader | Customer Advocate | People Manager

1y

This is absolutely just in time Binayak.. Please keep your focus on this, it is going to transform the whole world.

Shankar K Jha

Chief Executive Officer, Ecotech IT Solutions Pvt Ltd (a Weiss GmbH company)

1y

Good read. Looking forward to the next in this series. Keep posting, BD.
