LLMs: Model Size vs Corpus Size vs Corpus Quality
Most of us by now are familiar with OpenAI's ChatGPT, Google's Bard and the like. Large Language Models (LLMs) have shown promise in question answering, recommendation, summarization, elaboration, code, image and music generation, mathematical reasoning and more.
As we speak, LLMs are evolving at a rapid pace. In fact, they are evolving so quickly that it would be unfair to compare them by generations or even calendar years. A better reference on the time scale could be months (if not weeks).
One of the important aspects of model evolution is the model size, i.e., the number of parameters the LLM is trained with. For reference, BERT (2018) had 340 million parameters, GPT-2 (2019) had 1.5 billion and GPT-3 (2020) had 175 billion. Increasing the model size improves the breadth of capabilities and the effectiveness of the output; however, it comes at the cost of increased compute, memory and time on comparable systems and infrastructure. In a separate article, we will explore what these parameters stand for and how they improve the model.
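For intuition, the parameter count of a published model can be inspected directly. The snippet below is a minimal sketch, assuming the Hugging Face transformers library and the public gpt2 checkpoint (chosen here only because it is a small, freely available example):

from transformers import AutoModel

# Load a small public checkpoint and count its trainable parameters.
model = AutoModel.from_pretrained("gpt2")
n_params = sum(p.numel() for p in model.parameters())
print(f"gpt2 has roughly {n_params / 1e6:.0f}M parameters")  # ~124M for base GPT-2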
Another, similar consideration for improving LLMs is the size of the training corpus. The latest LLMs are trained on corpora of approximately 5 trillion tokens. Both common logic and empirical data suggest that a bigger corpus helps cover a wider variety of contexts as well as the nuances of different word associations within specific contexts.
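Corpus size here is measured in tokens rather than words or documents. As a rough illustration of how such counts are produced, the sketch below tokenizes a toy two-document corpus with the GPT-2 tokenizer (an assumption made only for this example; real pipelines shard and stream far larger datasets):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Toy "corpus" of two documents; production corpora are streamed from disk.
corpus = [
    "Large language models are trained on trillions of tokens.",
    "Corpus size is usually reported after tokenization, not in raw words.",
]
total_tokens = sum(len(tokenizer.encode(doc)) for doc in corpus)
print(f"Toy corpus size: {total_tokens} tokens")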
To begin with, model parameters (model size) were considered the dominant lever for improving LLMs. However, by 2022, in many scenarios (especially around code generation), the size of the training corpus also started gaining attention (Training Compute-Optimal Large Language Models, arXiv:2203.15556 [cs.CL]).
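The referenced paper's finding is often summarized as a rule of thumb of roughly 20 training tokens per parameter for compute-optimal training. The snippet below is only a back-of-the-envelope sketch of that approximation, not a formula quoted verbatim from the paper:

def compute_optimal_tokens(n_params, tokens_per_param=20):
    # Rough "Chinchilla-style" heuristic: ~20 tokens per parameter.
    return n_params * tokens_per_param

# A 70-billion-parameter model would want on the order of 1.4 trillion tokens.
print(f"{compute_optimal_tokens(70e9) / 1e12:.1f} trillion tokens")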
Interestingly, 2023 brought another very important consideration: the quality of the data in the training corpus. Researchers showed that models trained during the unsupervised pre-training phase on textbook-quality data can compete with, or even outperform, models with (relatively) more parameters and/or larger training corpora (Textbooks Are All You Need, arXiv:2306.11644 [cs.CL]).
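In practice, "textbook quality" is enforced by scoring and filtering documents before training. The sketch below is a toy illustration of that idea only; the quality_score heuristic is invented for this example and is not the classifier used in the cited paper (which relies on a learned rating of educational value):

def quality_score(doc: str) -> float:
    # Placeholder heuristic: reward longer, sentence-like text.
    # Real pipelines typically use a learned classifier of "educational value".
    words = doc.split()
    return min(len(words) / 50, 1.0) if doc.strip().endswith(".") else 0.0

raw_corpus = [
    "buy now!!! click here",
    "Recursion is a technique where a function calls itself on a smaller input until a base case is reached.",
]
filtered_corpus = [doc for doc in raw_corpus if quality_score(doc) > 0.3]
print(filtered_corpus)  # keeps only the explanatory sentence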
As enterprises plan to embrace Generative AI, it will be extremely important for technology leaders to follow this evolution and understand the trade-offs that different models bring with respect to their business use cases.
P.S.: This article represents my personal views, developed through my experience across various courses, engagements and assignments. It may or may not align with the views of my network or my present or past employer(s).