Is 3 better than 2? Better be!
Image created by the author using DALL-E

A new baby has been born in the META family, exactly nine months later! META released its Llama 2 family of models on July 18th, 2023, and now the Llama 3 models (at least two models, with one more still training!) were released yesterday, April 18th, 2024.

We are well aware of Mark Zuckerberg’s intention to create open-source AGI (Artificial General Intelligence), which he shared in an update on Instagram about 13 weeks ago. These two models in the Llama 3 family are META’s iterative step toward open-source AGI in a few years.

In this post, I will review what META released and how it compares to their previous babies, the Llama 2 models. Additionally, I have started a tracker where I follow the scale and performance of various models, starting with the 10 most promising open-source and closed-source models.

Models:

The architecture remains more or less the same decoder-only transformer (read more on the decoder-only transformer architecture). They have built a new tokenizer with a vocabulary of 128K tokens, which gives them improved model performance.
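If you want to see that 128K-token vocabulary for yourself, here is a minimal sketch using the Hugging Face transformers library. The meta-llama/Meta-Llama-3-8B repository name and the gated-access setup are my assumptions about how the checkpoint is published, not something taken from META's announcement.

```python
# Minimal sketch: inspect the Llama 3 tokenizer vocabulary size.
# Assumes `transformers` is installed and that you have accepted the
# license for the gated meta-llama/Meta-Llama-3-8B repo on Hugging Face.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

print(f"Vocabulary size: {len(tokenizer):,}")                  # expect ~128K entries
print(tokenizer.tokenize("Is 3 better than 2? Better be!"))    # how text gets split
```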

Datasets:

  • They used more than 15 trillion tokens to train the Llama 3 models.
  • In comparison, Llama 2 used a mere 2 trillion tokens for its training.
  • In layman’s terms, 2 trillion tokens are roughly 20 million books, so with 15 trillion tokens Llama 3 was trained on the equivalent of a whopping 150 million books (the arithmetic is written out in the short sketch after this list).
  • The New York Public Library holds approx. 45 million research items (including books), so to train the Llama 3 models META cobbled together more than three New York Public Libraries’ worth of information. I wonder where this data is coming from… It’s definitely not coming from you and me using Facebook, WhatsApp and Instagram - at least META says publicly that it is not using user data to train these models.
  • About 5% of the data (roughly 750B tokens) is high-quality non-English data covering over 30 languages.
  • Does that mean this is a multi-lingual model? No, it’s an English-only model. But you can fine-tune it to make it multi-lingual.
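Here is the back-of-the-envelope arithmetic behind the book and library comparisons above. The ~100K tokens-per-book figure is my own rough assumption, implied by the "2 trillion tokens ≈ 20 million books" comparison.

```python
# Back-of-the-envelope: how many "books" and "libraries" is 15T tokens?
TOKENS_PER_BOOK = 2e12 / 20e6          # ~100K tokens per book (rough assumption)
llama3_tokens = 15e12
NYPL_ITEMS = 45e6                      # approx. research items at the NYPL

books = llama3_tokens / TOKENS_PER_BOOK            # ~150 million books
print(f"Llama 3 corpus ≈ {books / 1e6:.0f}M books "
      f"(~{books / NYPL_ITEMS:.1f} New York Public Libraries)")
print(f"Non-English share ≈ {0.05 * llama3_tokens / 1e9:.0f}B tokens")
```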

Compute:

  • META used its world-class 24K-GPU clusters to train these models.
  • They trained on 16K GPUs simultaneously. I think this is the largest GPU cluster size publicly known for model training.
  • From a performance point of view, they were able to achieve 400 TFLOPS per GPU (H100-80GB), which is 40% utilization given that an NVIDIA H100 delivers roughly 1,000 TFLOPS of FP16 compute (the arithmetic is sketched below). This is a very good utilization factor - in fact, only a handful of engineering teams can extract that kind of performance from GPUs at this scale. Bravo!
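For those curious how the 40% figure falls out, here is the utilization math as a tiny sketch. Treat both inputs as the rough values quoted above rather than precise measurements; the 1,000 TFLOPS peak is the commonly cited dense FP16/BF16 figure for the H100.

```python
# Rough GPU utilization math behind the "40%" claim above.
achieved_tflops_per_gpu = 400      # reported by META for H100-80GB
peak_tflops_fp16 = 1000            # approx. dense FP16/BF16 peak for H100

utilization = achieved_tflops_per_gpu / peak_tflops_fp16
print(f"Utilization: {utilization:.0%}")    # -> 40%

# Aggregate throughput across the 16K GPUs used simultaneously:
num_gpus = 16_000
print(f"Cluster throughput: {achieved_tflops_per_gpu * num_gpus / 1e6:.1f} EFLOPS")
```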

Cost:

  • The 8B model was trained with 1.3M GPU-hours and the 70B model with 6.4M GPU-hours. At $12.5/GPU-hour (on-demand cost at AWS), the 8B model would have cost about USD $16 million and the 70B model about USD $80 million (not bad compared to Google’s Gemini Ultra, estimated at USD $191 million!). This is just the compute cost to train the models, not including experiments, data preparation, and the biggest item of all, employee cost.
  • Let’s look at the employee cost. Based on public information, about 270 people are credited with building these Llama 3 models, including Mark Zuckerberg himself. With an average salary of $250K/year, nine months of that team works out to roughly USD $50 million to deliver this baby (the math is in the sketch below) - wow! just wow!!
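Here is the same cost arithmetic written out so you can plug in your own numbers. The $12.5/GPU-hour rate and $250K average salary are the rough assumptions used above, not official figures.

```python
# Back-of-the-envelope training and people cost, using the assumptions above.
GPU_RATE = 12.5                    # USD per GPU-hour (approx. AWS on-demand)

gpu_hours = {"Llama-3-8B": 1.3e6, "Llama-3-70B": 6.4e6}
for model, hours in gpu_hours.items():
    print(f"{model}: ~${hours * GPU_RATE / 1e6:.0f}M compute cost")

# Rough people cost: ~270 credited contributors over the ~9-month effort.
headcount, avg_salary, months = 270, 250_000, 9
people_cost = headcount * avg_salary * months / 12
print(f"People: ~${people_cost / 1e6:.0f}M")       # ~$50M
```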

Performance:

Llama 2 vs. Llama 3 performance comparison

  • Clearly Llama 3 has an edge over the Llama 2 models - in some cases a significant gain of 15 to 20 percentage points on general benchmarks such as MMLU (read more about what all these benchmarks are).
  • I see two reasons for the improved performance: 1) the 7x increase in training data, so the model has learned a lot more than its predecessor; 2) the smaller model grew from 7B to 8B parameters. That extra ~1B parameters is a jump of roughly 14%, which should also contribute to the gain.
  • Compared to non-META models in the open-source game, Llama 3 8B has an edge over its 7B cousins from Mistral and Google. Similarly, Llama 3 70B fares well against Google’s Gemini Pro 1.0 and Mixtral 8x22B (which was released just two days ago!).

Llama 3 comparison


Other things:

  • The context window of the base models has doubled from 4K to 8K tokens. That still looks very small next to closed-source models, where 200K is becoming standard and some go up to 1M tokens. But IYKYK: since Llama 3 is an open model (I will talk about the license in a second), you can extend the context length yourself - a kind of after-market hack (see the sketch after this list).
  • On the license, it remains the META special, with permission for research and commercial use as long as you don’t have more than 700 million monthly active users. Most enterprises won’t hit that ceiling, but Llama 3 does change some terms, such as requiring the “Built with Meta Llama 3” branding as part of Meta’s brand strategy.
  • Another thing they missed last time is a simultaneous release. If I remember correctly, with Llama 2 they launched on Hugging Face and then Microsoft, with AWS in parallel or a day later. This time it launched everywhere at the same time - great job to the SageMaker team and Ankur Mehrotra for getting it out on release day itself.
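On the context-window point above, here is a minimal sketch of the kind of after-market context extension I mean, using RoPE scaling through the Hugging Face transformers library. This is a community technique, not something META ships or endorses; the repository name, scaling type, and factor are my assumptions, and quality beyond the native 8K window usually needs long-context fine-tuning to hold up.

```python
# Minimal sketch: stretch the native 8K context with RoPE scaling.
# Community technique, not an official META feature; assumes the gated
# meta-llama/Meta-Llama-3-8B checkpoint and `transformers` are available.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    # "dynamic" NTK-style scaling; factor=2.0 roughly targets a 16K window.
    rope_scaling={"type": "dynamic", "factor": 2.0},
)

# Prompts longer than 8K tokens become *possible*, but accuracy past the
# native window is not guaranteed without further fine-tuning.
```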

First-hand unboxing:

Along with the launch of Llama 3, META also enabled its meta.ai website with a chat experience powered by the Llama 3 70B model. In my test, I am satisfied with the results (read: very good for a free model).

  • Here is my result with meta.ai (it covered the gist of what Cerebras does, with some minor omissions):

author's prompt to Meta AI

  • Here is my result with the same prompt on Claude Opus (it missed the model-service offering, which Llama 3 covered):

author's prompt to Claude Opus

LLM Tracker:

  • With so many models being released every week, it’s not easy for me to keep track of what’s happening, and I am sure the same goes for many of you. So, to track their relative performance, I have started an LLM tracker (a minimal sketch of the idea follows below). For now, I am tracking the MMLU scores of the 10 or so most performant models, and I will keep updating it as more are released.

author's tracking mechanism of LLM performance
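If you want to keep a similar tracker yourself, a plain dictionary plus a sorted print goes a long way. The sketch below uses placeholder values purely for illustration; swap in whatever scores your own tracker records.

```python
# Minimal LLM tracker sketch: model -> MMLU score.
# The zeros below are illustrative placeholders, NOT measured results;
# replace them with the numbers you actually track.
mmlu_scores = {
    "Llama 3 70B": 0.0,     # fill in
    "Llama 3 8B": 0.0,      # fill in
    "Mixtral 8x22B": 0.0,   # fill in
    "Gemini Pro 1.0": 0.0,  # fill in
    "Claude 3 Opus": 0.0,   # fill in
}

def print_leaderboard(scores: dict[str, float]) -> None:
    """Print models ranked by MMLU score, highest first."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    for rank, (model, score) in enumerate(ranked, start=1):
        print(f"{rank:>2}. {model:<16} MMLU: {score:.1f}")

print_leaderboard(mmlu_scores)
```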

Conclusion

It's amazing to see META deliver back-to-back higher-quality models in less than 12 months. They are still training their biggest model, at 400 billion parameters, which may break many more records. With all the advancement happening in open-source LLMs, we are not far behind closed-source, and the rivalry has only just begun. All of this means the future is bright and full of AI-driven progress! What do you think?


Shameless plug:

Do you know someone who could benefit from learning the fundamentals of Artificial Intelligence (AI) and Machine Learning (ML)? You are in luck!

I have created a fundamentals course on AI/ML where I explain this complex topic in the simplest way - some of my students call it "oversimplifying"!

Click on this link and gift them the course - and yes, they do not need a technical background. I mean it - otherwise they get their money back! Guaranteed!

