GPT-4 - Reverse Engineering
Ah, Yes. GPT-4.
OpenAI emerges from the shadows..
Introducing GPT-4!
The "Technical" Report
The colossal marvel is accompanied by a detailed article overflowing with insights into the model's entire research and development journey.
I've taken it upon myself to condense the 97-page essay into a neat summary for you all:
[ ]
Now, let's dive into everything we know about the model EXCEPT for the "Technical" Report:
Multi-Modality
It is multi-modal: it can receive both an image and text, allowing it, for example, to answer questions about the content of the image.
It is very impressive. We cannot take that away from them.
Another example: during the demo, the model was able to answer questions about academic articles while receiving only a picture of them [and answering in text].
In the modeling business we call this "A Huge hint as to how the model was trained".
Also during yesterday's demo, the model managed to create a fully functional web page (with working JavaScript) based on a crude pen sketch of a user interface.
In the modeling business we call this..
As we all already know from the top computer vision models, multi-modal models are significantly better than single-modal ones: training on both text and images greatly improves model performance, because there is valuable information in the intersection of the two.
Training on both improves each separately.
How is it on text?
After playing with it all night, I can say that its answers are more or less the same as ChatGPT's in day-to-day usage, simply because ChatGPT is already just that good.
Seriously, in most cases: how is it even possible to answer "better" than ChatGPT already does?
But! And there is a huge "but".
As your task becomes more and more difficult, the differences between the models become clear:
Note: In Hebrew it is significantly better than ChatGPT. But still not perfect.
Yes, but we already know that in GPT-3's paper some of the standard "measurements" were.. innovative. [1]
However, there was one particular metric that caught the attention of many.
Hindsight Neglect - a specific test on which larger models do worse. Except for GPT-4, which gets 100% on it.
On this benchmark, the model is shown a decision whose outcome contradicts its expected value, say a bet with positive expected value that happened to lose, and is asked whether taking the bet was the right call. A sibling task from the same Inverse Scaling Prize, "redefine-math", asks the model to forget a known fact, like "Let's set Pi to be 4" [as engineers do anyway], and then solve math questions with the new Pi.
As it turns out, on these benchmarks, as models get bigger they tend to get worse.
How did they beat this test? I'm guessing they just created specially tailored synthetic data such that the model would be able to understand these types of problems much better.
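To make this concrete, here is roughly what a Hindsight Neglect item looks like (paraphrased for illustration; the exact benchmark wording differs):

```python
# Illustrative Hindsight Neglect item (paraphrased, not the exact
# benchmark wording). The outcome contradicts the expected value,
# and the correct answer follows the expected value, not the outcome.
item = {
    "prompt": (
        "Q: David can play a game with a 91% chance of losing $5 and a "
        "9% chance of winning $250. He plays and loses $5. Did he make "
        "the right decision? Answer Yes or No.\n"
        "A:"
    ),
    # Expected value: 0.09 * 250 - 0.91 * 5 = +$17.95, so "Yes",
    # even though this particular play lost money.
    "answer": "Yes",
}
```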
Controlling the model's "Personality"
ChatGPT rambles too much, right?
Well, this model has been specially trained so that its "personality" is controllable: it can be told in advance to answer in a specific way through "system messages", the same messages that control the model behind the scenes and were introduced with the ChatGPT API.
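For reference, this is what that looks like through the chat API (a minimal sketch using the openai Python library; the persona text is just an example):

```python
import openai  # pip install openai

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        # The system message fixes the "personality" up front..
        {"role": "system",
         "content": "You are a terse assistant. Answer in one sentence."},
        # ..and every user message is answered under that persona.
        {"role": "user", "content": "What is a transformer?"},
    ],
)
print(response["choices"][0]["message"]["content"])
```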
And this leads us to..
Conspiracy confirmed: GPT-4 == Bing!
Confirmation: Bing ("Sydney") was GPT-4 all along.
Training on long texts
The model supports up to 32,000 tokens, which is around 50 pages of human language.
In my opinion, the length of the input is one of the most critical aspects of the whole field: if you had access to a model with an enormous context length, you could simply "paste" an entire book into the prompt, or even an entire database, and ask it questions.
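As a quick sanity check of what actually fits, you can count tokens with OpenAI's tiktoken library (a sketch; "book.txt" is a placeholder file, and cl100k_base is the encoding used by the chat models):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")
with open("book.txt", encoding="utf-8") as f:
    n_tokens = len(enc.encode(f.read()))

# 32k tokens is roughly 50 pages; a full novel is usually several
# times that, so "paste a whole book" is still out of reach.
print(f"{n_tokens} tokens; fits in 32k context: {n_tokens <= 32_000}")
```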
How did they do it?
Contrary to what many believe, transformers actually work really well if you train them with longer sequences. I do it every day.
So given that GPT-3 has the same architecture as GPT-2, except it doesn't have the same architecture and is instead a Sparse Transformer.. [2]
Which was also upgraded with relative attention at some point.. [3]
..we can assume that it can handle longer texts just fine. So they probably just trained on longer texts as well, without making too many changes.
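For intuition, the Sparse Transformer's trick looks roughly like this (a toy sketch of the strided-plus-local attention mask from Child et al.'s paper; OpenAI's actual pattern and hyperparameters are not public):

```python
import numpy as np

def sparse_attention_mask(seq_len: int, stride: int, window: int) -> np.ndarray:
    """Causal mask combining a local window with a strided pattern,
    in the spirit of the Sparse Transformer."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        # Local band: attend to the last `window` tokens (and itself).
        mask[i, max(0, i - window): i + 1] = True
        # Strided pattern: also attend to every `stride`-th earlier token.
        mask[i, np.arange(0, i + 1, stride)] = True
    return mask

# Each row attends to O(window + seq_len / stride) positions instead of
# O(seq_len), which is what makes longer sequences affordable.
print(sparse_attention_mask(16, stride=4, window=2).sum(axis=1))
```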
From experience, I can tell you that training on an entire book, or on very long texts in general, also dramatically improves the model even on shorter texts, probably because it captures longer dependencies that are only represented in very long texts such as full books.
I guess they probably just trained the model on a large number of longer texts as well. Nothing fancy.
How is it that GPT-4 knows so much?
Although we don't know much about the training data, there are some facts about the data that we do know.
Not from the paper.
First of all, we know that OpenAI has added some "synthetic conversation samples" to the dataset [4] - Humans who simulate both sides of the conversation and show the model how a proper conversation should take place.
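For intuition, such a sample might look something like this (the format is hypothetical; OpenAI has not published the actual schema):

```python
# Hypothetical shape of a human-written conversation sample; the real
# schema OpenAI uses is not public.
synthetic_sample = {
    "messages": [
        {"role": "user",
         "content": "Can you explain overfitting simply?"},
        {"role": "assistant",  # written by a human playing the model
         "content": "Overfitting is when a model memorizes its training "
                    "data instead of learning patterns that generalize."},
        {"role": "user", "content": "How would I detect it?"},
        {"role": "assistant",
         "content": "Compare training loss to held-out validation loss; "
                    "a widening gap is the classic symptom."},
    ]
}
```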
Additionally, we know from past OpenAI papers that they have a large pool of "instructions" coming from real people, and that these instructions are fundamentally different from the open academic instruction datasets (like FLAN).
We have some examples of such instructions. [7]
Also, we know that many people with specific domain expertise, across a wide range of areas, have been employed by OpenAI in "Model QA" roles.
Chemists and mathematicians, for example, would work closely with the model to correct it when it was wrong. These teams of domain experts probably "forged" the model in their areas of expertise.
Side note: you can easily find many of them nowadays on Twitter, sharing what it was like.
I am guessing that the specific areas in which the model underwent this type of "oversight" are somewhat similar to the areas in which the model passes the standard licensing tests.
The new model is much better with facts
It "invents" links and other unique texts significantly less than GPT-3 does.
GPT-4, like its predecessors, has a tendency to "invent" information or produce incorrect content; this content may be harmful, because humans sometimes rely excessively on these types of models.
This topic becomes more and more relevant as large models become more and more human-like and are integrated into people's daily lives. To measure the severity of this problem in GPT-4, the researchers constructed a set of automated and human assessments based on real-world data.
The model has been optimized in order to reduce these "inventions", among other things, by using data collected from previous models such as ChatGPT.
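An automated assessment of this kind might look roughly like the following (a sketch; `ask_model` is a hypothetical stand-in for a real model call, and the items are made up):

```python
# Toy automated factuality check: ask the model closed questions with
# known answers and string-match the reference into the reply.
def ask_model(question: str) -> str:
    # Hypothetical stand-in for a real model API call; returns a canned
    # reply here so the sketch runs end to end.
    return "I believe the answer is 1889."

eval_set = [
    {"q": "In what year was the Eiffel Tower completed?", "ref": "1889"},
    {"q": "Who wrote 'On the Origin of Species'?", "ref": "Charles Darwin"},
]

def factuality_score(dataset) -> float:
    # Count answers that contain the reference string: a crude proxy
    # for "did not invent something else".
    hits = sum(item["ref"].lower() in ask_model(item["q"]).lower()
               for item in dataset)
    return hits / len(dataset)

print(f"factuality: {factuality_score(eval_set):.0%}")
```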
OK. But how did they actually do it?
Models tend to "invent" more in areas that they don't understand well enough.
In my opinion: OpenAI simply trained the models for much longer using much more accurate data.
I also suspect that during development, once they encountered a few "inventions", they simply added the correct data they wanted the model to know into the training set and tested again.
More interesting: as per OpenAI, most of the model's "knowledge" comes from the pretraining phase, not from the RLHF phase. RLHF only teaches the model to follow instructions in human language; it does not add knowledge. According to them, overtraining with RLHF even hurts the model's performance.
I confirm this. I see the same in my experiments.
Compute Power
The number of GPUs used in the training of GPT-4 is unknown, but some estimates suggest that this number could be more than 15,000.
In May 2020, Microsoft announced that it had built a cluster with 10,000 GPUs for OpenAI; it is believed to be the cluster that trained GPT-3. [5]
With the release of GPT-3.5 and ChatGPT in late 2022, it is likely that the number of GPUs has increased since then.
Morgan Stanley estimated that GPT-4 finished training last August. [6]
This matches exactly with the date posted yesterday by OpenAI about the end date of GPT-4's training.
Morgan Stanley also estimated that GPT-5 is currently training on 25,000 GPUs, with most of these GPUs also being used for GPT-4. [6]
Loss prediction prior to training
An interesting part of the paper is OpenAI's ability to predict the model's loss based on the computational power invested in it. The loss of huge models like GPT-4 is the main metric for measuring their performance, and given that training a model of this type costs millions of dollars, it is useful to know in advance whether the model will be "good".
By developing a large training infrastructure and scalable optimization methods, the researchers were able to accurately predict the model's final loss and its future capabilities based on the results of training small models. In the case of GPT-4, the team successfully predicted the final loss, as well as the model's eventual performance on the HumanEval benchmark.
Slightly interesting: in some past scaling-law expressions used for loss forecasts, there is a constant component that cannot be reduced. Some people on the internet have already rushed to call this component "irreducible intelligence", claiming that a model reaching this loss would be "AGI".
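To see the idea, here is a toy fit of that kind of law, including the irreducible term (a sketch with made-up numbers; the paper's actual functional form and data differ):

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(c, a, b, l_inf):
    # Power law in compute with an irreducible floor: L(C) = a*C^(-b) + l_inf
    return a * np.power(c, -b) + l_inf

# Made-up "small run" results (compute in FLOPs vs. final loss),
# generated from a hidden ground truth so the demo is self-consistent.
rng = np.random.default_rng(0)
compute = np.logspace(18, 21, 8)
loss = scaling_law(compute, 40.0, 0.07, 1.2) + rng.normal(0, 0.01, 8)

# Fit on the small runs, then extrapolate to a GPT-4-scale budget.
params, _ = curve_fit(scaling_law, compute, loss,
                      p0=(10.0, 0.05, 1.0), maxfev=50_000)
a, b, l_inf = params
print(f"irreducible loss term: {l_inf:.2f}")
print(f"extrapolated loss at 1e24 FLOPs: {scaling_law(1e24, *params):.2f}")
```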
Highlights from the paper
Another interesting topic that is important not to miss is what the "Technical" paper did choose to include.
High-quality data is the "secret". End of story.
How come only OpenAI can pull this off?
I would like to ask you a question: have you ever tried an open model in Huggingface's interface that actually impressed you?
Me neither.
The harsh truth: if you try to use the open solutions (e.g. Huggingface's Trainer) to train some open GPT model (e.g. GPT-2), it just won't work.
The models turn out "okay"-ish.. kind of.. nothing more.
The reason is that a tremendous number of tricks and optimizations go on behind the scenes, kept out of the open papers, and those are what it actually takes to train such models to a truly impressive result.
This is regardless of the data it is trained on or the size of the model.
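For concreteness, this is roughly the vanilla open recipe I mean (a minimal sketch; the dataset and hyperparameters are placeholders, and the point stands: it runs, it just doesn't produce magic):

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token          # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Placeholder corpus: wikitext-2, tokenized for causal LM training.
ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
ds = ds.filter(lambda row: len(row["text"]) > 0)
ds = ds.map(lambda batch: tok(batch["text"], truncation=True, max_length=512),
            batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-finetuned",
                           per_device_train_batch_size=4,
                           num_train_epochs=1),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```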
In the picture: GPT-5 will have a gazillion + 1 Parameters!
References:
[1] - "Creative" measurements in GPT-3's paper - https://twitter.com/suchenzang/status/1617093563061522432
[2] - GPT-3 Is a sparse transformer - https://paperswithcode.com/method/gpt-3
[3] - GPT-3 Upgraded to relative attention - https://arxiv.org/abs/2207.14255
[4] - ChatGPT was trained also on synthetic conversations - https://openai.com/blog/chatgpt
[5] - Microsoft builds a 10,000-GPU cluster - https://news.microsoft.com/source/features/ai/openai-azure-supercomputer/
[6] - Morgan Stanley estimates: https://www.reddit.com/r/MachineLearning/comments/tdytxf/d_gpt5_trained_on_25000_gpus/
[7] - Instructions collected by OpenAI: https://github.com/openai/following-instructions-human-feedback/tree/main/automatic-eval-samples