GPT-4 - Reverse Engineering
Ah, Yes. GPT-4.
OpenAI emerges from the shadows..
Introducing GPT-4!
The "Technical" Report
The colossal marvel is accompanied by a detailed article overflowing with insights into the model's entire research and development journey.
I've taken it upon myself to condense the 97-page essay into a neat summary for you all:
[ ]
Now, let's dive into everything we know about the model EXCEPT for the "Technical" Report:
Multi-Modality
It is multi-modal: it can receive both an image and text, allowing it, for example, to answer questions about the content of the image.
It is very impressive. We cannot take that away from them.
Another example: during the demo, the model was able to answer questions about academic articles while receiving only a picture of them [and answering in text].
In the modeling business we call this "A Huge hint as to how the model was trained".
Also during yesterday's demo, the model managed to create a fully functional web page (with working JavaScript) based on a crude pen sketch of a user interface.
In the modeling business we call this..
As we all already know from the top computer vision models, multi-modal models are significantly better than single-modal ones: training on both text and images greatly improves model performance, because there is valuable information in the intersection of the two.
Training on both improves each separately.
How is it on text?
After playing with it all night, I can say that its answers are more or less the same as ChatGPT's in day-to-day usage, simply because ChatGPT is already just that good.
Seriously, in most cases: how is it even possible to answer "better" than ChatGPT already does?
But! And there is a huge "but".
As your task becomes more and more difficult, the differences between the models become clear:
Note: In Hebrew it is significantly better than ChatGPT. But still not perfect.
Yes, but we already know that in GPT-3's paper some of the standard "measurements" were.. innovative. [1]
However, there was one particular metric that caught the attention of many.
Hindsight Neglect - a specific test on which larger models do worse. Except for GPT-4, which gets 100% on it.
On this benchmark, the model is shown a decision whose outcome contradicts its expected value, say a bet with positive expected value that happened to lose, and is asked whether taking the bet was the right call. A sibling task from the same Inverse Scaling Prize, "redefine-math", asks the model to forget a known fact, like "Let's set Pi to be 4" [as engineers do anyway], and then solve math questions with the new Pi.
As it turns out, on these benchmarks, as models get bigger they tend to get worse.
How did they beat this test? I'm guessing they just created specially tailored synthetic data such that the model would be able to understand these types of problems much better.
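To make this concrete, here is roughly what a Hindsight Neglect item looks like (paraphrased for illustration; the exact benchmark wording differs):

```python
# Illustrative Hindsight Neglect item (paraphrased, not the exact
# benchmark wording). The outcome contradicts the expected value,
# and the correct answer follows the expected value, not the outcome.
item = {
    "prompt": (
        "Q: David can play a game with a 91% chance of losing $5 and a "
        "9% chance of winning $250. He plays and loses $5. Did he make "
        "the right decision? Answer Yes or No.\n"
        "A:"
    ),
    # Expected value: 0.09 * 250 - 0.91 * 5 = +$17.95, so "Yes",
    # even though this particular play lost money.
    "answer": "Yes",
}
```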
Controlling the model's "Personality"
ChatGPT rambles too much, right?
Well, this model has been specially trained so that its "personality" is controllable: it can be told in advance to answer in a specific way through "system messages", the same messages that control the model behind the scenes and were introduced with the ChatGPT API.
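For reference, this is what that looks like through the chat API (a minimal sketch using the openai Python library; the persona text is just an example):

```python
import openai  # pip install openai

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        # The system message fixes the "personality" up front..
        {"role": "system",
         "content": "You are a terse assistant. Answer in one sentence."},
        # ..and every user message is answered under that persona.
        {"role": "user", "content": "What is a transformer?"},
    ],
)
print(response["choices"][0]["message"]["content"])
```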
And this leads us to..
Conspiracy confirmed: GPT-4 == Bing!
Confirmation: Bing ("Sydney") was GPT-4 all along.
Training on long texts
The model supports up to 32,000 tokens, which is around 50 pages of human language.
In my opinion, the length of the input is one of the most critical aspects of the whole field: if you had access to a model with an enormous context length, you could simply "paste" an entire book into the prompt, or even an entire database, and ask it questions.
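As a quick sanity check of what actually fits, you can count tokens with OpenAI's tiktoken library (a sketch; "book.txt" is a placeholder file, and cl100k_base is the encoding used by the chat models):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")
with open("book.txt", encoding="utf-8") as f:
    n_tokens = len(enc.encode(f.read()))

# 32k tokens is roughly 50 pages; a full novel is usually several
# times that, so "paste a whole book" is still out of reach.
print(f"{n_tokens} tokens; fits in 32k context: {n_tokens <= 32_000}")
```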
How did they do it?
Contrary to what many believe, transformers actually work really well if you train them with longer sequences. I do it every day.
So given that GPT-3 has the same architecture as GPT-2, except it doesn't have the same architecture and is instead a Sparse Transformer.. [2]
Which was also upgraded with relative attention at some point.. [3]
..we can assume that it can handle longer texts just fine. So they probably just trained on longer texts as well, without making too many changes.
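For intuition, the Sparse Transformer's trick looks roughly like this (a toy sketch of the strided-plus-local attention mask from Child et al.'s paper; OpenAI's actual pattern and hyperparameters are not public):

```python
import numpy as np

def sparse_attention_mask(seq_len: int, stride: int, window: int) -> np.ndarray:
    """Causal mask combining a local window with a strided pattern,
    in the spirit of the Sparse Transformer."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        # Local band: attend to the last `window` tokens (and itself).
        mask[i, max(0, i - window): i + 1] = True
        # Strided pattern: also attend to every `stride`-th earlier token.
        mask[i, np.arange(0, i + 1, stride)] = True
    return mask

# Each row attends to O(window + seq_len / stride) positions instead of
# O(seq_len), which is what makes longer sequences affordable.
print(sparse_attention_mask(16, stride=4, window=2).sum(axis=1))
```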
From experience, I can tell you that training on an entire book, or on very long texts in general, also dramatically improves the model even on shorter texts, probably because it captures longer dependencies that are only represented in very long texts such as full books.
I guess they probably just trained the model on a large number of longer texts as well. Nothing fancy.
How is it that GPT-4 knows so much?
Although we don't know much about the training data, there are some facts about the data that we do know.
Not from the paper.
First of all, we know that OpenAI has added some "synthetic conversation samples" to the dataset [4] - Humans who simulate both sides of the conversation and show the model how a proper conversation should take place.
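For intuition, such a sample might look something like this (the format is hypothetical; OpenAI has not published the actual schema):

```python
# Hypothetical shape of a human-written conversation sample; the real
# schema OpenAI uses is not public.
synthetic_sample = {
    "messages": [
        {"role": "user",
         "content": "Can you explain overfitting simply?"},
        {"role": "assistant",  # written by a human playing the model
         "content": "Overfitting is when a model memorizes its training "
                    "data instead of learning patterns that generalize."},
        {"role": "user", "content": "How would I detect it?"},
        {"role": "assistant",
         "content": "Compare training loss to held-out validation loss; "
                    "a widening gap is the classic symptom."},
    ]
}
```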
Additionally, we know from past OpenAI papers that they have a large pool of "instructions" coming from real people, and that these instructions are fundamentally different from the open academic instruction datasets (like FLAN).
We have some examples of such instructions. [7]
Also, we know that many people with specific domain expertise, across a wide range of areas, have been employed by OpenAI in "Model QA" roles.
Chemists and mathematicians, for example, would work closely with the model to correct it when it was wrong. These teams of domain experts probably "forged" the model in their areas of expertise.
Side note: you can easily find many of them nowadays on Twitter, sharing what it was like.
I am guessing that the specific areas in which the model underwent this type of "oversight" are somewhat similar to the areas in which the model passes the standard licensing tests.
The new model is much better with facts
It "invents" links and other unique texts significantly less than GPT-3 does.
GPT-4, like its predecessors, has a tendency to "invent" information or produce incorrect content; this content may be harmful, because humans sometimes rely excessively on these types of models.
This topic becomes more and more relevant as large models become more and more human-like and are integrated into people's daily lives. To measure the severity of this problem in GPT-4, the researchers constructed a set of automated and human assessments based on real-world data.
The model has been optimized in order to reduce these "inventions", among other things, by using data collected from previous models such as ChatGPT.
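An automated assessment of this kind might look roughly like the following (a sketch; `ask_model` is a hypothetical stand-in for a real model call, and the items are made up):

```python
# Toy automated factuality check: ask the model closed questions with
# known answers and string-match the reference into the reply.
def ask_model(question: str) -> str:
    # Hypothetical stand-in for a real model API call; returns a canned
    # reply here so the sketch runs end to end.
    return "I believe the answer is 1889."

eval_set = [
    {"q": "In what year was the Eiffel Tower completed?", "ref": "1889"},
    {"q": "Who wrote 'On the Origin of Species'?", "ref": "Charles Darwin"},
]

def factuality_score(dataset) -> float:
    # Count answers that contain the reference string: a crude proxy
    # for "did not invent something else".
    hits = sum(item["ref"].lower() in ask_model(item["q"]).lower()
               for item in dataset)
    return hits / len(dataset)

print(f"factuality: {factuality_score(eval_set):.0%}")
```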
OK. But how did they actually do it?
Models tend to "invent" more in areas that they don't understand well enough.
In my opinion: OpenAI simply trained the models for much longer using much more accurate data.
I also suspect that during development, once they encountered a few "inventions", they simply added the correct data they wanted the model to know into the training set and tested again.
More interesting: as per OpenAI, most of the model's "knowledge" comes from the pretraining phase, not from the RLHF phase. RLHF only teaches the model to follow instructions in human language; it does not add knowledge. According to them, overtraining with RLHF even hurts the model's performance.
I confirm this. I see the same in my experiments.
Compute Power
The number of GPUs used in the training of GPT-4 is unknown, but some estimates suggest that this number could be more than 15,000.
In May 2020, Microsoft announced that it had built a cluster with 10,000 GPUs for OpenAI; it is believed to be the cluster that trained GPT-3. [5]
With the release of GPT-3.5 and ChatGPT in late 2022, it is likely that the number of GPUs has increased since then.
Morgan Stanley estimated that GPT-4 finished training last August. [6]
This matches exactly with the date posted yesterday by OpenAI about the end date of GPT-4's training.
Morgan Stanley also estimated that GPT-5 is currently training on 25,000 GPUs, with most of these GPUs also being used for GPT-4. [6]
Loss prediction prior to training
An interesting part of the paper is OpenAI's ability to predict the model's loss based on the computational power invested in it. The loss of huge models like GPT-4 is the main metric for measuring their performance, and given that training a model of this type costs millions of dollars, it is useful to know in advance whether the model will be "good".
By developing a large training infrastructure and scalable optimization methods, the researchers were able to accurately predict the model's final loss and its future capabilities based on the results of training small models. In the case of GPT-4, the team successfully predicted the final loss, as well as the model's eventual performance on the HumanEval benchmark.
Slightly interesting: in some past scaling-law expressions used for loss forecasts, there is a constant component that cannot be reduced. Some people on the internet have already rushed to call this component "irreducible intelligence", claiming that a model reaching this loss would be "AGI".
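To see the idea, here is a toy fit of that kind of law, including the irreducible term (a sketch with made-up numbers; the paper's actual functional form and data differ):

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(c, a, b, l_inf):
    # Power law in compute with an irreducible floor: L(C) = a*C^(-b) + l_inf
    return a * np.power(c, -b) + l_inf

# Made-up "small run" results (compute in FLOPs vs. final loss),
# generated from a hidden ground truth so the demo is self-consistent.
rng = np.random.default_rng(0)
compute = np.logspace(18, 21, 8)
loss = scaling_law(compute, 40.0, 0.07, 1.2) + rng.normal(0, 0.01, 8)

# Fit on the small runs, then extrapolate to a GPT-4-scale budget.
params, _ = curve_fit(scaling_law, compute, loss,
                      p0=(10.0, 0.05, 1.0), maxfev=50_000)
a, b, l_inf = params
print(f"irreducible loss term: {l_inf:.2f}")
print(f"extrapolated loss at 1e24 FLOPs: {scaling_law(1e24, *params):.2f}")
```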
Highlights from the paper
Another interesting topic that is important not to miss is what the "Technical" paper did choose to include.
High-quality data is the "secret". End of story.
How come only OpenAI can pull this off?
I would like to ask you a question: have you ever tried an open model in Huggingface's interface that actually impressed you?
Me neither.
The harsh truth: if you try to use the open solutions (e.g. Huggingface's Trainer) to train some open GPT model (e.g. GPT-2), it just won't work.
The models turn out "okay"-ish.. kind of.. nothing more.
The reason is that a tremendous number of tricks and optimizations go on behind the scenes, kept out of the open papers, and those are what it actually takes to train such models to a truly impressive result.
This is regardless of the data it is trained on or the size of the model.
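For concreteness, this is roughly the vanilla open recipe I mean (a minimal sketch; the dataset and hyperparameters are placeholders, and the point stands: it runs, it just doesn't produce magic):

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token          # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Placeholder corpus: wikitext-2, tokenized for causal LM training.
ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
ds = ds.filter(lambda row: len(row["text"]) > 0)
ds = ds.map(lambda batch: tok(batch["text"], truncation=True, max_length=512),
            batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-finetuned",
                           per_device_train_batch_size=4,
                           num_train_epochs=1),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```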
In the picture: GPT-5 will have a gazillion + 1 Parameters!
References:
[1] - "Creative" measurements in GPT-3's paper - https://twitter.com/suchenzang/status/1617093563061522432
[2] - GPT-3 Is a sparse transformer - https://paperswithcode.com/method/gpt-3
[3] - GPT-3 Upgraded to relative attention - https://arxiv.org/abs/2207.14255
[4] - ChatGPT was trained also on synthetic conversations - https://openai.com/blog/chatgpt
[5] - Microsoft builds a 10,000-GPU cluster - https://news.microsoft.com/source/features/ai/openai-azure-supercomputer/
[6] - Morgan Stanley estimates: https://www.reddit.com/r/MachineLearning/comments/tdytxf/d_gpt5_trained_on_25000_gpus/
[7] - Instructions collected by OpenAI: https://github.com/openai/following-instructions-human-feedback/tree/main/automatic-eval-samples