The Missing Pieces of GPT-4
Some overlooked tricks & The future of open source LLMs.
> TL;DR: Reinforcement Learning from Inhuman (!!) Feedback.
The following text details the current state of open-source LLMs, highlighting some of the details that separate large-scale commercial LLMs from smaller-scale open models.
I gave an interesting talk yesterday with the same name ("The Missing Pieces of GPT-4"): while we wait for the video & slides to be up, I wanted to share some of the most interesting details with all of you.
*** Before I Start ***
I HAVE NO "SPECIAL"/"SECRET" KNOWLEDGE ABOUT GPT-4 WHATSOEVER.
All assumptions in this text can be wrong.
All info in this text comes from public sources or my own experimental results: I would be happy to be called out about any inaccurate info.
***
Part 1: Where is all of this going?
(My Opinion. Feel free to skip for the technical details)
----------
During OpenAI's visit to Israel last week, one question stood out: [1]
"Could the open-source LLMs potentially match GPT-4's abilities without additional technical advances?..
Or
...Is there a secret sauce in GPT-4, unknown to the world, that sets it apart from the other models?" - @ishaygreen
This is an amazing question!!
If Yes: It is very hard to admit publicly that GPT-4 uses such a powerful secret that we should not even try to compete with it.
If Not: They have a BIG problem.
[The rest of the question: "Or am I wasting my time installing Stable-Wizard-Vicuna-30B?"]
***
Sutskever's answer: "..There will always be a gap between the open source models and the private models and this gap may even be increasing with time.."
***
Right.
Here's What's Gonna Happen:
This will end up like Stable Diffusion vs. DALL-E 2. [2]
-
You will always be able to scale LLMs to infinity if you really want to.
But smaller LLMs are about to become so powerful that doing so will make no sense.
-
Reason:
They keep improving at an incredible rate every single day.
Yes, they are not perfect yet.
Yes, some measurements are world-class p-hacking.
But look at the trend. [And the speed]
***
[1] - https://youtube.com/watch?v=mC-0XqTAeMQ [26:30]
[2] - See an interesting discussion about this on @abacaj's tweet.
Part 2: LLaMA's Impact
You already know about LLaMA.
Meta's powerful LLM, released open-source in February.
Ever since then, the model has been at the center of a global effort to openly replicate the capabilities of the commercial LLMs.
Thereby providing millions of people all over the world access to this technology.
There are tens of thousands [1] of people from start-ups, government organizations, academic institutions and even private individuals contributing to this effort.
Thank you to the good people of Meta!!
This was an incredible gift to the world: giving us all a seat at the table for the first time.
[1] - Number of users on reddit, discord, twitter..
***
Where are we now?
There are 3 steps for training "ChatGPT"-like models:
1. Pretraining (Part 3)
2. Fine-tuning (Part 4)
3. Reinforcement learning (Parts 6 & 7)
* - ! DO IT YOURSELF ! - *
The "common knowledge" nowadays is that all of this is complex.. intimidating.. expensive.. It is not.
Starting from a base model, you can train LoRA adapters for all 3 stages (oftentimes even for free! (Colab)) - see the sketch below.
TRLx by @carperai: In my opinion, this is the best open-source RL package available today. I use it myself and it is very effective (and fast).
* - ! DO IT YOURSELF ! - *
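To make "train LoRA adapters" concrete, here is a minimal sketch using Hugging Face's peft library (the base model and hyperparameters are my own illustrative choices, not a prescription):

# Minimal LoRA setup with peft: freeze the base model, train tiny adapters.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("openlm-research/open_llama_7b")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # adapter rank: the main size/quality knob
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attach adapters to attention projections
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()        # typically well under 1% of all weights
# From here, train with any standard training loop; the same adapter trick
# applies to the fine-tuning and RL stages.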
Part 3: Pretraining
---------
This is where 99% of the compute (and cost) go. When you hear things like "4 million dollars to train..", it usually refers to this step.
Unfortunately, this step is usually out of reach for most people, for obvious reasons.
What do we know about this step?
***
FalconLM
By UAE's Technology Innovation Institute.
The Technology Innovation Institute stunned the world two weeks ago when it unveiled one of the most powerful base models ever: FalconLM.
The model held first place on Hugging Face's leaderboard by a considerable margin for several days.
And that's not all: it was released under a completely open license - commercial use is also allowed.
-
Another bomb was dropped days ago:
The UAE has started working on a monstrous base model: FalconLM with 180 billion parameters, using the massive 5-trillion-token dataset they collected and cleaned.
We don't have confirmed details about any of this yet but.. wow..
-
Important!
"Who will win? GPT-4 or Falcon 180B?".
There is no reason to compare them at all - we are only exposed to GPT-4 after the subsequent training steps.
This model (if the reports are true at all) is at an earlier stage: a base model.
.
BUT!
If you look at GPT-4's paper: the base model's performance metrics are known.
And they are not "that far off" from GPT-3's base model. Based on these numbers, the main difference between GPT-3 & GPT-4 is in the RL step.
Noted.
.
Apart from LLaMA, we are starting to see another trend:
Organizations are funding & releasing powerful base models.
(MosaicML's MPT-7B, UAE's FalconLM, Meta's LLaMA..)
This trend is absolutely a threat to GPT-4.
Part 4: Fine-Tuning
---------
TL;DR: No magic. You train the same as in pretraining, but on "solved tasks": a task followed by its solution. [Question > Answer, for example]
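As a concrete illustration, here is what a "solved task" looks like as a training string (the Alpaca-style template below is my assumption; any consistent format works):

# Sketch: turning (task, solution) pairs into plain training text.
# Training on these strings is ordinary next-token prediction, same as pretraining.
def format_example(instruction: str, response: str) -> str:
    return (
        "### Instruction:\n" + instruction + "\n\n"
        "### Response:\n" + response
    )

print(format_example("What is 2 + 2?", "4"))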
One of the most important discoveries by the open-source community was that this also works with synthetic data, which reduces the cost of collecting high-quality datasets significantly.
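In spirit, generating such synthetic data can be as simple as asking a stronger "teacher" model to solve your tasks (a toy sketch; the teacher model and seed tasks are my own placeholders):

# Sketch: building a synthetic fine-tuning dataset with a teacher model.
from transformers import pipeline

teacher = pipeline(
    "text-generation",
    model="databricks/dolly-v2-3b",  # placeholder: any capable instruct model
    max_new_tokens=128,
)

seed_tasks = ["Explain what a LoRA adapter is.", "Write a haiku about GPUs."]
dataset = [
    {
        "instruction": task,
        "response": teacher(task, return_full_text=False)[0]["generated_text"],
    }
    for task in seed_tasks
]
# 'dataset' can now be formatted as above and used for supervised fine-tuning.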
From this point forward, the speed of improvement has been incredible:
Days ago: An open model defeated GPT-4 on a logic puzzle written by.. GPT-4.
Yes, on ONE specific example. CALM DOWN.
This is the first anecdotal example I saw, so I thought I'd share it.
Because this is not going to be the last.
The best model: Nous-Hermes.
---------
Nous-Hermes | Among the top LLMs to date.
A particularly powerful model was released last week.
- Data Contributors: @Teknium1 , @karan4d , @NousResearch , @huemin_art , @winglian , @erhartford , @main_horse
- Funding: @RedmondIT
I highly recommend trying it; I use it myself instead of ChatGPT for anything other than code or math (we will get to why code is "different" in a moment).
It is so good that some people on Reddit are already starting to say that on certain prompts it is "at the level of GPT-4".
Even @elonmusk commented about it.
And just to let you know: as far as I know, the authors are already gathering feedback from users about specific topics to improve in the next iteration.
The secret is the data:
The dataset used to train this model is completely synthetic: specially generated and filtered, using several other models, to yield the most powerful model possible.
What did that paper say again? Fine-tuning is never going to work?
This model sits in first place, beating nearly the entire GPT4All leaderboard.
Part 5: What's missing?
---------
Before we move to RL, here are some of the capabilities we still need to work on:
Knowledge:
Major Problem: Code.
Honestly, I don't know what it is about code that is so different.
I don't think it's a data issue but something else; the gap is "too wide" in my opinion: there is something we don't know about.
Guess: It might be RL "to pass unit tests" (a sketch of the idea below).
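If that guess is right, the reward would look roughly like this (my speculation, not a known OpenAI recipe): run the generated code against unit tests and reward only what passes.

# Sketch: a "pass the unit tests" reward signal for RL on code.
# WARNING: executing model-generated code for real requires a sandbox.
import subprocess
import tempfile

def unit_test_reward(generated_code: str, test_code: str) -> float:
    """Return 1.0 if the generated code passes the tests, else 0.0."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=10)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0  # infinite loops earn nothing

# Plugged into an RL trainer (e.g. TRLx's reward_fn), this replaces human feedback.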
Reasoning:
- No one stands a chance against WizardLM (not even GPT-4).
WizardLM's "trick that makes the data harder", or similar improvements on the same idea (Orca), running at massive scale is a great path forward.
Feeling: There is something we don't know.
In my opinion, even though ChatGPT 3.5 is already eating dust from some of the open models that surpass it on several benchmarks, I still think that it behaves "too well".
A Guess:
In my opinion, the architecture might incorporate a mixture-of-experts, which allows training and integrating many abilities independently of each other.
I thought this was only useful at massive scale back when Google's paper introduced the idea, but when OpenAI also wrote about it last year, I started to suspect otherwise.
Suggestion: Joining forces to train a single model.
This type of training could be a significant breakthrough for open source: we could join forces to train experts independently of each other, and later integrate them into a single model that combines "the best" expert for each field (a toy sketch follows).
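To make the idea concrete, here is a toy mixture-of-experts layer (my own illustration of the general technique - GPT-4's actual architecture is unconfirmed):

# Toy mixture-of-experts layer: a router picks the top-k experts per token.
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    def __init__(self, dim: int, n_experts: int, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )
        self.router = nn.Linear(dim, n_experts)  # learns which experts fit each token
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = self.router(x).softmax(dim=-1)               # (B, T, n_experts)
        top_w, top_i = weights.topk(self.top_k, dim=-1)        # keep the k best experts
        expert_out = torch.stack([e(x) for e in self.experts], dim=-2)  # (B, T, E, D)
        idx = top_i.unsqueeze(-1).expand(*top_i.shape, x.size(-1))      # (B, T, k, D)
        chosen = expert_out.gather(dim=-2, index=idx)                   # (B, T, k, D)
        return (top_w.unsqueeze(-1) * chosen).sum(dim=-2)      # mix the chosen experts

# Independently trained experts could, in principle, be merged into such a layer.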
Part 6: Do we need RL at all?
---------
Possible explanations:
@yoavgo wrote an amazing piece about the need for RL when training LLMs.
You can read about it here: https://gist.github.com/yoavg/6bff0fecd65950898eba1bb321cfbd81
TL;DR: Supervised fine-tuning only teaches the model to imitate answers, pushing it to respond confidently even about things it doesn't know - which encourages hallucination; RL can also penalize bad answers and reward the model for expressing uncertainty or abstaining.
Part 7: Overlooked trick in GPT-4's paper
--------
TL;DR: Reinforcement Learning from Inhuman Feedback.
.
This reduces the effort (and cost) of RL to NOTHING.
And it also somehow works better in practice than RLHF on human-feedback datasets (HH).
I attempted this several months ago simply because.. "Where am I going to get data for RLHF now?"
Not knowing, at the time, that this is also mentioned in GPT-4's paper, albeit in a different context.
***
How it works:
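In short: replace the human raters with a model. A strong LLM grades your model's outputs, and those grades become the RL reward (GPT-4's paper does something similar with zero-shot GPT-4 classifiers as "rule-based reward models", there in a safety context). Below is a minimal sketch with TRLx; the judge model, its prompt, and the score parsing are my own illustrative assumptions:

# Sketch: RL from AI feedback with TRLx - a judge LLM supplies the reward.
import re
import trlx
from transformers import pipeline

# Placeholder judge: any instruction-following model that can output a rating.
judge = pipeline("text-generation", model="databricks/dolly-v2-3b", max_new_tokens=8)

def ai_reward_fn(samples, prompts, outputs, **kwargs):
    rewards = []
    for prompt, output in zip(prompts, outputs):
        verdict = judge(
            f"Rate this answer from 1 to 10.\nQuestion: {prompt}\nAnswer: {output}\nRating:",
            return_full_text=False,
        )[0]["generated_text"]
        match = re.search(r"\b(10|[1-9])\b", verdict)
        rewards.append(float(match.group(1)) if match else 1.0)  # default: low reward
    return rewards

trainer = trlx.train(
    "gpt2",                                          # swap in your own base model
    reward_fn=ai_reward_fn,
    prompts=["Explain LoRA in one sentence."] * 64,  # toy prompt set
)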
Just to wrap up:
Regardless of GPT-4: I encourage you to try it yourself.
It is very easy to implement with TRLx and it works incredibly well.
In the picture: The speed of improvements: papers released after LLaMA's release.