The Missing Pieces of GPT-4

Some overlooked tricks & The future of open source LLMs.


> TL;DR: Reinforcement Learning from Inhuman (!!) Feedback.


The following text details the current state of open-source LLMs, highlighting some of the details that set large-scale commercial LLMs apart from smaller-scale ones.


I gave a talk yesterday with the same name ("The missing parts of GPT-4"): while we wait for the video & slides to be up, I wanted to share some of the most interesting details with all of you.


*** Before I Start ***

I HAVE NO "SPECIAL"/"SECRET" KNOWLEDGE ABOUT GPT-4 WHATSOEVER.

All assumptions in this text can be wrong.

All info in this text comes from public sources or my own experimental results: I would be happy to be called out about any inaccurate info.

***




Part 1: Where is all of this going?

(My Opinion. Feel free to skip for the technical details)

----------


During OpenAI's visit to Israel last week, one question stood out: [1]

"Could the open-source LLMs potentially match GPT-4's abilities without additional technical advances?..

Or

...Is there a secret sauce in GPT-4, unknown to the world, that sets it apart from the other models?" - @ishaygreen

This is an amazing question!!


If yes: It is very hard to admit publicly that GPT-4 uses such a powerful secret that we should not even try to compete with it.

If not: They have a BIG problem.

[The rest of the question: "Or am I wasting my time installing Stable-Wizard-Vicuna-30B?"]

***

Sutskever's answer: "..There will always be a gap between the open source models and the private models and this gap may even be increasing with time.."

***

Right.

Here's What's Gonna Happen:

This will end up like Stable Diffusion vs DALL-E 2. [2]

-

You will always be able to scale LLMs up further if you really want to.

But smaller LLMs are about to become so powerful that it will make little sense to do so.

-

Reason:

They keep improving at an incredible rate every single day.

Yes, they are not perfect yet.

Yes, some measurements are world class p-hacking.

But look at the trend. [And the speed]

***

[1] - https://youtube.com/watch?v=mC-0XqTAeMQ [26:30]

[2] - See an interesting discussion about this on @abacaj's tweet.


Part 2: LLaMA's Impact

You already know about LLaMA.

Meta's powerful LLM, released open-source in February.

Ever since then, the model has been at the center of a global effort to openly replicate the capabilities of the commercial LLMs.

Thereby providing millions of people all over the world access to this technology.

There are tens of thousands [1] of people from start-ups, government organizations, academic institutions and even private individuals contributing to this effort.

Thank you to the good people of Meta!!

This was an incredible gift to the world: giving us all a seat at the table for the first time.

[1] - Number of users on reddit, discord, twitter..

***

Where are we now?

There are 3 steps for training ChatGPT-like models:

  • Pretraining.
  • Fine-Tuning.
  • Reinforcement Learning.



* - ! DO IT YOURSELF ! - *

The "common knowledge" nowadays is that all of this is complex.. intimidating.. expensive.. It is not.

Starting from a base model, you can train LoRA adapters for all 3 stages (oftentimes even for free, on Colab). A minimal LoRA sketch follows this section.

  1. Try it yourself! 20B params. Free. Here: https://colab.research.google.com/drive/1VoYNfYDKcKRQRor98Zbf2-9VQTtGJ24k
  2. Pretraining: Check Huggingface's run_clm: https://huggingface.co/learn/nlp-course/chapter7/6
  3. Fine-Tuning: Check Vicuna's code: https://github.com/lm-sys/FastChat
  4. Reinforcement Learning: Check TRLx: https://github.com/CarperAI/trlx

TRLx by @carperai: In my opinion, this is the best open-source RL package available today. I use it myself and it is very effective (and fast).

* - ! DO IT YOURSELF ! - *
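
To make the "do it yourself" point concrete, here is a minimal sketch of attaching LoRA adapters to a base model with Hugging Face's peft library. The model name, LoRA rank, and target modules below are illustrative assumptions, not the settings of any specific project mentioned here.

```python
# Minimal LoRA sketch (illustrative settings, not a recipe from any project above).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_name = "huggyllama/llama-7b"  # placeholder: any causal LM on the Hub
tokenizer = AutoTokenizer.from_pretrained(base_name)
model = AutoModelForCausalLM.from_pretrained(
    base_name, torch_dtype=torch.float16, device_map="auto"
)

# Freeze the base model and add small trainable LoRA matrices to the attention projections.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # LLaMA-style module names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the full model

# From here it is ordinary causal-LM training, e.g. with transformers.Trainer
# on your tokenized dataset; the same adapter approach applies to pretraining,
# fine-tuning, and (with TRLx) the RL step.
```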



Part 3: Pretraining

---------


This is where 99% of the compute (and cost) goes. When you hear things like "4 million dollars to train..", it usually refers to this step.

Unfortunately, this is usually out of reach for most people, for obvious reasons.


What do we know about this step?

  • "Capabilities on exams appear to stem primarily from the pre-training" - GPT-4's paper tells us that the pretraining step is where most capabilities come from.
  • "Deduplicating Training Data Makes LLMs Better" - Literally a name of a paper by Google. Listen to it, it is 200% true.
  • "Repeating only a small fraction of the data.. ..can damage performance significantly" - Scaling Laws and Interpretability of Learning From Repeated Data, telling us the same thing.

***

FalconLM

By UAE's Technology Innovation Institute.


The Technology Innovation Institute stunned the world two weeks ago when they unveiled one of the most powerful base models ever: FalconLM.

The model held first place on Huggingface's leaderboard by a considerable margin for several days.

And that's not all: It was released under a completely open license: commercial use is also allowed.

-

Another bomb was dropped days ago:

The UAE has started working on a monstrous base model: FalconLM with 180 billion parameters, using the massive 5-trillion-token dataset they collected and cleaned.

We don't have confirmed details about any of this yet but.. wow..

-

Important!

"Who will win? GPT-4 or Falcon 180B?".

There is no reason to compare them at all - we are exposed to GPT-4 only after the next steps of the training.

This model (if true at all) is at an earlier stage: Base model.

.

BUT!

If you look at GPT-4's paper: GPT-4's base model's performance metrics are known.

And they are not "that far off" from GPT-3's base model. Based on these numbers, the main difference between GPT-3 & GPT-4 is in the RL step.

Noted.

.



Apart from LLaMA, we are starting to see another trend:

Organisations are Funding & Releasing powerful base models.

(MosaicML's MPT-7B, UAE's FalconLM, Meta's LLaMA..)

This trend is absolutely a threat to GPT-4.



Part 4: Fine-Tuning

---------


TL;DR: No magic. You train the same way as in pretraining, but on "solved tasks": a task followed by its solution. [Question > Answer, for example]
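
To be concrete, here is a minimal sketch of what such a "solved task" looks like as training text. The template is an Alpaca-style convention picked for illustration; any consistent format works.

```python
# Minimal sketch: turning a (question, answer) pair into plain training text.
# The template is an Alpaca-style convention chosen for illustration.
TEMPLATE = "### Instruction:\n{question}\n\n### Response:\n{answer}"

def format_example(question: str, answer: str) -> str:
    return TEMPLATE.format(question=question, answer=answer)

print(format_example(
    "What is the capital of France?",
    "The capital of France is Paris.",
))

# The formatted strings are tokenized and trained on with the same causal-LM
# objective as pretraining (optionally masking the loss on the instruction
# tokens so that only the response is learned).
```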


One of the most important discoveries by the open-source community was that this also works with synthetic data.

Meaning: the cost of collecting high-quality datasets drops significantly.


From this point forward, the speed of improvements has been incredible:

  • 2 Weeks after LLaMA: Alpaca - TL;DR: "We can train powerful models on synthetic data"
  • 3 Weeks after LLaMA: Vicuna - TL;DR: "We can measure our LLMs by using other LLMs"
  • 5 Weeks after LLaMA: Wizard - TL;DR: "We can ask LLMs to make data samples harder to get MUCH better LLMs"
  • 9 Weeks after LLaMA: Guanaco - TL;DR: "We can all now train huge models."
  • 14 Weeks after LLaMA: Orca - TL;DR: "GPT-4 prefers an open model over GPT-3.5" [This means nothing on its own, but it happening for the first time is a milestone]



Days Ago: An open model defeated GPT-4 on a logic puzzle written by.. GPT-4.

Yes, on ONE specific example. CALM DOWN.

This is the first anecdotal example I saw, so I thought I'd share it.

Because this is not going to be the last.


The best model: Nous-Hermes.

---------

Nous-Hermes | Among the top LLMs to date.

A particularly powerful model was released last week.

- Model: https://huggingface.co/NousResearch/Nous-Hermes-13b

- Authors: @Teknium1, @karan4d

- Data Contributors: @Teknium1 , @karan4d , @NousResearch , @huemin_art , @winglian , @erhartford , @main_horse

- Funding: @RedmondIT

I highly recommend trying it; I use it myself instead of ChatGPT for anything other than code or math (we will get to why code is "different" in a moment).

It is so good that some people on reddit are already starting to say that on certain prompts it is "at the level of GPT-4".

Even @elonmusk commented about it.

And just to let you know: As far as I know, the authors are already gathering feedback from the users about specific topics to improve for the next iteration.

The secret is the data:

The dataset used for training this model is completely synthetic: it was generated and filtered using several other models, specifically to yield the most powerful model possible.

What did that paper say again? Fine-tuning is never going to work?

This model takes first place on the GPT4All leaderboard, beating nearly every other entry.


Part 5: What's missing?

---------

Before we move to RL, here are some of the capabilities we still need to work on:

  1. A stronger base model: Since most of the capabilities of LLMs come from the pretraining step, a better base model would be great.

Knowledge:

  • Open Data Collection: A good tweet by @OfirPress highlighted some datasets that GPT-4 probably trained on: All of LibGen (4M+ books) | Some of Sci-Hub (80M+ papers) | All of GitHub | Parts of these data sources have been scraped and organized into open datasets, but not everything yet.
  • Conversations: I myself (like many others) have the entire reddit-pushshift database downloaded, and I am currently running a model that automatically rephrases selected subreddits into instruction form (a sketch of the idea follows this list). I suggest others do the same!
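
Here is a sketch of what that rephrasing step can look like. The prompt wording and the model choice below are only an illustration of the idea, not the exact pipeline.

```python
# Illustrative sketch of rephrasing a reddit thread into instruction form.
# The prompt wording and the model choice here are only an illustration.
from transformers import pipeline

rephraser = pipeline(
    "text-generation",
    model="NousResearch/Nous-Hermes-13b",  # any capable instruct model
    device_map="auto",
)

def to_instruction(post_title: str, top_comment: str) -> str:
    prompt = (
        "Rewrite the following reddit post and its top answer as a clean "
        "instruction/response pair.\n\n"
        f"Post: {post_title}\n"
        f"Top answer: {top_comment}\n\n"
        "Instruction:"
    )
    return rephraser(prompt, max_new_tokens=256)[0]["generated_text"]
```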

Major Problem: Code.

Honestly, I don't know what it is about code that is so different.

I don't think it's a data issue but something else; the gap is "too wide" in my opinion: there is something we don't know about.

Guess: It might be RL "to pass unit tests".

Reasoning:

- No one stands a chance against WizardLM (not even GPT-4).

WizardLM's "trick that makes data harder" or simillar improvements of the same idea (ORCA) running at massive scale is a great path forward.

Feeling: There is something we don't know.

In my opinion, even though ChatGPT 3.5 is already eating dust from some of the open models that surpass it on several benchmarks, I still think that it behaves "too well".

A Guess:

In my opinion, the architecture might incorporate a mixture-of-experts which allows training and integrating many abilities independently of each other.

I thought this was only useful at massive scale back when Google's paper introduced the idea, but when OpenAI also wrote about it last year, I started to suspect otherwise.

Suggestion: Joining forces to train a single model.

This type of training could be a significant breakthrough for open source: we could join forces to train experts independently of each other, and later integrate them into a single model that combines "the best" expert for each field.



Part 6: Do we need RL at all?

---------

  • Answer: Yes.
  • Reason: I just went to check.
  • Experiment: Same model: Trained with SFT | Trained with RL.
  • Result: Clearly RL yields better results. (Shown on the slides, I am waiting for the video as you are!)

Possible explanations:

@yoavgo wrote an amazing piece about the need for RL when training LLMs.

You can read about it here: https://gist.github.com/yoavg/6bff0fecd65950898eba1bb321cfbd81

TL;DR:

  1. Exploration: RL allows the model to "not learn word by word" what we want.
  2. Train the model "what not to say" in addition to "what to say"
  3. Allows the model to condition its output on its internal representation. Example: Train models to say "I don't know" if they don't know. [If you do this with SFT, the model randomly says "I don't know".. not based on its real knowledge]


Part 7: Overlooked trick in GPT-4's paper

--------

TL;DR: Reinforcement Learning from Inhuman Feedback.

  • In RLHF: The model generates multiple completion options and humans choose the best one (or rank them)..
  • Here, instead: Let's just ask GPT-4 "which one is best?".

.

This reduces the effort (and cost) of RL to NOTHING.

And it also somehow works better in practice when compared to RLHF on human feedback datasets (HH).

I attempted to do this several months ago simply because.. "Where am I going to get data for RLHF now?"

Not knowing that this is also mentioned in GPT-4's paper, albeit on a different topic.

***

How it works:

  1. Ask another model a yes/no question.
  2. Take the probability of the token "yes" and the token "no"
  3. Reward = p_yes / (p_yes + p_no)
  4. Apply PPO's update step.
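
Here is a minimal sketch of steps 1-3 with a Hugging Face causal LM as the judge. The judge model and the yes/no prompt are placeholders chosen for illustration; the resulting reward can then be wrapped in a batched reward function and handed to TRLx, which takes care of the rollouts and the PPO update in step 4.

```python
# Minimal sketch of the yes/no "judge" reward (steps 1-3 above).
# The judge model and the prompt template are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

judge_name = "NousResearch/Nous-Hermes-13b"  # any strong instruct model
tokenizer = AutoTokenizer.from_pretrained(judge_name)
judge = AutoModelForCausalLM.from_pretrained(
    judge_name, torch_dtype=torch.float16, device_map="auto"
)

def reward(prompt: str, completion: str) -> float:
    # 1. Ask the judge a yes/no question about the completion.
    question = (
        f"Question: {prompt}\n"
        f"Answer: {completion}\n"
        "Is this answer helpful and correct? Answer yes or no.\n"
        "Judgement:"
    )
    inputs = tokenizer(question, return_tensors="pt").to(judge.device)
    with torch.no_grad():
        next_token_logits = judge(**inputs).logits[0, -1]

    # 2. Take the logits of the tokens "yes" and "no".
    yes_id = tokenizer(" yes", add_special_tokens=False).input_ids[-1]
    no_id = tokenizer(" no", add_special_tokens=False).input_ids[-1]

    # 3. Reward = p_yes / (p_yes + p_no); a softmax over just these two
    #    logits gives exactly that ratio.
    p_yes, _ = torch.softmax(next_token_logits[[yes_id, no_id]], dim=-1)
    return p_yes.item()
```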


Just to wrap up:

  • I don't know how GPT-4 was trained.
  • I know that something is "different" about GPT-4 in comparison to everything else
  • Here is a VERY powerful method that is also hinted in GPT-4's paper.
  • So.. (and I might be wrong)

Regardless of GPT-4: I encourage you to try it yourself.

It is very easy to implement with TRLx and it works incredibly well.



In the picture: the speed of improvements (papers released after LLaMA's release).
