Ahead of AI #8: The Latest Open Source LLMs and Datasets

The theme of this month's issue is to organize the almost overwhelming number of large language model (LLM) projects and datasets shared in the last couple of weeks. In addition to highlighting insights from large training runs, I will also share several curated lists to keep track of it all.


PS: If the newsletter appears truncated or clipped, that's because some email providers may truncate longer email messages. In this case, you can access the full article at https://magazine.sebastianraschka.com.


Articles & Trends

With so many research papers coming out this month, it's hard to pick a few favorites for a closer discussion. I personally prefer papers that offer extra insights rather than simply presenting more powerful models. In light of this, one paper that captured my attention was Eleuther AI's Pythia paper.

Pythia: Interpreting Autoregressive Transformers Across Time and Scale

The open source Pythia suite of LLM models is a really interesting alternative to other autoregressive decoder-style (aka GPT-like) models.

The accompanying Pythia: A Suite for Analyzing Large Language Models Across Training and Scale paper reveals some interesting insights into the training mechanics in addition to introducing a suite of LLMs ranging from 70M to 12B parameters.

Here are some insights and takeaways from the paper:

  1. Does training on duplicated data (i.e., training for >1 epoch) make a difference? It turns out that deduplicating the data neither benefits nor hurts performance.
  2. Does training order influence memorization? Unfortunately, it turns out that it does not. "Unfortunately," because if it did, we could mitigate undesirable verbatim memorization issues by reordering the training data.
  3. Does pretrained term frequency influence task performance? Yes, few-shot accuracy tends to be higher for terms that occur more frequently.
  4. Doubling the batch size halves the training time but doesn't hurt convergence.

The Pythia model architecture is similar to GPT-3 but includes some improvements, for example, Flash Attention (like LLaMA) and Rotary Positional Embeddings (like PaLM). Pythia was trained on the Pile dataset (an 800GB dataset of diverse texts) for 300B tokens (~1 epoch on the regular Pile, ~1.5 epochs on the deduplicated Pile).
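As a quick refresher on rotary positional embeddings, here is a minimal PyTorch sketch of the core idea (my own illustration, not the Pythia implementation): query and key dimensions are rotated pairwise by position-dependent angles before the attention dot product.

```python
import torch

def apply_rope(x, base=10000):
    # x: (batch, seq_len, n_heads, head_dim); head_dim must be even
    _, seq_len, _, head_dim = x.shape
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)  # (seq_len, head_dim/2)
    cos = angles.cos()[None, :, None, :]  # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    # rotate each (x1, x2) pair by its position- and frequency-dependent angle
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)

# queries and keys would both be rotated like this before computing attention scores
q = torch.randn(1, 16, 8, 64)  # (batch, seq_len, n_heads, head_dim)
q_rotated = apply_rope(q)
```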

Annotated figures from the Pythia paper (https://arxiv.org/abs/2304.01373).

By the way, the recently released, instruction-finetuned open-source Dolly v2 LLM* used Pythia as its base (foundation) model, and I expect more models to build on Pythia in the future.

(*Dolly v2 is a Pythia 12B parameter model finetuned on the databricks-dolly-15k dataset to replicate the InstructGPT paper / ChatGPT. More on this dataset below.)

Open Source Data

Last month was a particularly exciting month for open-source AI, with several open-source implementations of large language models (LLMs). Now we are also seeing a big wave of open-source datasets: databricks-dolly-15k and OpenAssistant Conversations for instruction finetuning, and RedPajama for pretraining. This is particularly commendable because data collection and cleaning make up ~90% of a real-world machine learning project, yet no one likes doing this work.

RedPajama Dataset for Pretraining

The RedPajama dataset is an open-source dataset for pretraining LLMs, similar to the one used to train Meta's state-of-the-art LLaMA model. The goal of this project is to create a capable open-source competitor to the most popular LLMs, which are currently closed commercial models or only partially open. (PS: The name RedPajama is inspired by the children's book Llama Llama Red Pajama.)

The bulk of the dataset consists of CommonCrawl data filtered for English-language websites, while the Wikipedia portion covers 20 different languages.
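For those who want to peek at the data, a minimal sketch along the following lines should work, assuming the corpus is mirrored on the Hugging Face Hub as togethercomputer/RedPajama-Data-1T and that the subset and field names below match the dataset card (both are assumptions worth double-checking):

```python
from datasets import load_dataset

# stream a single subset so we don't download the full multi-terabyte corpus
ds = load_dataset(
    "togethercomputer/RedPajama-Data-1T",  # assumed Hub identifier
    "arxiv",                               # assumed subset name
    split="train",
    streaming=True,
)
sample = next(iter(ds))
print(sample["text"][:500])  # assumed text field
```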

The number of tokens in the RedPajama dataset that aims to reproduce the LLaMA pretraining datasets (https://github.com/togethercomputer/RedPajama-Data).


Databricks-Dolly-15k

Databricks-dolly-15k is a dataset for LLM finetuning that features >15,000 instruction–response pairs written by thousands of Databricks employees (similar to those used to train systems like InstructGPT and ChatGPT).
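As a minimal sketch of what working with this dataset looks like (assuming the copy hosted on the Hugging Face Hub as databricks/databricks-dolly-15k; the Alpaca-style prompt template below is my own choice, not an official Dolly format):

```python
from datasets import load_dataset

ds = load_dataset("databricks/databricks-dolly-15k", split="train")
example = ds[0]  # fields: instruction, context, response, category

# format one instruction pair with a simple Alpaca-style template (an assumption, not prescribed)
prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    f"### Instruction:\n{example['instruction']}\n\n"
    "### Response:\n"
)
target = example["response"]
print(prompt + target)
```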

OpenAssistant Conversations

OpenAssistant Conversations is another dataset for finetuning pretrained LLMs. It consists of ChatGPT-assistant-like dialogues that were created and annotated by humans, encompassing 161,443 messages in 35 diverse languages along with 461,292 quality assessments. These are organized into more than 10,000 thoroughly annotated conversation trees.
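Because each message in a conversation tree can have several alternative replies, a common preprocessing step is to flatten the trees into linear dialogues before finetuning. The sketch below assumes a nested dict format with "role", "text", and "replies" keys, which is an assumption for illustration; the actual OpenAssistant export format may differ.

```python
def linearize(node, prefix=None):
    """Yield every root-to-leaf path in a conversation tree as a list of (role, text) turns."""
    path = (prefix or []) + [(node["role"], node["text"])]
    if not node.get("replies"):
        yield path
    else:
        for child in node["replies"]:
            yield from linearize(child, path)

tree = {
    "role": "prompter",
    "text": "What is a conversation tree?",
    "replies": [
        {"role": "assistant", "text": "A branching set of candidate replies.", "replies": []},
        {"role": "assistant", "text": "A tree whose nodes are chat messages.", "replies": []},
    ],
}

for dialogue in linearize(tree):
    print(dialogue)  # two linear dialogues, one per assistant branch
```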

LongForm Dataset

The LongForm: Optimizing Instruction Tuning for Long Text Generation with Corpus Extraction paper introduces a collection of human-authored documents from established corpora like C4 and Wikipedia along with instructions for these documents to create an instruction-tuning dataset suitable for long text generation.

Alpaca Libre

The Alpaca Libre project aims to reproduce the Alpaca project with 100k+ MIT-licensed demonstrations from Anthropic's HH-RLHF repo converted into an "Alpaca compatible format".

Open Data And Copyrights

Note that the data used to train LLMs is still a hot topic of discussion, as summarized in the articles below:

Extending Open-Source Datasets

Instruction finetuning is how we get from GPT-3-like pretrained base models to more capable LLMs like ChatGPT. And open-source human-generated instruction datasets like databricks-dolly-15k (above) can help make this possible. But how do we scale this further, ideally without collecting additional data? One way is bootstrapping an LLM off its own generations.

While the Self-Instruct method was introduced five months ago (old by today's standards), it's still a super interesting approach to highlight since it's an almost annotation-free way to align pretrained LLMs with instructions.

How does this work? In a nutshell, it's a 4-step process (sketched in code after the list below):

  1. Seed task pool with a set of human-written instructions (175 in this case) and sample instructions;
  2. Use a pretrained LLM (like GPT-3) to determine the task category;
  3. Given the new instruction, let a pretrained LLM generate the response;
  4. Collect, prune, and filter the responses before adding them to the task pool.
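The following is a minimal Python sketch of that loop. The `complete()` function is a placeholder for whatever pretrained-LLM call you use (not a real API), and the filtering step is reduced to a crude token-overlap check standing in for the paper's ROUGE-L-based deduplication.

```python
import random

def complete(prompt: str) -> str:
    """Placeholder for a call to a pretrained LLM (e.g., GPT-3 via an API)."""
    raise NotImplementedError

def too_similar(candidate: str, pool: list, threshold: float = 0.7) -> bool:
    """Crude token-overlap stand-in for the paper's ROUGE-L filter."""
    cand = set(candidate.lower().split())
    for existing in pool:
        other = set(existing.lower().split())
        if len(cand & other) / max(len(cand | other), 1) > threshold:
            return True
    return False

# Step 1: seed the task pool with human-written instructions (175 in the paper)
task_pool = [
    "Rewrite the following sentence in a formal tone.",
    "List three creative uses for a paperclip.",
]
instruction_data = []

for _ in range(1000):
    # sample in-context examples and ask the LLM for a brand-new instruction
    examples = random.sample(task_pool, k=min(8, len(task_pool)))
    new_instruction = complete("Come up with a new task:\n" + "\n".join(examples))

    # Step 2: let the LLM determine the task category
    # (in the full method this steers the prompting format; simplified away here)
    task_type = complete(f"Is the following a classification task? {new_instruction}")

    # Step 3: let the LLM generate a response for the new instruction
    response = complete(f"Instruction: {new_instruction}\nResponse:")

    # Step 4: prune near-duplicates before growing the pool and the finetuning data
    if not too_similar(new_instruction, task_pool):
        task_pool.append(new_instruction)
        instruction_data.append({"instruction": new_instruction, "output": response})
```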

Annotated figure based on the Self-Instruct paper (https://arxiv.org/abs/2212.10560).


In practice, this works relatively well based on the ROUGE scores. For example, a Self-Instruct-finetuned LLM outperforms the GPT-3 base LLM (1) and can compete with an LLM pretrained on a large human-written instruction set (2). And self-instruct can also benefit LLMs that were already finetuned on human instructions (3).

But of course, the gold standard for evaluating LLMs is to ask human raters. Based on human evaluation, Self-Instruct outperforms the base LLM as well as LLMs trained on human instruction datasets in a supervised fashion (SuperNI, T0 Trainer). But interestingly, Self-Instruct does not outperform methods trained via reinforcement learning with human feedback (RLHF).

Human-generated vs synthetic training data

Which is more promising, human-generated instruction datasets or self-instruct datasets? I vote for both. Why not start with a human-generated instruction dataset like the 15k instructions from databricks-dolly-15k and then scale this with Self-Instruct? For example, the recent Synthetic Data from Diffusion Models Improves ImageNet Classification paper showed that combining real image training sets with AI-generated images can improve model performance. Wouldn't it be interesting to investigate whether the same is true for text data?

A recent example that goes in this direction is the Better Language Models of Code through Self-Improvement paper. The researchers found that a pretrained LLM can improve at code-generation tasks by utilizing its own generated data.

Less is more?

Also, in addition to pretraining and finetuning models on increasingly larger datasets, how about improving efficiency on smaller datasets? In the just-released Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes paper, researchers propose a distillation mechanism to train smaller task-specific models that exceed the performance of standard finetuning while using less training data.
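At its core, the approach trains the small model with a multi-task objective: predict the task label and also generate the rationale that a large LLM produced via chain-of-thought prompting. Below is a minimal sketch of that combined loss (a simplification of the paper's seq2seq setup, with hypothetical tensor shapes):

```python
import torch
import torch.nn.functional as F

def distilling_step_by_step_loss(label_logits, label_targets,
                                 rationale_logits, rationale_targets, lam=1.0):
    # label_logits: (batch, label_len, vocab); rationale_logits: (batch, rationale_len, vocab)
    loss_label = F.cross_entropy(
        label_logits.reshape(-1, label_logits.size(-1)), label_targets.reshape(-1)
    )
    # auxiliary loss: reproduce the rationale extracted from the large LLM
    loss_rationale = F.cross_entropy(
        rationale_logits.reshape(-1, rationale_logits.size(-1)), rationale_targets.reshape(-1)
    )
    return loss_label + lam * loss_rationale

# toy usage with random tensors
vocab = 100
loss = distilling_step_by_step_loss(
    torch.randn(4, 3, vocab), torch.randint(0, vocab, (4, 3)),
    torch.randn(4, 20, vocab), torch.randint(0, vocab, (4, 20)),
)
```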


Screenshot from Distilling Step-by-Step paper (https://arxiv.org/abs/2305.02301).

Keeping Track of Open-Source LLMs

The number of open-source LLMs is exploding, which is, on the one hand, a very nice development (versus gating the models via paid APIs). On the other hand, however, it can sometimes be tricky to keep track of it all. Below are four resources that provide different summaries of the most relevant models, including their relationship, the underlying datasets, and the various licensing information.

(On an interesting side note, LAION and other prominent researchers shared an open letter ("A Call To Protect Open-Source AI In Europe") intended for the European Parliament.)

Let's start with the Ecosystems Graphs website based on the paper Ecosystem Graphs: The Social Footprint of Foundation Models, which features a table (screenshot below) and an interactive dependency graph (not shown).

Screenshot of the top portion of the ecosystems graph table from https://crfm.stanford.edu/ecosystem-graphs/.

This ecosystems graph is the most comprehensive listing I've seen so far. But since many less popular LLMs are included, it can appear a bit cluttered. Also, based on the accompanying GitHub repo, it hasn't been updated in at least a month, so it is unclear whether newer models will be added (but this is true for the two more recent resources below as well).

The second resource is the beautifully drawn evolutionary tree from the recent Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond paper, which focuses on the most popular LLMs and their relationship.

Figure from https://arxiv.org/abs/2304.13712.

While the LLM evolutionary tree is a very nice and clear visualization, there are a few small caveats. For example, it's unclear why the root doesn't start with the original transformer architecture from Attention Is All You Need. Also, the "open source" labels are not entirely accurate. For example, LLaMA is listed as open source, but the weights are not available under an open-source license (only the inference code is).

The third resource is a table by my colleague Daniela Dapena from the blog post The Ultimate Battle of Language Models: Lit-LLaMA vs GPT3.5 vs Bloom vs …

Table from The Ultimate Battle of Language Models: Lit-LLaMA vs GPT3.5 vs Bloom vs … (https://lightning.ai/pages/community/community-discussions/the-ultimate-battle-of-language-models-lit-llama-vs-gpt3.5-vs-bloom-vs/).

While this table is a bit smaller than the other resources, it's nice that it includes the model sizes and licensing information, which has practical implications if you plan to adopt any of these models for your own projects.

Fourth, the LLaMA-Cult-and-More overview table provides additional information about the finetuning methods as well as hardware costs.

Truncated table from https://github.com/shm007g/LLaMA-Cult-and-More/blob/main/chart.md.


Next, it would be nice to somehow include modeling performance in these tables, but this is a whole other can of worms, since evaluating LLMs efficiently is not very straightforward ...

Evaluating LLMs

Speaking of evaluation, two comprehensive efforts in this direction are the Flan-Eval: Reproducible Held-Out Evaluation for Instruction Tuning and Language Model Evaluation Harness repositories.

Moreover, I've seen several projects employing GPT-4 for model evaluation, including the new LLaMA-Adapter V2 paper. This GPT-4 evaluation method asks GPT-4 which of the outputs of two LLMs it prefers. (Note that the LLaMA-Adapter V2 authors mentioned in a Discord chat that this method is biased toward the first response, and they are working on addressing this issue by also feeding in the response pairs in reversed order.)
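A small sketch of that debiasing idea: query the judge twice with the two responses in swapped order and only count a win when both orderings agree. Here, `ask_gpt4` is a placeholder for whatever API call you use, and the prompt wording is my own, not taken from the paper.

```python
def pairwise_judge(question, response_a, response_b, ask_gpt4):
    """Query the judge twice with swapped orders to reduce position bias."""
    template = (
        "Question: {q}\n\nResponse 1:\n{r1}\n\nResponse 2:\n{r2}\n\n"
        "Which response is better? Answer with '1' or '2' only."
    )
    first = ask_gpt4(template.format(q=question, r1=response_a, r2=response_b)).strip()
    second = ask_gpt4(template.format(q=question, r1=response_b, r2=response_a)).strip()

    if first == "1" and second == "2":
        return "A"    # A preferred in both orderings
    if first == "2" and second == "1":
        return "B"    # B preferred in both orderings
    return "tie"      # the judgments disagree, likely due to position bias
```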

LLaMA-Adapter V2

Remember LLaMA-Adapter, the nice parameter-efficient finetuning technique for large language models (LLMs) from last month? Since I also predicted that we'd be seeing more multimodal LLMs this month, we have to talk about the freshly released "LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model"!

As a brief recap, what's LLaMA-Adapter, and how does it work? LLaMA-Adapter is a parameter-efficient LLM finetuning technique that modifies the first few transformer blocks and introduces a gating mechanism to stabilize the training.

Using the LLaMA-Adapter approach, researchers were able to finetune a 7 billion parameter LLaMA model in only 1 hour (using eight A100 GPUs) on 52k instruction pairs. Only the newly added 1.2 M parameters (the adapter layers) were tuned, while the 7B LLaMA model itself remained frozen.
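As a rough sketch of the gating idea (my own simplification in PyTorch, not the authors' implementation): learnable prompt tokens contribute an extra attention output that is scaled by a zero-initialized gate, so training starts out identical to the frozen pretrained model and the adapter's influence grows gradually.

```python
import torch
import torch.nn as nn

class GatedPromptAdapter(nn.Module):
    """Adds a gated attention over learnable prompt tokens to a frozen attention block."""

    def __init__(self, embed_dim=4096, n_prompt_tokens=10, n_heads=32):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_prompt_tokens, embed_dim) * 0.02)
        self.prompt_attn = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # zero init: adapter has no effect at step 0

    def forward(self, x, frozen_attn_out):
        # x: (batch, seq_len, embed_dim); frozen_attn_out: output of the frozen attention block
        prompts = self.prompt.unsqueeze(0).expand(x.size(0), -1, -1)
        adapter_out, _ = self.prompt_attn(x, prompts, prompts)  # tokens attend to the prompts
        return frozen_attn_out + torch.tanh(self.gate) * adapter_out
```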

The focus of LLaMA-Adapter V2 is multimodality, that is, building a visual instruction model that can also talk about image inputs. (The original V1 could also receive image tokens (together with text tokens), but this aspect remained underexplored.)

They added three main tricks in V2 to improve the V1 adapter approach.

  1. Early fusion of visual knowledge: Instead of fusing visual and adapted prompts in each adapted layer, they now concatenate the visual tokens with the word tokens in the first transformer block.
  2. Using more parameters: they unfreeze all normalization layers, and they add both bias units and scale factors to each linear layer in the transformer blocks (see the sketch after this list).
  3. Joint training with disjoint parameters: for image-text captioning data, they only train the visual projection layers; for instruction-following data, they only train the adaption layers (and the newly added parameters mentioned above).
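The second trick, adding a learnable bias and scale to each otherwise frozen linear layer, can be sketched as a thin wrapper (again my own simplification, not the authors' code):

```python
import torch
import torch.nn as nn

class BiasScaleLinear(nn.Module):
    """Wrap a frozen linear layer with a learnable per-output scale and bias."""

    def __init__(self, frozen_linear: nn.Linear):
        super().__init__()
        self.linear = frozen_linear
        for p in self.linear.parameters():
            p.requires_grad = False                            # keep pretrained weights frozen
        out_features = frozen_linear.out_features
        self.scale = nn.Parameter(torch.ones(out_features))    # initialized to identity
        self.bias = nn.Parameter(torch.zeros(out_features))    # initialized to no-op

    def forward(self, x):
        return self.scale * self.linear(x) + self.bias
```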

LLaMA-Adapter V2 (14 M parameters) comes with a few more trainable parameters than LLaMA-Adapter V1 (1.2 M), but it is still lightweight, comprising only 0.02% of the 65B-parameter LLaMA model.

What's particularly impressive is that by finetuning only 14 M parameters of the 65B LLaMA model, the finetuned LLaMA-Adapter V2 model matches ChatGPT in performance (using a GPT-4 model for the evaluation). It also outperforms the 13B Vicuna model, which uses a full-finetuning approach (updating all 13B parameters).

Unfortunately, the V2 paper omits the computational performance benchmarks included in the V1 paper, but we can assume that V2 is still much faster than full finetuning.


Open Source Highlights

Lit-LLaMA improvements

At Lightning AI, we are maintaining the Lit-LLaMA repository, which offers a simpler, more readable open-source reimplementation of the popular LLaMA model along with several finetuning methods such as low-rank adaptation (LoRA) and LLaMA-Adapter, as well as quantization, and more.
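For readers unfamiliar with low-rank adaptation (LoRA), the idea is to keep the pretrained weight matrix frozen and learn only a small low-rank update on top of it. Below is a minimal sketch of the concept, not the actual Lit-LLaMA implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update, scaled by alpha / r."""

    def __init__(self, frozen_linear: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.linear = frozen_linear
        for p in self.linear.parameters():
            p.requires_grad = False
        in_f, out_f = frozen_linear.in_features, frozen_linear.out_features
        self.lora_a = nn.Parameter(torch.randn(r, in_f) * 0.01)  # A: down-projection
        self.lora_b = nn.Parameter(torch.zeros(out_f, r))        # B: zero init -> no initial change
        self.scaling = alpha / r

    def forward(self, x):
        return self.linear(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling
```

Only the two small matrices A and B are trained, which is why LoRA checkpoints are tiny compared to the full model weights.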

The recent additions to this repository include the following:

Other open-source LLMs

It's impossible to list them all, but some of the notable open-source LLMs and chatbots launched this month include Open-Assistant, Baize, StableVicuna, ColossalChat, Mosaic's MPT, and more. Moreover, below are two multimodal LLMs that are particularly interesting.

OpenFlamingo

OpenFlamingo is an open-source reproduction of Google DeepMind's Flamingo model released last year. OpenFlamingo aims to offer multimodal image-reasoning capabilities for LLMs, letting users interleave text and image inputs.

MiniGPT-4

MiniGPT-4 is another open-source model with vision-language capabilities. It's based on a frozen visual encoder from BLIP-2 combined with a frozen Vicuna LLM.

MiniGPT-4 screenshot via https://github.com/Vision-CAIR/MiniGPT-4.

NeMo Guardrails

With all these new LLMs coming out, a lot of companies are wondering whether and how to deploy them, especially when it comes to safety. There is no good solution yet, but there's at least one more promising band-aid: Nvidia open-sourced a toolkit to address the hallucination issue of LLMs.

In a nutshell, the method uses a database of manually curated, hardcoded prompts. When a user enters a prompt, it is first matched to the most similar entry in that database. The database then returns a hardcoded prompt, which is, in turn, passed on to the LLM. So, if someone carefully tests the hardcoded prompts, one can ensure that the interactions do not diverge from permissible topics, and so on.
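In spirit, the matching step works like the sketch below, which uses embedding similarity to route a user prompt to the closest curated prompt. This is my own illustration of the general idea, with a placeholder `embed` function, not the NeMo Guardrails API.

```python
import numpy as np

def route_prompt(user_prompt, canonical_prompts, embed, min_similarity=0.5):
    """Map a user prompt to the most similar manually curated (canonical) prompt."""
    user_vec = embed(user_prompt)  # embed() is a placeholder, e.g., a sentence-embedding model
    best_score, best_template = -1.0, None
    for canonical_text, curated_prompt in canonical_prompts.items():
        vec = embed(canonical_text)
        score = np.dot(user_vec, vec) / (np.linalg.norm(user_vec) * np.linalg.norm(vec))
        if score > best_score:
            best_score, best_template = score, curated_prompt
    if best_score < min_similarity:
        # no sufficiently close match: deflect instead of passing the raw prompt to the LLM
        return "I'm sorry, I can only help with supported topics."
    return best_template  # this curated prompt is what actually gets sent to the LLM
```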

Outline of the NeMo Guardrails concept.

It is an interesting but not groundbreaking approach, as it does not give the LLM better or new capabilities. It merely restricts how a user can interact with the LLM. Still, it might be a viable approach until researchers find alternative ways to mitigate hallucination issues and toxic behavior in LLMs.

The guardrails approach could also be combined with other alignment techniques, such as the popular Reinforcement Learning with Human Feedback training paradigm that I covered in a previous issue of Ahead of AI.

Consistency Models

Wouldn't it be nice to talk about interesting models other than LLMs? OpenAI finally open-sourced the code for their consistency models: https://github.com/openai/consistency_models.

Consistency models are considered to be a viable, efficient alternative to diffusion models. More info in the Consistency Models paper.


New Deep Learning Fundamentals Units on Computer Vision and Large Language Models

Two new units of my free Deep Learning Fundamentals course just went live in the last couple of weeks.

Unit 7 covers computer vision models: how to work with image data, train convolutional networks, use transfer learning, and pretrain models with self-supervised learning.

Unit 8 covers large language models and natural language processing:

  • how to encode text data
  • understanding self-attention from the ground up
  • and finetuning large language models.

Of course, both lectures include many code examples and walkthroughs using PyTorch. The course is entirely free, and the videos are relatively short and bite-sized. I hope you find them useful!

Slide screenshot from Deep Learning Fundamentals Unit 8: Natural Language Processing and Large Language Models

Notable Quote

"I think we're at the end of the era where it's going to be these, like, giant, giant models. [...] We'll make them better in other ways." -- Sam Altman (CEO of OpenAI) in a Wired interview


Upcoming Events

Lightning AI and Stability AI are hosting a free Open Source AI meetup in NYC on May 19th to chat about open source and AI, show off demos, and more. There is an RSVP link here. I am planning to attend, and if you are in town and planning to attend as well, don't hesitate to say hi!




Thank you to those who have reached out asking how they can support Ahead of AI. While this newsletter is free and unabbreviated, there is a paid subscription option on Substack for those who would like to support it.
