Large Language Model (LLM) Trends
Mohammed Karimkhan Pathan
Senior Data Scientist | Manager - Projects | Data Science Consultant | Generative AI Expert
A large language model refers to a type of artificial intelligence algorithm that is capable of generating human-like text or completing natural language tasks, such as language translation or text summarization, based on large amounts of training data.
The latest literature on large language models, particularly those based on the transformer architecture, highlights their remarkable performance on a wide range of language-related tasks. For example, the GPT-3 model, which was released by OpenAI in 2020, has 175 billion parameters and has demonstrated impressive language generation and completion capabilities.
In this article, I will summarize the LLM trends explained by Sebastian Raschka.
Large Language Models 1.0.
It's been about half a decade since we saw the emergence of the original transformer model, BERT, BLOOM, GPT-1 through GPT-3, and many more. This generation of large language models (LLMs) peaked with PaLM, Chinchilla, and LLaMA. What the models of this first generation have in common is that they were all pretrained on large, unlabeled text corpora.
Large Language Models 2.0.
Recently, we have seen many pretrained LLMs being finetuned on labeled target data, either using reinforcement learning from human feedback (RLHF) or more classic supervised learning objectives, as discussed in the previous issue of Ahead of AI. Popular examples of these second-generation LLMs include InstructGPT and ChatGPT, as well as Alpaca and Bard (discussed in this newsletter).
Large Language Models 3.0.
It's interesting to think about what the third generation of large language models will look like. Popular themes in recent months have been parameter-efficient finetuning and pretraining on domain-specific data (examples are discussed later in this newsletter). However, these are primarily ways to use LLMs in a more compute- and data-efficient manner. The next generation of LLMs is instead likely to be centered around multimodal and multitask learning, bringing new capabilities to large language models. I expect to see more research in this direction in the upcoming months.
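To make the parameter-efficient finetuning idea concrete, here is a minimal, self-contained sketch of a LoRA-style adapter in plain PyTorch: the pretrained weight matrix is frozen, and only a small low-rank update is trained. This is a generic illustration rather than the recipe of any particular model; the layer size, rank, and scaling factor are assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pretrained linear layer plus a small trainable low-rank update."""

    def __init__(self, pretrained: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = pretrained
        self.base.weight.requires_grad_(False)          # freeze the original weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        in_f, out_f = pretrained.in_features, pretrained.out_features
        self.lora_A = nn.Parameter(torch.randn(rank, in_f) * 0.01)  # trainable
        self.lora_B = nn.Parameter(torch.zeros(out_f, rank))        # trainable, starts at zero
        self.scale = alpha / rank

    def forward(self, x):
        # Original output plus the low-rank correction B @ A @ x
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)

# Example: wrap one (hypothetical) projection layer of a transformer block
layer = LoRALinear(nn.Linear(768, 768), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")  # only the two low-rank matrices
```

Only the low-rank matrices receive gradients, which is why this family of methods needs a small fraction of the memory and compute of full finetuning.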
Extending LLaMA
Last month, Meta's LLaMA, an alternative to GPT-3, made big waves. As a testament to how quickly the AI research field is moving these days, there were already many projects building on top of LLaMA.
One of the notable projects is Alpaca, an instruction-finetuned 7B language transformer based on the 7B LLaMA model. However, instead of using reinforcement learning from human feedback, Alpaca takes a supervised approach using 52k instruction-output pairs.
Instead of using human-generated instruction-output pairs, the researchers retrieved the data by querying the GPT-3-based text-davinci-003 model. So, Alpaca essentially uses a form of weakly supervised or knowledge-distillation-flavored finetuning. Note that this can be competitive with human annotations. For example, in the Self-Instruct paper (https://arxiv.org/abs/2212.10560), the authors found that bootstrapping a model on its generations can result in performance competitive with InstructGPT.
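To give a sense of what this kind of supervised instruction finetuning looks like in practice, here is a minimal sketch using Hugging Face Transformers. The model name, prompt template, and toy instruction-output pairs are placeholders for illustration; this is not Alpaca's exact recipe, which starts from the 7B LLaMA weights and the 52k pairs described above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholders: a small causal LM and a tiny toy dataset of instruction-output pairs.
model_name = "gpt2"  # stand-in; Alpaca itself is initialized from the 7B LLaMA weights
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

pairs = [
    {"instruction": "Translate to French: Good morning.", "output": "Bonjour."},
    {"instruction": "Summarize: The cat sat on the mat.", "output": "A cat sat on a mat."},
]

def format_example(pair):
    # A generic prompt template; the actual Alpaca template differs in wording.
    return f"### Instruction:\n{pair['instruction']}\n\n### Response:\n{pair['output']}"

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for pair in pairs:
    batch = tokenizer(format_example(pair), return_tensors="pt", truncation=True)
    # Standard causal-LM objective: predict every next token, including the response.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

In a real setup, the instruction-output pairs would be batched and the loss would typically be masked so that only the response tokens contribute, but the overall shape of the training loop stays the same.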
The training recipe is available on GitHub, and according to the authors, it can be replicated with 8 A100 GPUs and a ~$600 budget.
Note that the Alpaca website (crfm.stanford.edu/alpaca/) was recently taken down, but the project is still available on GitHub. The reason why Alpaca was taken down is summarized in this news article -- it "can generate false information, propagate social stereotypes, and produce toxic language."
Multimodality Might Be the Next Big Thing: PaLM-E
What's next for large language models (LLMs)? Unfortunately, it's hard to tell. We've seen that pure text models are getting better and better. However, it is difficult to say whether we are approaching the limit of what pure text LLMs are capable of, whether through pretraining on general language corpora or finetuning on labeled target datasets.
PaLM is a large language model, or LLM, similar to the GPT series created by OpenAI or Meta's LLaMA family of models. Google first announced PaLM in April 2022. Like other LLMs, PaLM is a flexible system that can potentially carry out all sorts of text generation and editing tasks.
I am pretty confident that we will see much more research and experiments on making pure text models perform better, for example, by increasing parameter or dataset sizes or improving the architecture and training techniques.
Nonetheless, it's interesting to think about the other directions that future LLM development will focus on. For example, the next trend may be extending capabilities to vision and other modalities, combined with multitask training.
Last year, Google published PaLM: Scaling Language Modeling with Pathways -- a strong alternative to GPT. Now, about one year later, PaLM-E: An Embodied Multimodal Language Model was just released. So let's take a look at what that is all about.
The focus of PaLM-E was on robotic manipulation planning. But interestingly, the model also showed emergent abilities for visual question answering and captioning when trained on multimodal inputs. Of course, PaLM-E is not the first language model that also supports image inputs. However, what's interesting and novel here is that one or more images can be flexibly included in any part of a sentence.
It should be noted that PaLM-E is still trained as a decoder-only LLM that autoregressively generates text completions based on a given prefix or prompt. So, how do they enable the input of state representations or images? Pretty simple: they use pretrained networks to encode them into embeddings. For images, for example, they experiment with 4B- and 22B-parameter vision transformers (ViTs) to produce embedding vectors. These embedding vectors are then linearly projected to match the embedding dimensions of the word token embeddings.
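Here is a rough sketch of that projection step, using made-up dimensions rather than PaLM-E's actual ones: a pretrained vision encoder produces patch embeddings, and a single learned linear layer maps them into the language model's token-embedding space.

```python
import torch
import torch.nn as nn

# Assumed, illustrative dimensions; PaLM-E uses 4B/22B-parameter ViTs and a much larger LM.
vit_dim, lm_dim, num_patches = 1024, 4096, 16

# Stand-in for the output of a pretrained vision encoder: one embedding per image patch.
image_patch_embeddings = torch.randn(num_patches, vit_dim)

# The learned linear projection into the word-token embedding space.
projector = nn.Linear(vit_dim, lm_dim)
image_tokens = projector(image_patch_embeddings)

# These vectors can now be interleaved with ordinary word-token embeddings.
print(image_tokens.shape)  # torch.Size([16, 4096])
```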
During training, to form the multimodal sentences in the first place, they use special tokens (<img1>, <img2>, etc.) that then get swapped with the embedded images (similar to how word tokens are embedded via embedding layers).
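Continuing the sketch, the placeholder token can then be replaced at the embedding level roughly as follows; the token IDs, shapes, and the leading position of the placeholder are all illustrative assumptions.

```python
import torch
import torch.nn as nn

lm_dim, vocab_size = 4096, 32000
embedding = nn.Embedding(vocab_size, lm_dim)       # the LM's word-token embedding table

# Pretend-tokenized sentence starting with an image placeholder, marked here by id 0.
IMG_PLACEHOLDER_ID = 0
token_ids = torch.tensor([IMG_PLACEHOLDER_ID, 11, 23, 42, 57, 99, 7])
token_embeds = embedding(token_ids)                # shape: (7, lm_dim)

# Projected image embeddings from the vision encoder (see the previous sketch); random stand-ins here.
image_tokens = torch.randn(4, lm_dim)

# Splice the image embeddings in place of the (leading) placeholder before feeding the decoder.
# A general implementation would insert them at the placeholder's actual position.
is_image_slot = token_ids == IMG_PLACEHOLDER_ID
multimodal_sequence = torch.cat([image_tokens, token_embeds[~is_image_slot]], dim=0)
print(multimodal_sequence.shape)                   # torch.Size([10, 4096])
```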
The most interesting question is whether the pretrained LLM should be frozen so that only the ViT embedding network is trained. It turns out that finetuning the LLM performs better, and co-training on the "full mixture" of tasks achieves more than double the performance. In other words, the big takeaway of this paper is that multitask training improves performance compared to training models on individual tasks.
Other Multimodal LLMs (MLLMs)
Note that PaLM-E is, of course, not the first and only LLM supporting multiple modalities (aka MLLMs); there are several other recent and popular examples.
Is It Worth Training Large Language Models From Scratch?
Is it worth training your own large language model (LLM) on domain-specific data from scratch? Researchers at Bloomberg did just that and shared a detailed technical report describing the dataset, model configuration, and training procedure. In my experience, it makes sense if we want to apply LLMs to novel data sources (e.g., protein amino acid sequences, as ProtBERT and others demonstrated). But how about adjacent data like finance articles?
Let's discuss the model proposed in BloombergGPT: A Large Language Model for Finance in more detail. BloombergGPT is a 50-billion-parameter language model for finance, trained on 363 billion tokens from finance data and 345 billion tokens from a general, publicly available dataset. For comparison, GPT-3 is 3.5x larger (175 billion parameters) but was trained on 1.4x fewer tokens (499 billion).
Of course, BloombergGPT outperformed other LLMs on finance-related tasks. Interestingly, it also still performed well on general language tasks. I would have loved to know whether a two-stage pretraining approach or domain-specific finetuning would have yielded even better performance on the domain-specific data. I presume the authors didn't carry out these experiments due to cost reasons.
This brings us to the next topic: What hardware was the model trained on? The model was trained on 64 x 8 = 512 A100 GPUs on AWS for ~53 days. Doing some napkin math, assuming a discounted rate of $1.10 per A100 GPU-hour, that works out to 1,274 (hours) x $1.10 (per GPU-hour) x 512 (GPUs) ≈ $717k, or roughly $700k. That doesn't sound too bad, given that the LLaMA paper reported $600k training costs. But the caveat is that this figure doesn't include hyperparameter optimization and failed runs.
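Spelling out that napkin math (the $1.10 GPU-hour rate is an assumed discounted figure, not a quoted AWS price):

```python
# Back-of-the-envelope training cost for BloombergGPT, as described above.
gpus = 64 * 8                      # 64 nodes with 8 A100s each = 512 GPUs
hours = 1274                       # ~53 days of training
price_per_gpu_hour = 1.10          # assumed discounted rate in USD

cost = gpus * hours * price_per_gpu_hour
print(f"~${cost:,.0f}")            # ~$717,517, i.e. roughly $700k
```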
Why did the authors use an architecture with "only" 50 billion parameters when GPT-3 is 3.5x larger? That question is easier to answer: they adopted the Chinchilla scaling laws and found 50 billion parameters to be a good size given the available amount of finance data.
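As a rough illustration of the Chinchilla rule of thumb (roughly 20 training tokens per model parameter), here is what a naive sizing estimate looks like for BloombergGPT's token budget. Note that this crude ratio lands below the 50 billion parameters the authors chose; their actual sizing followed the full scaling-law analysis and their compute budget, so treat this only as a ballpark.

```python
# Rough Chinchilla-style sizing: ~20 training tokens per parameter (rule of thumb).
TOKENS_PER_PARAM = 20

finance_tokens = 363e9
general_tokens = 345e9
total_tokens = finance_tokens + general_tokens            # ~708 billion tokens

compute_optimal_params = total_tokens / TOKENS_PER_PARAM
print(f"{compute_optimal_params / 1e9:.0f}B parameters")  # ~35B by this crude rule of thumb
```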
Is it worth (pre)training the LLM on the combined dataset from scratch? Based on the paper, the model performs really well in the target domain. However, we don't know whether it's better than a) further pretraining a pretrained model on domain-specific data or b) finetuning a pretrained model on domain-specific data.
Why didn't they explore finetuning or continued training of existing LLMs such as BLOOM? Finetuning (via RLHF or supervised finetuning) may have been more challenging to automate. And perhaps they preferred not to continue training an existing model because BLOOM's 3.5x larger size doesn't match what the scaling laws suggest for their amount of data.
However, if you want to use the combined pretraining approach, BloombergGPT delivers an excellently documented blueprint, including detailed descriptions of the architecture, datasets, and hyperparameters.