A quick review of the PaLM model

The PaLM series of models are large language models developed by Google AI and introduced in April 2022. The first version, with 540 billion parameters, was the largest dense language model at the time of its release. The name stands for Pathways Language Model: the model was trained using Google's Pathways system, which is designed to scale training across thousands of TPU chips.

Recommendation:

Follow AI Coffee Break on YouTube, a channel that talks about the latest technical developments in large language models, machine learning, and more.

A quick summary of "boring" language models

Before this model we saw models like GLaM with 1.2 trillion parameters! But before you get too impressed by this number, you should know that it is a sparse language model: it activates only a subset of its parameters for each input, routing each token to a small number of "expert" sub-networks. This is in contrast to the working principle of dense models, which activate all of their parameters on every forward pass. GLaM activates only about 8% of its parameters per forward pass, roughly 97 billion, in order to reduce the compute burden. Other popular sparse models of that period include Wu Dao 2.0 with 1.75 trillion parameters and Switch Transformer with 1.6 trillion parameters.

Dense models, on the other hand, activate all their parameters during each processing step. Examples include LaMDA (137 billion parameters), Gopher (280 billion parameters), and the largest dense model at that time, Megatron-Turing NLG (530 billion parameters) from Microsoft and Nvidia.
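To make the sparse-vs-dense distinction concrete, here is a minimal, purely illustrative sketch of sparse activation via top-k expert routing, the core idea behind models like GLaM and Switch Transformer. The layer sizes, number of experts, and the per-token loop below are hypothetical toy choices, not the actual GLaM implementation (which adds load-balancing losses and runs experts in parallel across devices):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKSparseMoE(nn.Module):
    """Toy mixture-of-experts layer: each token only runs through its top-k experts."""

    def __init__(self, d_model=256, d_ff=1024, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        self.router = nn.Linear(d_model, num_experts)   # gating network
        self.top_k = top_k

    def forward(self, x):                               # x: (num_tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)       # routing probabilities
        weights, chosen = gates.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        # Only the chosen experts run for each token; the rest stay idle,
        # so the compute per token is a small fraction of the total parameters.
        for t in range(x.size(0)):
            for w, e in zip(weights[t], chosen[t]):
                out[t] += w * self.experts[int(e)](x[t])
        return out

# A dense feed-forward layer would instead push every token through one big MLP.
layer = TopKSparseMoE()
print(layer(torch.randn(4, 256)).shape)                 # torch.Size([4, 256])
```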

In the face of intense competition with Microsoft, Google launched PaLM, a dense model with 540 billion parameters, exceeding Megatron-Turing NLG by 10 billion parameters.



What is PaLM?

PaLM, a massive language model developed by Google AI and introduced in April 2022, comes in three sizes: 8B, 62B, and 540B parameters. These variations allow researchers to study how model size affects performance on various tasks.

Architecture: Familiar Ground with Key Tweaks

While PaLM's core architecture resembles previous decoder-only models like GPT, it incorporates several key modifications:

  • SwiGLU Activation Function: This function (introduced by Shazeer in 2020) improves performance compared to standard activation functions.
  • Parallel Layers: Instead of running the attention block and the feed-forward block sequentially, each layer computes them in parallel from the same normalized input and adds both results to the residual stream. This speeds up training by roughly 15%. There is a slight quality degradation at the smaller 8B scale, but the effect becomes negligible at 540B (see the sketch after this list).
  • RoPE Embeddings: This technique, proven effective by Su et al. (2021), enhances the model's handling of long text sequences.
  • Multi-Query Attention: Unlike standard multi-head attention, where every head has its own query, key, and value projections, multi-query attention shares a single key and value projection across all heads (the query projections remain per-head). This significantly reduces the cost of autoregressive decoding at inference time.
  • Shared Input-Output Embeddings: The model uses the same weight matrix for the input embedding layer and the final prediction layer, which reduces the parameter count.
  • No Biases: No bias terms are used in any of the dense (feed-forward) layers or the LayerNorm layers. This slightly reduces computation, but according to the paper the more important effect is increased training stability for large models.
  • SentencePiece Vocabulary: PaLM utilizes SentencePiece with a 256k token vocabulary to accommodate a wide range of languages in the training data. This vocabulary is created from the training set itself for efficiency and is lossless (reversible), making it suitable for languages like Chinese and code.
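To tie several of these modifications together, here is a small, purely illustrative sketch of a PaLM-style decoder block: the parallel attention/MLP formulation, SwiGLU, multi-query attention with one shared key/value head, and no bias terms. This is not Google's actual implementation (which is written in JAX/T5X), rotary position embeddings are omitted for brevity, and all dimensions are hypothetical toy values:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelDecoderBlock(nn.Module):
    """Toy PaLM-style block: parallel branches, SwiGLU, multi-query attention, no biases."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.norm = nn.LayerNorm(d_model)                # PaLM also drops the LayerNorm bias
        # Multi-query attention: per-head queries, one shared key/value head.
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.kv_proj = nn.Linear(d_model, 2 * self.d_head, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)
        # SwiGLU feed-forward: gate and up projections, then a down projection.
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):                                # x: (batch, seq, d_model)
        b, t, _ = x.shape
        h = self.norm(x)                                 # single shared LayerNorm

        # Attention branch (multi-query, causal).
        q = self.q_proj(h).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k, v = self.kv_proj(h).split(self.d_head, dim=-1)
        k, v = k.unsqueeze(1), v.unsqueeze(1)            # broadcast the single K/V head to all heads
        scores = (q @ k.transpose(-2, -1)) / self.d_head ** 0.5
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), 1)
        scores = scores.masked_fill(causal, float("-inf"))
        attn = (F.softmax(scores, dim=-1) @ v).transpose(1, 2).reshape(b, t, -1)
        attn = self.o_proj(attn)

        # SwiGLU MLP branch: Swish(x W_gate) * (x W_up), projected back down.
        mlp = self.w_down(F.silu(self.w_gate(h)) * self.w_up(h))

        # Parallel formulation: y = x + Attention(LN(x)) + MLP(LN(x)).
        return x + attn + mlp

# Quick shape check with toy dimensions.
block = ParallelDecoderBlock()
print(block(torch.randn(2, 16, 512)).shape)              # torch.Size([2, 16, 512])
```

In the standard serial formulation the MLP would consume the output of the attention block; here both branches read the same normalized input, which is what allows their matrix multiplications to be fused and run concurrently.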

PaLM training on TPUs

PaLM's training leveraged a staggering 6,144 TPU v4 chips (Tensor Processing Units), specialized hardware designed by Google for machine learning workloads. This represents a significant leap in resources compared to previous models: Megatron-Turing NLG relied on 2,240 A100 GPUs, while Google's Gopher model used 4,096 TPU v3 chips.

How the model was actually trained across all of these chips, along with the rest of the details, is covered in the video.


Discussion

Of course, the models of that period (and even now) were somewhat "boring". It is true that they achieved amazing progress and very strong results on various benchmarks, and even surpassed human performance on some tasks, but without "real innovation", especially in terms of architecture: they are all variants of the Transformer architecture (Vaswani et al., 2017), which is why I described them as "boring" at the beginning of the article. The majority of the currently dominant models are, above all, the result of scaling.

The improvements to these models in recent years come mainly from the following:

1- Scaling the size of the model in both depth and width. This improves performance in a predictable way, following a power law (Kaplan et al., 2020); a small numeric illustration follows this list.

2- Scaling the dataset, or more precisely, the number of tokens the model sees during training. This also improves performance predictably, following a power law.

3- Quality of the dataset. Although huge datasets such as Common Crawl are available for training, they contain a lot of noise and redundancy.

4- Increasing model capacity without increasing the computational cost through sparsely activated modules.
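As a rough illustration of what "predictable according to a power law" means, the snippet below plugs PaLM's three model sizes into the parameter-count power law from Kaplan et al. (2020), L(N) = (N_c / N)^alpha_N. The constants are the approximate values reported in that paper, and the numbers are only meant to show the shape of the trend, not to reproduce PaLM's actual losses:

```python
# Illustrative only: approximate constants from Kaplan et al. (2020);
# real losses depend on the data, tokenizer, and training setup.
def predicted_loss(n_params, alpha_n=0.076, n_c=8.8e13):
    """Predicted cross-entropy loss as a function of non-embedding parameter count."""
    return (n_c / n_params) ** alpha_n

for n in (8e9, 62e9, 540e9):   # PaLM's three model sizes
    print(f"{n/1e9:>4.0f}B params -> predicted loss ~ {predicted_loss(n):.2f}")
```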

For the PaLM model, the focus was on three points: model size, data quality, and training efficiency. The model outperformed all previous language models on most tasks, whether in fine-tuning settings or in few-shot (N-shot) learning settings.

One interesting thing about the model was the "discontinuous improvements". Often, scaling up from 62B to 540B gave a performance improvement similar to the one from scaling 8B to 62B, which is consistent with a power law. But for some tasks there were "discontinuous improvements": scaling from 62B to 540B produced a much larger jump than the improvement seen between 8B and 62B. In other words, once the model reaches a certain scale, new capabilities suddenly begin to appear. The figures from the paper illustrate this clearly.


They note that these "discontinuous improvements" appear very clearly on tasks such as english_proverbs and logical_sequence. This implies that certain capabilities of the model only emerge once a certain scale is reached, which makes sense: a task like english_proverbs requires abstract reasoning to understand complex metaphors.

On tasks such as navigate and mathematical_induction, on the other hand, they observe only very modest gains even when going from 62B to 540B, which suggests that simply increasing model size does not necessarily translate into better performance on every task.

After the first version of PaLM came PaLM 2, reportedly an expansion of the previous model to around 1.1 trillion parameters (Google did not disclose an official size). This model powered Bard, but has since been replaced by Gemini.

All Google models (to my knowledge) after the first version of PaLM have been closed, following in the footsteps of its competitor OpenAI, whose models have been closed since GPT-2.

At the time of writing, 5/3/2024, and according to the results on most benchmarks, the best language model in terms of raw performance (regardless of cost or speed) is Google's Gemini 1.5 Ultra. To my knowledge, neither the parameter count nor the dataset size has been disclosed; my guess is somewhere between 1.2 and 2 trillion parameters. It was released about a week ago, and it outperforms GPT-4 on most tasks.

The Llama 3 model was also released a few weeks ago with around 70 billion parameters, and from a quick look at its benchmark results, I suspect they have done something beyond the four points mentioned above. I will avoid commenting on the model or making comparisons for now, because it is still being evaluated.

Finally, with the massive amount of data used to train these models, a crucial question arises: what about information leakage from benchmark test sets? In other words, how expensive is it to review such enormous training sets to ensure there is no overlap with the test data used to evaluate these models? This concern, known as "data contamination", is a growing issue in the field.
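For intuition, here is a deliberately simplified sketch of one common contamination heuristic: flag a benchmark example if it shares a long n-gram with the training corpus. Papers such as GPT-3 reported contamination analyses built on n-gram overlap, and the PaLM paper includes a similar dataset-contamination study; the whitespace tokenization, the 8-gram length, and the toy data below are purely illustrative choices:

```python
# Toy contamination check: does a benchmark example share any long n-gram
# with the training corpus? Real analyses operate at much larger scale and
# use more careful tokenization and thresholds.

def ngrams(text: str, n: int = 8):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(test_example: str, train_ngrams: set, n: int = 8) -> bool:
    """True if the test example shares at least one n-gram with the training data."""
    return bool(ngrams(test_example, n) & train_ngrams)

# Usage with toy data:
train_corpus = ["the quick brown fox jumps over the lazy dog near the river bank"]
train_ngrams = set().union(*(ngrams(doc) for doc in train_corpus))
print(is_contaminated("a fox jumps over the lazy dog near the river bank today", train_ngrams))
```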


Of course, the paper is very long, about 83 pages, so it would be tedious to cover everything here, but between this article and the video you can get the main conclusions.
