AI. Breaking News! Meta Launches the Llama 2 Base Model: A Deep Dive into Evaluation Metrics (MMLU)
Alicia Colmenero Fernández
Master in Artificial Intelligence. EdenAIs.
July 19, 2023
Summary: This news is important because
Meta (Facebook, to its friends) has publicly released its second-generation LLM: Llama 2 (Large Language Model Meta AI). This general-purpose base language model arrives not only to compete with ChatGPT, Bard, Claude, and the whole line-up of open-source models, but also as an opportunity for researchers and for developers who want to build and monetize products on top of it.
Large models like Llama, Falcon, Chinchilla, PaLM, MPT, and Bloom have an extremely high training cost: training runs are very long, they require vast amounts of data, and they demand substantial financial investment. We are talking about months of compute and millions of dollars.
They are trained on huge amounts of unlabelled text through self-supervised learning and excel across a wide range of tasks. Their capacity is enormous: they handle billions of parameters and open up great opportunities in many different fields.
Improvements
Meta provides the 7B, 13B, and 70B models, but this time it has not shared all the exact details of their training, although it has clarified that no data from Meta's products or services was used.
Where can I try it? Here
Download the weights by filling out the form with the terms of use and license. Here
Another summer batch of little goats, llamas, and vicuñas is on its way...
2. THAT LITTLE PROBLEM WITH METRICS.
One of the significant issues surrounding the release of Llama 2 is that of performance evaluation metrics. I've prepared a chart so you can see how Llama clearly stands out in its second version, especially in the larger model sizes.
But let's dig a little deeper into this topic: how do these metrics evaluate? What exactly do they evaluate? Why are there so many different metrics, and why do some models use some while others use others?
Given that the topic is broad, and considering that we are at the height of summer, this time we will focus exclusively on the first benchmark, one of the best known: MMLU. (Its turn has come.)
MMLU ("Massive Multitask Language Understanding").
It is a question-and-answer dataset used mainly to measure understanding and general world knowledge.
This dataset was manually compiled and contains 15,908 questions obtained from publicly available sources. The questions are grouped into 57 different tasks ranging from basic math to U.S. history, computer science, and law, among others, and are organized at different levels of difficulty, from elementary to professional.
They are multiple-choice questions with a single correct answer.
The evaluation uses a few-shot approach, i.e., the model is given 2 to 5 examples before being evaluated, or a zero-shot approach, where the questions are simply presented and the model must choose an answer without seeing any example of how to do so.
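To make this concrete, here is a minimal sketch in Python of how a few-shot prompt for this kind of multiple-choice evaluation can be assembled and scored. The example questions and the `model_predict` function are illustrative placeholders, not real MMLU items or the API of any particular evaluation harness.

```python
# Minimal sketch of few-shot, MMLU-style multiple-choice prompting.
# The questions below are made-up placeholders, not real MMLU items,
# and model_predict stands in for whatever LLM you actually call.

LETTERS = ["A", "B", "C", "D"]

def format_question(item, include_answer):
    """Render one item in the usual 'question + lettered options + Answer:' layout."""
    lines = [item["question"]]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, item["choices"])]
    lines.append("Answer:" + (f" {item['answer']}" if include_answer else ""))
    return "\n".join(lines)

def build_prompt(few_shot_examples, test_item):
    """Few-shot: solved examples first, then the unsolved test question."""
    blocks = [format_question(ex, include_answer=True) for ex in few_shot_examples]
    blocks.append(format_question(test_item, include_answer=False))
    return "\n\n".join(blocks)

def model_predict(prompt):
    """Placeholder: a real harness would ask the LLM to continue the prompt
    and keep only the first letter (A/B/C/D) it produces."""
    return "B"

if __name__ == "__main__":
    examples = [
        {"question": "What is 2 + 2?", "choices": ["3", "4", "5", "6"], "answer": "B"},
    ]
    test = {"question": "What is 7 * 6?", "choices": ["41", "42", "43", "44"], "answer": "B"}
    prompt = build_prompt(examples, test)  # zero-shot would pass an empty example list
    print(prompt)
    print("correct:", model_predict(prompt) == test["answer"])
```

Passing an empty example list gives the zero-shot variant; the only difference is whether the model sees solved questions before the one it must answer.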
In relation to this metric, there is a possibility that the model had partial access to the questions in the dataset. For this reason, the authors include a list of 123 websites belonging to 60 domains that could have contaminated the model. Since no further explanation is provided, we did some research and found that these websites are crawled and form part of Common Crawl's C4 dataset, one of the datasets used to train GPT-3, Llama, and possibly GPT-4.
If a language model has already seen the questions during its training, it could have memorized them.
However, it is important to note that the authors of MMLU do not provide specific statistics on what proportion of the questions could be part of those datasets. We inspected some of the links and found that some contain 250 questions, others 304, and others only 20 or 30. This suggests a wide contamination range, roughly from 7% to 39%. Having more precise data on this proportion would have been of great interest, as it would allow us to better assess the impact of artificially inflated scores on the results.
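As a rough illustration of how such a proportion could be estimated, the sketch below checks which benchmark questions appear verbatim in a piece of crawled text and reports the percentage. The questions and text are made up for the example; this is not the procedure the MMLU authors describe.

```python
# Rough sketch of a contamination estimate: what fraction of benchmark
# questions appear verbatim in a dump of crawled web text?
# The questions and crawl text below are made-up placeholders.

def contamination_rate(questions, crawl_text):
    """Count benchmark questions that appear verbatim in the crawled text."""
    found = sum(1 for q in questions if q in crawl_text)
    return found, len(questions), 100.0 * found / len(questions)

if __name__ == "__main__":
    questions = [
        "What is the capital of France?",
        "Which planet is known as the Red Planet?",
        "Who wrote 'Don Quixote'?",
    ]
    crawl_text = "Quiz of the week: What is the capital of France? Answer below..."
    found, total, pct = contamination_rate(questions, crawl_text)
    print(f"{found}/{total} questions found verbatim ({pct:.1f}%)")
    # Scaling the same idea to all 15,908 MMLU questions and a real crawl
    # dump would yield the kind of contamination percentage discussed above.
```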
However, the authors of MMLU argue that there is no indication that the models are exploiting memorized questions, since there is no positive correlation between accuracy and the average entropy of the generated answers.
To show this, they compare this relationship in a zero-shot setting, where the model also receives no answer, and in a few-shot setting, obtaining correlation coefficients of r = -0.43 and r = 0.56, respectively.
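For anyone who wants to run the same kind of check on their own results, here is a minimal sketch of the Pearson correlation between per-task accuracy and average answer entropy; the numbers are invented, only the procedure matters.

```python
# Sketch: Pearson correlation between per-task accuracy and average answer entropy.
# The two lists below are made-up numbers used only to show the procedure.
import math

def pearson(x, y):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

accuracy = [0.62, 0.48, 0.71, 0.55, 0.66]      # accuracy per task (illustrative)
avg_entropy = [1.10, 1.32, 0.95, 1.25, 1.05]   # mean answer entropy per task (illustrative)
print(f"r = {pearson(accuracy, avg_entropy):.2f}")
```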
These results are quite weak, especially considering that these are multiple-choice questions. It would have been interesting to explore different approaches, such as shuffling the order of the answer options, to evaluate whether the model had merely memorized options A, B, C, and D or was doing some kind of more advanced reasoning; a sketch of that idea follows.
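One way to carry out that check, sketched here under the same illustrative assumptions as before (a made-up item and no real model call): shuffle the answer options, remap the correct letter, and compare accuracy before and after. A model that had merely memorized letter positions would drop toward chance once the options move.

```python
# Sketch: test for positional memorization by shuffling answer options.
# The item is an illustrative placeholder, as in the earlier snippets.
import random

LETTERS = ["A", "B", "C", "D"]

def shuffle_item(item, rng):
    """Permute the choices and remap the correct letter accordingly."""
    order = list(range(len(item["choices"])))
    rng.shuffle(order)
    new_choices = [item["choices"][i] for i in order]
    new_answer = LETTERS[order.index(LETTERS.index(item["answer"]))]
    return {"question": item["question"], "choices": new_choices, "answer": new_answer}

if __name__ == "__main__":
    rng = random.Random(0)
    item = {"question": "What is 7 * 6?", "choices": ["41", "42", "43", "44"], "answer": "B"}
    shuffled = shuffle_item(item, rng)
    print(shuffled)  # same content, different letter positions
    # Evaluate the model on both the original and the shuffled items and
    # compare accuracy: a large drop would point to positional memorization.
```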
We have only just begun to address this topic, and it is already evident that we need common criteria that allow models to be evaluated more seriously and rigorously. Proposals such as the HELM (Holistic Evaluation of Language Models) benchmark show great potential in this regard.
As we progress in the development and application of artificial intelligence models, it is crucial to continue reflecting on these critical issues and work together to establish more unified and meaningful evaluation criteria.
And don't forget: if you like this content and want to support us so we can keep bringing you more interesting information, you can buy us a symbolic coffee on Ko-fi!