AI Breaking News! Meta Launches the Llama 2 Base Model: A Deep Dive into Evaluation Metrics (MMLU)



July 19, 2023

Summary: this news is important for two reasons:

  1. The model is what is known as a foundation model or base language model, the kind used to derive specialized or fine-tuned models in many different disciplines.
  2. Its license allows free use for both research and commercial purposes.


Meta (Facebook, to its friends) has publicly released its second-generation LLM: Llama 2 (Large Language Model Meta AI). This general-purpose base language model arrives not only to compete with ChatGPT, Bard, Claude, and the whole range of open-source models, but also as an opportunity for researchers and for developers who want to monetize their own products.

These large models, like Llama, Falcon, Chinchilla, PaLM, MPT, and BLOOM, are extremely expensive to train: training runs are very long, they require vast amounts of data, and they demand substantial financial investment. We are talking about months of compute and budgets that run into the millions of dollars.

They are trained on huge amounts of unlabelled text through self-supervised learning and stand out across a wide range of tasks. Their capacity is enormous: with billions of parameters, they open up great opportunities in many different fields.

Improvements

  • 40% more training tokens than Llama 1.
  • Safety improvements through Reinforcement Learning from Human Feedback (RLHF).
  • Longer context window (4,096 tokens).
  • Models with up to 70 billion parameters.

Meta provides 7B, 13B, and 70B models, but this time it has not shared all the exact details of its training data, although it has clarified that no data from Meta's products or services was used.

Where can I try it? Here.

To download the weights, fill out the form and accept the terms of use and license. Here.
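
Once access has been granted, the weights can be used with standard tooling. Below is a minimal sketch using the Hugging Face transformers library, assuming the license form has been accepted and you are logged in with an access token; the checkpoint name and generation settings are illustrative, not an official recipe.

# Minimal sketch: loading the Llama 2 base model with Hugging Face transformers.
# Assumes the license has been accepted and `huggingface-cli login` has been run.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # illustrative checkpoint name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# The base model only performs raw next-token completion (no chat fine-tuning).
inputs = tokenizer("The MMLU benchmark measures", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))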


Another summer batch of little goats, llamas, and vicuñas is on its way...

Synthetic image generated with Bing.


2. THAT LITTLE PROBLEM WITH METRICS.

One of the significant issues around the release of Llama 2 is its performance evaluation metrics. I have prepared a chart so you can see how much Llama stands out in its second version, especially in the larger model sizes.

Benchmark accuracy of the models according to the data provided by Meta. Chart by IA-ismo.


But let's dig a little deeper into this topic. How do these metrics evaluate? What exactly do they measure? Why are there so many different metrics, and why do some models report certain ones while others report others?

Given that the topic is broad and we are in the middle of summer, this time we will focus exclusively on the first benchmark, one of the best known: MMLU.

MMLU ("Massive Multitask Language Understanding").

It is a question-and-answer dataset used mainly to measure language understanding and general world knowledge.

The dataset was compiled manually and contains 15,908 questions collected from publicly available sources. The questions are grouped into 57 tasks, ranging from elementary mathematics to U.S. history, computer science, and law, and are organized at different levels of difficulty, from basic to professional.

They are multiple-choice questions with a single correct answer, exam style.
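
For reference, a public copy of the dataset can be inspected directly. Here is a minimal sketch with the Hugging Face datasets library, assuming the community-hosted copy under the name cais/mmlu (field names may differ in other copies):

# Minimal sketch: inspecting MMLU with the Hugging Face `datasets` library.
# The "cais/mmlu" name is an assumption (a community-hosted copy of the benchmark).
from datasets import load_dataset

mmlu = load_dataset("cais/mmlu", "all")  # all 57 tasks combined
example = mmlu["test"][0]
print(example["subject"])   # one of the 57 tasks
print(example["question"])  # the question text
print(example["choices"])   # the four answer options
print(example["answer"])    # index of the correct option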

Evaluation is done either few-shot, i.e., the model is shown a few worked examples (typically 2 to 5) before being tested, or zero-shot, where the questions are simply presented and the model must choose an answer without seeing any example of how to do so.
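
To make this concrete, here is a minimal sketch of how a few-shot MMLU-style prompt could be built and scored by comparing the likelihood the model assigns to each option letter. The lm_log_likelihood helper is a hypothetical placeholder (for example, a wrapper around a transformers model); this is not Meta's or the benchmark authors' actual evaluation code.

# Minimal sketch: build a few-shot multiple-choice prompt and pick the most likely option.
# `lm_log_likelihood(prompt, continuation)` is a hypothetical callable returning the
# model's log-probability of `continuation` given `prompt`.

LETTERS = ["A", "B", "C", "D"]

def format_question(question, options, answer=None):
    lines = [question]
    lines += [f"{letter}. {text}" for letter, text in zip(LETTERS, options)]
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)

def predict(few_shot_examples, question, options, lm_log_likelihood):
    # Few-shot: prepend worked examples (question, options, correct letter).
    prompt = "\n\n".join(format_question(q, o, a) for q, o, a in few_shot_examples)
    prompt += "\n\n" + format_question(question, options)
    # Score each candidate letter as a continuation and keep the most likely one.
    scores = {letter: lm_log_likelihood(prompt, " " + letter) for letter in LETTERS}
    return max(scores, key=scores.get)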

Regarding this metric, there is a possibility that the models had partial access to the questions in the dataset. For this reason, the authors include a list of 123 websites across 60 domains that could have contaminated the models. Since no further explanation is given, we did some research and found that these websites are crawled and form part of Common Crawl's C4 dataset, one of the datasets used to train GPT-3, Llama, and possibly GPT-4.

If a language model has already seen the questions during training, it could have memorized them.

However, it is important to note that the MMLU authors do not provide specific statistics on what proportion of the questions could be part of those datasets. We inspected some of the links and found that some contain around 250 questions, while others contain 304, 20, or 30. This points to a wide possible contamination range, roughly from 7% to 39%. Having more precise data on this proportion would have been very valuable, as it would allow us to estimate more reliably how much the scores might be artificially inflated.
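
As an illustration of the kind of check one could run, here is a minimal sketch that counts how many questions appear verbatim in documents fetched from the flagged URLs. The mmlu_questions and flagged_documents inputs are hypothetical placeholders, and a real analysis would need fuzzier matching than exact substrings:

# Minimal sketch: naive verbatim-overlap check between questions and flagged pages.
# Real contamination analysis would use normalization and n-gram overlap; this only
# catches exact matches.

def contamination_rate(mmlu_questions, flagged_documents):
    corpus = " ".join(doc.lower() for doc in flagged_documents)
    hits = sum(1 for question in mmlu_questions if question.lower() in corpus)
    return hits / len(mmlu_questions)

# Hypothetical usage:
# rate = contamination_rate(questions, documents)
# print(f"{rate:.1%} of questions appear verbatim in the flagged pages")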

Proportion of contaminated URLs by domain in the MMLU dataset (C4).

However, the MMLU authors argue that there is no indication that the models are simply memorizing the dataset, since there is no positive correlation between accuracy and the average entropy of the generated answers.

To do so, they compare this relationship in a zero-shot setting (where the model also receives no example answers) and in a few-shot setting, obtaining correlations of r = -0.43 and r = 0.56 respectively.

These results are rather weak. Especially given that these are multiple-choice questions, it would have been interesting to explore other approaches, such as shuffling the order of the answer options, to evaluate whether the model has simply memorized options A, B, C, and D or is doing some kind of more advanced reasoning.
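
As a sketch of that idea, each question could be re-evaluated with its options shuffled and the accuracy compared against the original ordering; a model that has only memorized letter positions should degrade sharply. The predict callable below is a hypothetical scorer that returns the index of the chosen option (for example, a wrapper around the prompt-scoring sketch above):

# Minimal sketch: shuffle answer options to test for positional memorization.
import random

def shuffle_options(options, correct_index, rng=random):
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    return shuffled, order.index(correct_index)  # where the right answer ended up

def shuffled_accuracy(questions, predict, rng=random):
    # `questions` is a list of (question_text, options, correct_index) tuples.
    correct = 0
    for text, options, answer in questions:
        shuffled, new_answer = shuffle_options(options, answer, rng)
        correct += (predict(text, shuffled) == new_answer)
    return correct / len(questions)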


We have only just begun to address this topic, and it is already evident that unified criteria are needed so that models can be evaluated more seriously and rigorously. Proposals such as the HELM (Holistic Evaluation of Language Models) benchmark show great potential in this regard.

As we progress in the development and application of artificial intelligence models, it is crucial to continue reflecting on these critical issues and work together to establish more unified and meaningful evaluation criteria.


And don't forget: if you like this content and want to support us so we can keep bringing you more interesting information, you can buy us a symbolic coffee on Ko-fi!



#AI #ArtificialIntelligence #News #BaseModel #Llama2 #Meta #EvaluationMetrics #MMLU #Innovation #Technology #AIUpdates #BigData #MachineLearning #NaturalLanguageProcessing #EthicalAI #ExplainableAI #InnovativeAI
