Comparing LLAMA 3.1 and GPT-4o: Summarization Benchmark Results

My team and I have been excited about Meta's release of Llama 3.1 this week, and we wanted to get our hands on the new model as soon as possible.

On most fronts, Llama 3.1 appears to be a very well-trained model that beats or rivals GPT-4o on most published benchmarks. However, since the Llama 3.1 announcement did not mention any summarization-related benchmarks such as ROUGE (Recall-Oriented Understudy for Gisting Evaluation) or BLEU (Bilingual Evaluation Understudy), we wanted to compare the summarization performance of the newly released Llama 3.1 against OpenAI's latest model, GPT-4o.

Test Design

To make the comparison fair, we used the following three financial news articles:

  1. Article 1
  2. Article 2
  3. Article 3

We decided to run the test with the following parameters (a sketch of the test harness follows the list):

  1. Selected Models: OpenAI GPT-4o, LLAMA 3.1-70B, LLAMA 3.1-405B.
  2. Task: Summarize three financial news articles.
  3. Iterations: 3 iterations per document for each model.
  4. Summary Length: 250 words.
  5. Assumptions: Token length ≈ 4 characters; only non-cardinal named entities are counted for the ETD calculation.
  6. Metric Used: Entity Token Density, or ETD (higher means more information in fewer words).
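
Under these parameters, here is a minimal sketch of how the test harness could look. It assumes the official OpenAI Python SDK for GPT-4o and an OpenAI-compatible endpoint for the Llama 3.1 models; the base URL, the Llama model identifiers, and the prompt wording are illustrative placeholders that depend on the hosting provider, not a record of our exact setup.

# Minimal sketch of the test harness, assuming the OpenAI Python SDK.
# The Llama 3.1 models are assumed to be served via an OpenAI-compatible
# endpoint, so the base_url and model identifiers below are placeholders.
from openai import OpenAI

PROMPT = "Summarize the following financial news article in about 250 words:\n\n{article}"

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
llama_client = OpenAI(base_url="https://your-llama-provider/v1", api_key="...")  # placeholder

MODELS = {
    "gpt-4o": openai_client,
    "llama-3.1-70b": llama_client,   # exact identifier depends on the provider
    "llama-3.1-405b": llama_client,  # exact identifier depends on the provider
}

ARTICLES = {"article_1": "...", "article_2": "...", "article_3": "..."}  # full article text goes here

def summarize(client: OpenAI, model: str, article: str) -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(article=article)}],
    )
    return response.choices[0].message.content

# 3 models x 3 articles x 3 iterations = 27 summaries in total (9 per model)
summaries = [
    (model, name, summarize(client, model, text))
    for model, client in MODELS.items()
    for name, text in ARTICLES.items()
    for _ in range(3)
]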

# Formula to calculate ETD
Entity Token Density (ETD) = Number of Non-Cardinal Named Entities / Number of Tokens
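
To make the metric concrete, the snippet below shows one way ETD could be computed. This is a sketch under the stated assumptions: spaCy's en_core_web_sm pipeline (our choice for illustration; any English NER model would do) provides the named entities, CARDINAL entities are excluded, and the token count is approximated as one token per four characters.

# Sketch of the ETD calculation, assuming spaCy for NER and the
# 4-characters-per-token approximation stated above.
import spacy

nlp = spacy.load("en_core_web_sm")  # any English NER pipeline could be substituted

def entity_token_density(summary: str) -> float:
    """Entity Token Density = non-cardinal named entities / approximate token count."""
    doc = nlp(summary)
    # Exclude bare numbers (spaCy's CARDINAL label), per the assumption above.
    non_cardinal_entities = [ent for ent in doc.ents if ent.label_ != "CARDINAL"]
    approx_tokens = max(len(summary) / 4, 1)  # 1 token ≈ 4 characters
    return len(non_cardinal_entities) / approx_tokens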

Results:

As soon as the results came in, we were blown away (see the box plot below). The box plot for Llama 3.1 is widely spread out, whereas that for GPT-4o is much tighter. The horizontal black line within each box represents the median value. While the median ETD is highest for OpenAI's GPT-4o, the maximum density value was produced by Llama 3.1-405B.

Box plots showing ETD for each LLM on the summarization task

Below is a table showing statistics of the ETD metric for each model over the nine summaries it produced.

The table shows statistics describing the distribution of ETD for each model

From the table above, it can be inferred that Llama 3.1-405B has the highest standard deviation, indicating a higher degree of variance in performance across iterations, whereas GPT-4o, with the highest median value, performs better across the majority of iterations.
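
As a rough illustration, the per-model statistics in the table can be reproduced with a few lines of pandas, assuming each summary's ETD has been recorded as one (model, article, iteration, etd) row. The numbers in records below are placeholder values for illustration only, not our actual results.

# Sketch of the per-model ETD aggregation. The rows in `records` are
# placeholders for illustration, not the actual benchmark measurements.
import pandas as pd

records = [
    ("gpt-4o", "article_1", 1, 0.10),          # placeholder value
    ("llama-3.1-405b", "article_1", 1, 0.12),  # placeholder value
]

df = pd.DataFrame(records, columns=["model", "article", "iteration", "etd"])
stats = df.groupby("model")["etd"].agg(["median", "mean", "std", "min", "max"])
print(stats.round(3))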

Comparison across LLMs by Article

We also observed that the results vary widely from article to article.

ETD values for each model and each article


  • For Article 1 - LLAMA-3.1-405B had the highest value of ETD.
  • For Article 2 - GPT-4o had the highest value of ETD; however, LLAMA-3.1-405B underperformed LLAMA-3.1-70B in terms of ETD.
  • For Article 3 - GPT-4o had the highest value of ETD.

Based on this initial test, we see that OpenAI's GPT-4o produces more consistent results than Llama 3.1-405B, as reflected in the standard deviation of the ETD values.

Based on these observations, it is difficult to pick either model as the 'best' one, so we will continue testing them on various types of documents.

Which of the two models do you think is better? Share your views in the comments below!


I just saw your post while doing research! I'm also working on summarization tasks using LLMs, so I want to share some of my experience; maybe we can exchange notes :) Do you think it is enough to compare only ETD values for evaluation? In my opinion, one approach could be to increase the number of articles (at least 100) and to have reference summaries, so that other evaluation metrics like BERTScore, BLEU, etc. can also be reported. I agree that might be time-consuming, so if the expected outputs are just abstractive text summaries without formatting, another approach could be to use an annotated open-source dataset to compare the models' performance.
