Comparing Llama 3.1 and GPT-4o: Summarization Benchmark Results
Aayush Agrawal
Co-Founder | Competitor Intelligence for Gaming | Fraud Detection for FSI | Gen AI Automation for Law Firms | Outcomes as a Service using AI
My team and I have been very excited since Meta's release of Llama 3.1 this week, and we wanted to get our hands on the new model as soon as possible.
On most fronts, Llama 3.1 appears to be a very well-trained model that beats or rivals GPT-4o on most benchmarks. Since Meta's Llama 3.1 announcement did not report any summarization-related benchmarks such as ROUGE (Recall-Oriented Understudy for Gisting Evaluation) or BLEU (Bilingual Evaluation Understudy), we wanted to compare the summarization performance of the newly released Llama 3.1 with that of OpenAI's latest model, GPT-4o.
Test Design
To make the comparison fair, we selected three financial news articles.
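The nine summaries per model reported below suggest three runs per article. A minimal sketch of how such a run could be scripted, assuming both models are exposed through OpenAI-compatible chat-completion endpoints (the Llama 3.1 base URL and model name here are hypothetical placeholders for whichever provider hosts the 405B model, and `articles` stands in for the three article texts):

```python
from openai import OpenAI

# Two clients; the Llama 3.1 endpoint and model name below are hypothetical
# placeholders for whichever provider hosts the 405B model.
gpt_client = OpenAI()  # reads OPENAI_API_KEY from the environment
llama_client = OpenAI(base_url="https://your-provider.example/v1", api_key="...")

PROMPT = "Summarize the following financial news article:\n\n{article}"

def summarize(client: OpenAI, model: str, article: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(article=article)}],
    )
    return resp.choices[0].message.content

# Three runs per article per model -> nine summaries per model.
# Default sampling temperature is kept so repeated runs actually vary.
summaries = {"gpt-4o": [], "llama-3.1-405b": []}
for article in articles:  # `articles` holds the three article texts
    for _ in range(3):
        for model, client in [("gpt-4o", gpt_client),
                              ("llama-3.1-405b", llama_client)]:
            summaries[model].append(summarize(client, model, article))
```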
We decided to score each summary on entity-token density (ETD), defined as:

Entity-token density (ETD) = Number of non-cardinal named entities / Number of tokens
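The post does not specify which NER system was used to count entities. A minimal sketch using spaCy (whose CARDINAL entity label covers bare numerals, so we exclude it) might look like this; note that the denominator includes punctuation tokens, so the exact values depend on the tokenizer:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # any English spaCy pipeline with NER

def entity_token_density(text: str) -> float:
    """ETD = non-cardinal named entities / total tokens."""
    doc = nlp(text)
    non_cardinal_ents = [ent for ent in doc.ents if ent.label_ != "CARDINAL"]
    return len(non_cardinal_ents) / len(doc)
```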
Results
As soon as the results came in, we were blown away (see the boxplot below)! The box for Llama 3.1 is widely spread out, whereas that of GPT-4o is much tighter. The horizontal black line within each box marks the median. While the median ETD is highest for OpenAI's GPT-4o, the maximum density value was produced by Llama 3.1-405B.
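For reference, a boxplot like the one described can be reproduced from per-summary ETD scores; `results` below is a hypothetical long-format DataFrame with one row per generated summary, and `gpt4o_scores` / `llama_scores` stand in for the nine ETD values of each model:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical long-format results: one row per generated summary
results = pd.DataFrame({
    "model": ["gpt-4o"] * 9 + ["llama-3.1-405b"] * 9,
    "etd": gpt4o_scores + llama_scores,  # two lists of nine ETD values each
})

ax = results.boxplot(column="etd", by="model", grid=False)
ax.set_ylabel("Entity-token density (ETD)")
plt.suptitle("")  # drop pandas' automatic grouping title
plt.show()
```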
Below is a table of summary statistics of ETD for each model across the nine summaries it produced.
From the table above, Llama 3.1-405B has the highest standard deviation, indicating greater variance in performance across iterations, whereas GPT-4o's highest median value indicates better performance in the majority of iterations.
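These statistics can be computed from the same hypothetical `results` DataFrame used for the boxplot above:

```python
# Per-model summary statistics of ETD over the nine summaries each
stats = results.groupby("model")["etd"].agg(["mean", "median", "std", "min", "max"])
print(stats.round(4))
```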
Comparison across LLMs by Articles
We also observed that the results vary widely from article to article.
Based on this initial test, GPT-4o produces more consistent results than Llama 3.1-405B, as seen in the standard deviation of the ETD values.
Based on the current observations, it is difficult to declare either model the 'best' one, so we will continue testing these models on various types of documents.
Which of the two models do you think is better? Share your views in comments below!
I just saw your post while doing research! I'm also working on summarization tasks with LLMs, so I want to share some of my experience; maybe we can exchange notes. :) Do you think it is enough to compare only ETD values for evaluation? In my opinion, one approach would be to increase the number of articles (to at least 100) and to include reference summaries so that other evaluation metrics such as BERTScore and BLEU can also be computed. I agree that this might be time-consuming, so if the expected outputs are just abstractive text summaries without formatting, another approach would be to use an annotated open-source dataset for comparing the models' performance.