Comparing Llama 3.1 and GPT-4o: Summarization Benchmark Results
Aayush Agrawal
Co-Founder | Competitor Intelligence for Gaming | Fraud Detection for FSI | Gen AI Automation for Law Firms | Outcomes as a Service using AI
My team and I have been very excited since Meta's release of Llama 3.1 this week, and we wanted to get our hands on the new model as soon as possible.
On most fronts, Llama 3.1 appears to be a very well-trained model that beats or rivals GPT-4o on most benchmarks. Since Meta's Llama 3.1 announcement did not report any summarization-related benchmarks such as ROUGE (Recall-Oriented Understudy for Gisting Evaluation) or BLEU (Bilingual Evaluation Understudy), we wanted to compare the summarization performance of the newly released Llama 3.1 with that of OpenAI's latest model, GPT-4o.
Test Design
To make the comparison fair, we selected three financial news articles.
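The nine summaries per model reported below suggest three runs per article. A minimal sketch of how such a run could be scripted, assuming both models are exposed through OpenAI-compatible chat-completion endpoints (the Llama 3.1 base URL and model name here are hypothetical placeholders for whichever provider hosts the 405B model, and `articles` stands in for the three article texts):

```python
from openai import OpenAI

# Two clients; the Llama 3.1 endpoint and model name below are hypothetical
# placeholders for whichever provider hosts the 405B model.
gpt_client = OpenAI()  # reads OPENAI_API_KEY from the environment
llama_client = OpenAI(base_url="https://your-provider.example/v1", api_key="...")

PROMPT = "Summarize the following financial news article:\n\n{article}"

def summarize(client: OpenAI, model: str, article: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(article=article)}],
    )
    return resp.choices[0].message.content

# Three runs per article per model -> nine summaries per model.
# Default sampling temperature is kept so repeated runs actually vary.
summaries = {"gpt-4o": [], "llama-3.1-405b": []}
for article in articles:  # `articles` holds the three article texts
    for _ in range(3):
        for model, client in [("gpt-4o", gpt_client),
                              ("llama-3.1-405b", llama_client)]:
            summaries[model].append(summarize(client, model, article))
```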
We decided to score each summary on entity-token density (ETD), defined as:

Entity-token density (ETD) = Number of non-cardinal named entities / Number of tokens
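The post does not specify which NER system was used to count entities. A minimal sketch using spaCy (whose CARDINAL entity label covers bare numerals, so we exclude it) might look like this; note that the denominator includes punctuation tokens, so the exact values depend on the tokenizer:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # any English spaCy pipeline with NER

def entity_token_density(text: str) -> float:
    """ETD = non-cardinal named entities / total tokens."""
    doc = nlp(text)
    non_cardinal_ents = [ent for ent in doc.ents if ent.label_ != "CARDINAL"]
    return len(non_cardinal_ents) / len(doc)
```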
Results
As soon as the results came in, we were blown away (see the boxplot below)! The box for Llama 3.1 is widely spread out, whereas that of GPT-4o is much tighter. The horizontal black line within each box marks the median. While the median ETD is highest for OpenAI's GPT-4o, the maximum density value was produced by Llama 3.1-405B.
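For reference, a boxplot like the one described can be reproduced from per-summary ETD scores; `results` below is a hypothetical long-format DataFrame with one row per generated summary, and `gpt4o_scores` / `llama_scores` stand in for the nine ETD values of each model:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical long-format results: one row per generated summary
results = pd.DataFrame({
    "model": ["gpt-4o"] * 9 + ["llama-3.1-405b"] * 9,
    "etd": gpt4o_scores + llama_scores,  # two lists of nine ETD values each
})

ax = results.boxplot(column="etd", by="model", grid=False)
ax.set_ylabel("Entity-token density (ETD)")
plt.suptitle("")  # drop pandas' automatic grouping title
plt.show()
```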
Below is a table of summary statistics of ETD for each model across the nine summaries it produced.
From the table above, Llama 3.1-405B has the highest standard deviation, indicating greater variance in performance across iterations, whereas GPT-4o's highest median value indicates better performance in the majority of iterations.
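These statistics can be computed from the same hypothetical `results` DataFrame used for the boxplot above:

```python
# Per-model summary statistics of ETD over the nine summaries each
stats = results.groupby("model")["etd"].agg(["mean", "median", "std", "min", "max"])
print(stats.round(4))
```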
Comparison across LLMs by Articles
We also observed that the results vary widely from article to article.
Based on this initial test, GPT-4o produces more consistent results than Llama 3.1-405B, as seen in the standard deviation of the ETD values.
Based on the current observations, it is difficult to declare either model the 'best' one, so we will continue testing these models on various types of documents.
Which of the two models do you think is better? Share your views in comments below!
I just saw your post while doing research! I'm also working on summarization tasks with LLMs, so I want to share some of my experience; maybe we can exchange notes. :) Do you think it is enough to compare only ETD values for evaluation? In my opinion, one approach would be to increase the number of articles (to at least 100) and to include reference summaries so that other evaluation metrics such as BERTScore and BLEU can also be computed. I agree that this might be time-consuming, so if the expected outputs are just abstractive text summaries without formatting, another approach would be to use an annotated open-source dataset for comparing the models' performance.