Are all AIs the same?
Source: Image by Author

Almost everyone first started using LLMs through ChatGPT back in November 2022, which really opened the gates of GenAI to the general public.

It wasn't the first LLM, and many others have appeared since. Just take a look at this list of launch timelines dating back to 1954! But they all compete with OpenAI to claim they are the new best alternative.

So... are they?

Are LLMs really different in how they work or in the outputs they give the average Joe? Or are they all similar, just packaged a bit differently with a different logo?

Do we know what tradeoffs we are making when we pick them? And should we care?

I ran a short, simple exercise to compare some key features across four of the top providers, using the latest trend of providing context via documents (RAG, for Retrieval Augmented Generation), to find out who came out on top - and the results were surprising!


How can I trust these AIs?

You will have seen countless anecdotal posts or memes about LLM hallucinations - when the model just makes things up. And it seems inevitable for LLMs to hallucinate, just as humans do on topics where they lack the experience (i.e. no accurate information exists in the model), have been given misinformation, etc. Definition below:

Hallucination: When an LLM generates plausible but factually incorrect output that deviates from the context, user input or world knowledge [Zhang et al. (2023)].

It is important to understand that "making things up" is a big part of the beauty of human creativity: imagining and creating fictitious content that is not grounded in real truths. That same trait, however, also impacts an LLM's accuracy.

Enter RAG

Enter Retrieval Augmented Generation to save the day! This article summarizes and depicts RAG (and Finetuning, which we will explain shortly) very well:

Retrieval-Augmented Generation (RAG) enhances the performance of LLMs on domain specific tasks by providing the model with an external source of information.

You can easily see below how steps 2 to 4 shape what reaches the LLM before it produces its output in step 6.

Source: T-RAG: Lessons from the LLM Trenches (arXiv:2402.07483v2)
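If it helps to see the idea in code, below is a minimal, self-contained sketch of that flow in Python. It is illustrative only: a naive keyword overlap stands in for embedding-based retrieval, the toy document string is invented, and the final generation call to a real provider API is left as a comment.

```python
# Minimal sketch of the RAG flow (illustrative only).

def split_into_chunks(text: str, size: int = 200) -> list[str]:
    """Split a source document into fixed-size word chunks for retrieval."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(question: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Rank chunks by naive keyword overlap with the question
    (a real system would use embeddings and a vector store here)."""
    q_words = set(question.lower().split())
    ranked = sorted(chunks, key=lambda c: len(q_words & set(c.lower().split())), reverse=True)
    return ranked[:top_k]

def build_prompt(question: str, context: list[str]) -> str:
    """Augment the user prompt with the retrieved context (steps 2-4),
    so the LLM generates its answer grounded in it (step 6)."""
    return ("Answer using only the context below.\n\n"
            "Context:\n" + "\n---\n".join(context) +
            "\n\nQuestion: " + question)

# Toy usage: 'document' would be the text extracted from the uploaded PDF.
document = "Providers of general-purpose AI models must publish summaries of training data ..."
question = "What must AI providers offer?"
prompt = build_prompt(question, retrieve(question, split_into_chunks(document)))
# answer = some_llm_api(prompt)  # the provider-specific generation call
```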

What about Finetuning? What does it do?

We will not be benchmarking Finetuning today - that will be in my next post in this series - but I would like to share its definition and how it differs from RAG, so you understand exactly what we are evaluating today.

Finetuning is a method of incorporating domain knowledge into an LLM's parametric memory by updating the model's weights through training on a domain-specific labeled dataset, such as a questions-and-answers dataset for Q&A applications [Min et al. (2017)].

So to run that exercise I would need to give examples of expected behaviour to each of the LLMs being tested, on top of running the RAG benchmark, so that the models weight their generation towards content similar to the correct answers and therefore produce truthful ones - truthful at least according to my tuning examples, which could still all be made up!
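For completeness, here is a minimal sketch of what that weight-update step can look like, assuming the Hugging Face transformers and datasets libraries and a small open model (distilgpt2, purely for illustration); the Q&A example is invented and not part of today's benchmark.

```python
# Minimal finetuning sketch: update an LLM's weights on a labeled Q&A dataset.
# Illustrative only - model choice and the Q&A example are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from datasets import Dataset

model_name = "distilgpt2"                      # any small causal LM works for a demo
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token                  # GPT-2 family has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

qa_pairs = [  # invented domain examples - in practice, hundreds or thousands of them
    {"q": "Which AI practices does the regulation prohibit?",
     "a": "Practices posing unacceptable risk, such as social scoring by public authorities."},
]

def to_features(example):
    text = f"Question: {example['q']}\nAnswer: {example['a']}{tok.eos_token}"
    enc = tok(text, truncation=True, max_length=256, padding="max_length")
    enc["labels"] = enc["input_ids"].copy()    # causal LM objective: predict the same tokens
    return enc

train_ds = Dataset.from_list(qa_pairs).map(to_features, remove_columns=["q", "a"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-demo", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=train_ds,
)
trainer.train()  # this is the step that changes the model's parametric memory
```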

Still confused about what they do or when to use them? This picture will help.


Now onwards to the actual Experiment...


AI Benchmarking Experiment Setup

Here is the base setup I used on OpenAI's ChatGPT 4o, Mistral AI's Le Chat "Large", Perplexity's "Pro", and of course Google's Gemini, with the following guidelines:

  • starting with the same Initial Prompt (or as close as I could get).
  • using the same document (or a link to the document page) as the main source.
  • picking the Pro or Large models available.

I then structured the comparison of the AIs' main features by evaluating them on:

  1. Prompting Experience - how easy it is to get started with a RAG-based document prompt.
  2. Model Outputs - including length, accuracy and completeness (relative), style, and references or citations.
  3. Follow-up Prompting - how well they behave when generating new creative content after the initial RAG-focused prompt response.

Our Problem Statement

One of the biggest aspects of AI, and one that will be debated for many years to come, is how we govern such powerful tools and what roles and responsibilities their providers, and the businesses that build on top of them, should have. This technology will surely empower us but, as with all great inventions, it also creates new and crucial safety risks.

The European Union's Artificial Intelligence Act starts to answer this in a 400+ page document that helps keep us abreast... if you have the time and energy to read it and the countless regulations that it impacts or references.

You may be thinking: Why not use AI to summarize it for us?

That's exactly what I did... The first prompt, alongside the upload of the PDF document, was set as follows:

I want you to act as an expert in Gen AI Technology and also on AI Policies focusing on People freedom of information. Please summarize the content of this document focusing on what the various AI Providers will have to offer, and any guidance for new companies and people using AI services.

Let's see how our models behaved against the 3 core features mentioned.


1. Prompting Experience

Let's start off focusing on how easy it is to use the tool and get an initial prompt and response.

TL;DR: They were pretty much the same for the basics of getting started: all four UIs follow the familiar chat-bot pattern, all providers keep history, and all respond pretty quickly.

Source: Mistral's Large model UI for author prompt


The input experience, however, started to vary right from the beginning...

ChatGPT 4o and Perplexity Pro

ChatGPT and Perplexity both made adding the document pretty simple. A simple "Attach" allowed uploading the document and adding text to it as a prompt.

Source: Image by Author.

This feels more natural, letting the power of RAG, as popularized by Meta's AI research team, do its work.

Score: 3 points.

Mistral Large and Google Gemini

These two providers, though, offered no option to upload a file in the base UI, so the initial prompt had to be modified to reference the document we want.

I want you to act as an expert in Gen AI Technology and also on AI Policies focusing on People freedom of information. Please summarize the content of the document found in https://data.consilium.europa.eu/doc/document/PE-24-2024-INIT/en/pdf focusing on what the various AI Providers will have to offer, and any guidance for new companies and people using AI services.

The good news is that it worked, so my comparison still made sense to continue after the first 10 minutes.

Score: 1 point.


2. Model Output

Now the fun starts... we actually get an answer from our LLMs and can compare metrics such as length of text, accuracy and completeness (subjective, but comparable between them!), and style.

For all providers' raw model outputs, please see this PDF attachment.

Length of Output (longer to shorter)

We will run through the models from longest to shortest output - although this metric should be balanced against the richness of the content and personal preference, as it could well be that 'less is more'.

Perplexity Pro took the lead at 506 words, or 465 if you remove the citation descriptions. Score: 4 points.

ChatGPT 4o was a close second with 406 words. Score: 3 points.

Mistral Large came an honorable third with 369 words generated. Score: 2 points.

Google's Gemini came last (not to be confused with Disney's Jiminy Cricket) with a modest 110 words. Score: 1 point.
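For reference, the lengths above are simple word counts. If you want to run the same comparison on outputs of your own, a basic whitespace count is enough; the citation-stripping option below is just an assumption about how citation lines look and may not match exactly how I tallied the 465 figure.

```python
def word_count(text: str, drop_citation_lines: bool = False) -> int:
    """Count whitespace-separated words, optionally skipping lines that look
    like citation entries (assumed here to start with '[', e.g. '[1] ...')."""
    lines = text.splitlines()
    if drop_citation_lines:
        lines = [ln for ln in lines if not ln.lstrip().startswith("[")]
    return len(" ".join(lines).split())

# e.g. word_count(perplexity_output) vs word_count(perplexity_output, True)
```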


Completeness

Full Disclaimer: I did not read the full European Union's Artificial Intelligence Act (if you have, I salute you!). But I did read through the main sections and compared outputs to find blatant differences, omissions, and errors.

Perplexity Pro

Perplexity provided a summary paragraph explaining what the document is about and what it emphasizes.

It then grouped the main corpus of the document into 2 sections: a) Obligations for AI Providers; and b) Guidance for New Companies and Users - although this grouping is not formally in the document.

Each section was relevant, although not in the order it appears in the document. All the information read correctly and no errors were found.

Some aspects of the document were omitted or lacked depth, such as specifics on the classification of token sizes or feedback-loop mechanism controls - but then, this was a summary. It did conclude with a nice paragraph summarizing the core of what the regulation aims to accomplish.

Score: 4 points.

ChatGPT 4o

A very short summary was provided at the start, and then OpenAI's model did the same thing as its Perplexity rival, grouping the answer into 2 sections: a) Offerings and Obligations for AI Providers; and b) Guidance for New Companies and Users of AI Services.

Each section was accurate but more simplistic, with fewer subsections, so it suffered on completeness. On one hand this made the text easy to understand, but at the same time it lacked some specifics.

ChatGPT also concluded with a view of the benefits these regulations will bring to AI and the public interest.

Score: 2 points.

Mistral Large

Mistral provided a good summary of the document as an intro, and was the first model to group its answer into 3 sections: a) AI Providers; b) Guidance for New Companies; and c) People Using AI Services.

The actual content seemed better structured and included some aspects the other models didn't. As this was not a fully RAG-based approach and likely used site scraping on top of a regular LLM, that could explain why it was able to output details that do not appear many times throughout the document. This is speculation, of course, without knowing all the ins and outs of the base model.

This model too ended with a clear and concise summary of the purpose the regulations serve for the development of AI and for human considerations.

Score: 3 points.

Google's Gemini

Google once again lagged behind. Their model provided an ultra-simplistic summary paragraph without an intro, grouping of content, or conclusion. The text was accurate but so high-level that it didn't really add much more than the original prompt.

It closed out by asking if I wanted it to "summarize the document on a different aspect", which inspired no confidence.

Score: 1 point.


Style

To close out this part of the contest, let's look at style including any summary, descriptive paragraphs, bullet points, and conclusions.

Perplexity Pro provided a formal, document-like structure that, when copied, carried markdown formatting. The tone was very similar to the one in the document. Score: 4 points.

ChatGPT 4o had a slightly lighter tone than the content of the document, but structure and markdown formatting were also included. Score: 4 points.

Mistral Large output plain text. The tone is similar to the one in the document but slightly personalised. Score: 4 points.

Google Gemini's tone and style were formal but dry. It did not include any styling in the output, which is disappointing. Score: 1 point.

References and Citations

Perplexity Pro

Perplexity was the only model showcasing external references, with 20 web references found - see some examples below.

However, the actual output only referred to the document, making me wonder what it uses the external references for - hence the low score.

This will need further digging into, comparing RAG documents with conflicting information online to determine which source the AI model will choose. If you have ideas on what to test, or would like to see a specific example, please drop me a note!

Source: PerplexityAI from Author Prompt.


Score: 2 points.

ChatGPT 4o

No sources were provided other than the document itself, and it did not show which section of the document each point came from.

Score: 1 point.

Mistral Large

No sources referenced.

Score: 0 points.

Google's Gemini

No sources referenced other than the website provided.

Score: 0 points.


Half-time score is:
#1 Perplexity Pro - 17 points
#2 ChatGPT 4o - 13 points
#3 Mistral Large - 10 points
#4 Google Gemini -  4 points        
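For transparency, these totals are simply the sum of the category scores awarded above; the small tally below (scores copied straight from this article) reproduces them, and the follow-up scores from the next section can be appended in the same way.

```python
# Per-category scores as awarded above; totals match the half-time scoreboard.
scores = {
    "Perplexity Pro": {"prompting": 3, "length": 4, "completeness": 4, "style": 4, "citations": 2},
    "ChatGPT 4o":     {"prompting": 3, "length": 3, "completeness": 2, "style": 4, "citations": 1},
    "Mistral Large":  {"prompting": 1, "length": 2, "completeness": 3, "style": 4, "citations": 0},
    "Google Gemini":  {"prompting": 1, "length": 1, "completeness": 1, "style": 1, "citations": 0},
}

for model, per_category in sorted(scores.items(), key=lambda kv: -sum(kv[1].values())):
    print(f"{model}: {sum(per_category.values())} points")
# Perplexity Pro: 17, ChatGPT 4o: 13, Mistral Large: 10, Google Gemini: 4
```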

3. Follow-up Prompting

I decided to follow up with a simple question that required a bit of creativity beyond the document provided. Take this comparison with a pinch of salt, as a single prompt is not enough - and most likely neither are two or three... this needs much more testing and benchmarking, which I will write about in future articles.

Back to the Follow-up:

Can you specify the obligations that anyone deploying an AI system or service should provide in accordance to this regulation?

This time I summarize each model's performance against the previous metrics in a single paragraph.

Perplexity Pro

Our first model provided a new answer including new web references, although none were actually cited in the text. The content felt fresh and made sense in light of the question and the previous answer.

Score: 3 points.

ChatGPT 4o

Provided a good answer with structured bullet points. The content was fresh and it covered aspects not in the previous prompt response. Score: 3 points.

Mistral Large

Mistral also kept up on content accuracy and presentation, with a new list of considerations. Score: 3 points.

Google Gemini

Surprisingly, Google's Gemini at first did not want to answer, fielding the question with:

Source: Google Gemini from author Prompt.

But I didn't give up... and prompted further to get an answer.

Please do so to the best of your abilities from the information provided in the original document.

It then returned a very short summary with 4 bullet points of high-level info that did not add much beyond the original response.

Score: 1 point.


Conclusion

In conclusion, the experiment was quite fun and I hope it gives you an initial view when you set out to choose an AI. It is likely they will all keep advancing and competing to become the best, so staying up-to-date and aware of their enhancements will be key to getting the best results.

  • Regarding Prompting Experience, all four models behaved similarly, with some minor differences in RAG support that didn't seem to impact this particular outcome at first glance. This needs further exploration.
  • On Model Outputs, arguably the most important comparison, there were key distinctions between providers: from the ability to explore web content in detail, to a complete lack of citations, to frankly poor outputs from one of the main players (making them less powerful and practical).
  • Follow-up Prompting had results similar to the main output, with only Google's model initially resisting continuing the conversation.

Final ranking:
1. Perplexity Pro is the winner at 20 points!
2. OpenAI's ChatGPT 4o comes 2nd with 16 points.
3. Mistral Large bagged a respectable 13 points.
4. Google Gemini disappointed, racking up a mere 5 points.

Hope you enjoyed this article. I'll be writing more about AI, so follow me on LinkedIn; I occasionally post on Twitter/X, and will be writing some articles on Medium.
