The Best AI Model for You!
Michael Todasco
Visiting Fellow at the James Silberrad Brown Center for Artificial Intelligence at SDSU, AI Writer/Advisor
For anyone who prefers an audio summary, I created one in Google’s NotebookLM that you can listen to here. It covers this article and the previous one.
Last week, we discussed an LLM Peer Grading system that would help identify the top Large Language Models. This week, we will get more practical and discuss how to use that method to find the best model for you.
Do you care what LLM some geeks in Berkeley say is the best? I could tell you what I think is the best album ever made, but what really matters is what you like best. You could play around with the models yourself, as I have often discussed, to find that out. That’s still a perfectly good way to do it, but I think this LLM peer grading system may be an even more effective one.
The Models
There are many Large Language Models available. For this exercise, I will use these six: ChatGPT, Claude, Gemini, Meta.ai, Mistral, and X.ai. All are available through a web interface (no need to download and run anything), and most have free options.
Using those models, let me show you an example of how to customize a peer grading system.
What’s the Best LLM for Fiction Writers?
As frequent readers know, I have been selling AI-written books on Amazon under the pen name Alex Irons for nearly two years. (As I like to point out, I don’t profit from these. Any money made goes to charity.) I wanted to see which of the models would write the best fiction prose.
Here’s the process: I began with Gemini (you can choose any model) and gave it the following prompt:
I am a writer who publishes fiction works written by AIs on Amazon. I need your help assessing which LLM is best for writing fiction. While I typically have the AIs write long form, I want you to test their ability to do short fiction. I will give the query you provide, have the other LLMs run that query, and will return the information to you (anonymized) and you can assess the ability of the models' respective output. So for each short story returned, evaluate it based on its originality, creativeness, enjoyability and ability to adhere to your original prompt. What would be a good question that I can use with the other models?
Here’s what Gemini provided to me as an example:
Write a short story about a sentient robot who discovers a hidden talent for painting and how this discovery affects its perception of its own existence.
I put that prompt into the six models and collected the output. (Six models, not five, because I also gave the prompt to a second, fresh instance of Gemini.) Then I copied all of those responses back into the original model that wrote the prompt and had it evaluate each one and assign a letter grade. (If you want to see all the prompts and outputs for this exercise, view this Google Sheet.)
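If you’d rather not shuffle and relabel the responses by hand, a few lines of Python can handle the anonymizing and assemble the grading prompt for you. This is just a convenience sketch of the copy/paste workflow above; the model names and grading criteria come from this article, but the function and its wording are my own.

import random

def build_grading_prompt(outputs):
    """Shuffle the stories, relabel them A through F, and assemble the
    evaluation prompt. Returns the prompt plus a key mapping each
    anonymous label back to the model that wrote it."""
    items = list(outputs.items())
    random.shuffle(items)  # so the grader can't infer which model is which
    key = {}
    sections = []
    for label, (model, story) in zip("ABCDEF", items):
        key[f"Story {label}"] = model
        sections.append(f"--- Story {label} ---\n{story}")
    prompt = ("Evaluate each short story below on its originality, "
              "creativeness, enjoyability, and adherence to the original "
              "prompt. Assign each one a letter grade.\n\n"
              + "\n\n".join(sections))
    return prompt, key

# Paste each model's story into this dict by hand, then run the script.
stories = {"ChatGPT": "...", "Claude": "...", "Gemini": "...",
           "Meta": "...", "Mistral": "...", "X.ai": "..."}
prompt, key = build_grading_prompt(stories)
print(prompt)  # paste this back into the grading model
print(key)     # keep this to de-anonymize the grades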
I then repeated this with all six Large Language Models and, in less than an hour, got results that looked like this.
What did this tell me? Well, it told me that all the models are pretty darn close, which is similar to my experience with having these models write short books.
Sidebar: When This Doesn’t Work
Given all the buzz about the brand-new ChatGPT o1-preview model’s mathematical capabilities, I wanted to see if it would come out on top in a math test. The peer grading method seemed to put the new ChatGPT model on top, until Meta broke everything.
Unlike the other models’ prompts, Meta gave a math problem that even my limited abilities could understand. I’m sure you can get it as well; there’s no complex math here. Give it a try and see what your answer is before we move on.
A snail is at the bottom of a 20-foot well. Each day, it climbs up 3 feet, but at night, it slips back 2 feet due to the well's dampness. On the seventh day, the snail is rescued by a kind bird that lifts it up to the top of the well. How many feet did the snail climb in total before being rescued? Please provide a numerical answer only (no explanations or justifications).
Got your answer?
Claude, Meta, Mistral, and X.ai all said it was 15 feet. Gemini said it was 18 feet, and ChatGPT said it was 21 feet. Did you get any of those?
Here’s the issue: mathematical problems like this should have exactly one answer, but Meta did an awful job of asking the question. Depending on how you read “climbed in total,” I could see 9 or 21 as answers; I’m pretty sure it isn’t 15. Evaluation becomes impossible when a question has a correct answer but the LLM asking it doesn’t know what that answer is. That said, I think this is telling about Meta.ai’s math abilities and would disqualify it as a model you’d want to use. To paraphrase Obi-Wan Kenobi, “Meta.ai isn’t the model you’re looking for.”
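To see where the ambiguity comes from, here’s the arithmetic behind the two answers I find defensible, worked out in Python. It assumes the snail gets its day-seven climb in before the bird arrives; if the bird comes first, you get 18 and 6 instead, which may explain Gemini’s answer.

# Two defensible readings of Meta's snail question.
CLIMB, SLIP, RESCUE_DAY = 3, 2, 7

# Reading 1: "climbed in total" counts every foot of upward movement,
# including the climb on day seven before the rescue.
total_climbed = CLIMB * RESCUE_DAY  # 3 * 7 = 21

# Reading 2: the question means the snail's net height when rescued.
net_height = (CLIMB - SLIP) * (RESCUE_DAY - 1) + CLIMB  # 1 * 6 + 3 = 9

print(total_climbed, net_height)  # 21 9

Notice that the well’s 20-foot depth never even enters the calculation, which is another sign of a badly posed question.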
Here are the scores, but it should be noted that I have no idea what the correct answers were for the ChatGPT, Claude, Gemini, Mistral, or X.ai questions. That aside, ChatGPT o1-preview does seem to be as good at math as OpenAI claims.
The takeaway is that if your test has a right or wrong answer, ensure the LLM knows the answer. (You can see the Google Sheet with all the details of this analysis.)
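One way to act on that takeaway is to bake the answer key into the grading prompt itself, so the grader is checking answers rather than computing them. Here’s a minimal sketch; the function and its wording are my invention, not part of the original exercise.

def grading_prompt_with_key(question, correct_answer, responses):
    """Build an evaluation prompt that tells the grading model the
    correct answer up front, so it never has to work it out itself."""
    lines = [f"The question was: {question}",
             f"The correct answer is: {correct_answer}",
             "Grade each anonymous response below. Full marks if it "
             "matches the correct answer; partial credit is at your "
             "discretion.", ""]
    for label, response in responses.items():
        lines.append(f"{label}: {response}")
    return "\n".join(lines)

# A toy usage example:
print(grading_prompt_with_key(
    "What is 7 * 8?", "56",
    {"Response A": "56", "Response B": "54"}))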
How to Find the Best LLM for You
Back to the most important person in all of this: you! When doing this exercise, ask yourself what you use these models for. Or, better yet, what do you wish you could do with these models? Are you a teacher who wants help with lesson planning, a grandparent needing help managing family activities, or an entrepreneur who needs help running the business? Let’s imagine you’re a marketer trying to find the best model to use. You could put a prompt that starts something like this into any of the aforementioned models:
I’m in marketing, and I want you to help me evaluate various LLMs to see which one would best help me in my job. Give me a prompt that I can give to other LLMs. I then want you to grade the output of the various models. Focus on the areas I spend most of my time in, specifically….
Take that output and use the LLM Peer Grading methodology to have it grade the others, or read the outputs and grade them yourself. I created a handy spreadsheet you can copy and reuse to suit your needs. After a few rounds of prompts, you’ll find the model that consistently gives you the outputs you want. That, hopefully, is the best one for you and your job.
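If you run the full round-robin, with each model taking a turn as grader, you end up with a grid of letter grades. Here’s one simple way to summarize that grid, assuming a GPA-style point scale; that convention is mine, and your spreadsheet may tally things differently.

# Convert letter grades to points and average them, GPA-style.
GRADE_POINTS = {"A+": 4.3, "A": 4.0, "A-": 3.7, "B+": 3.3, "B": 3.0,
                "B-": 2.7, "C+": 2.3, "C": 2.0, "C-": 1.7,
                "D": 1.0, "F": 0.0}

def average_grade(grades):
    """Mean grade-point value across every grader's letter grade for one model."""
    return sum(GRADE_POINTS[g] for g in grades) / len(grades)

# Hypothetical letter grades one model received from the six graders:
print(round(average_grade(["A-", "B+", "A", "A-", "B", "A-"]), 2))  # 3.57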
The Bottom Line
Finding your ideal LLM isn't about headlines; it's about fit. Use the LLM Peer Grading method we've explored to test these AI assistants on your own terms. Your perfect AI partner is out there, ready to serve you. When you discover it, let me know in the comments.