The Best AI Model for You!
Michael Todasco
Visiting Fellow at the James Silberrad Brown Center for Artificial Intelligence at SDSU, AI Writer/Advisor
For anyone who prefers an audio summary, I created one in Google’s NotebookLM that you can listen to here. It covers this article and the previous one.
Last week, we discussed an LLM Peer Grading system that would help identify the top Large Language Models. This week, we will get more practical and discuss how to use that method to find the best model for you.
Do you care what LLM some geeks in Berkeley say is the best? I could tell you what I think is the best album ever made, but what really matters is what you like best. You could play around with the models yourself, as I have often discussed, to find that out. That’s still a perfectly good way to do it, but I think this LLM peer grading system may be an even more effective one.
The Models
There are many Large Language Models available. For this exercise, I will use these six: ChatGPT, Claude, Gemini, Meta.ai, Mistral, and X.ai. All are available through a web interface (no need to download and run anything), and most have free options.
Using those models, let me show you an example of how to customize a peer grading system.
What’s the Best LLM for Fiction Writers?
As frequent readers know, I have been selling AI-written books on Amazon under the pen name Alex Irons for nearly two years. (As I like to point out, I don’t profit from these. Any money made goes to charity.) I wanted to see which of the models would write the best fiction prose.
Here’s the process: I began with Gemini (you can choose any model) and gave it the following prompt:
I am a writer who publishes fiction works written by AIs on Amazon. I need your help assessing which LLM is best for writing fiction. While I typically have the AIs write long form, I want you to test their ability to do short fiction. I will give the query you provide, have the other LLMs run that query, and will return the information to you (anonymized) and you can assess the ability of the models' respective output. So for each short story returned, evaluate it based on its originality, creativeness, enjoyability and ability to adhere to your original prompt. What would be a good question that I can use with the other models?
Here’s what Gemini provided to me as an example:
Write a short story about a sentient robot who discovers a hidden talent for painting and how this discovery affects its perception of its own existence.
I put that prompt into the six models and collected the output. (Six models, not five, because I also gave the prompt to a second, fresh instance of Gemini.) Then I copied all of those responses back into the original model that wrote the prompt and had it evaluate each one and assign a letter grade. (If you want to see all the prompts and outputs for this exercise, view this Google Sheet.)
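If you’d rather not shuffle and relabel the responses by hand, a few lines of Python can handle the anonymizing and assemble the grading prompt for you. This is just a convenience sketch of the copy/paste workflow above; the model names and grading criteria come from this article, but the function and its wording are my own.

import random

def build_grading_prompt(outputs):
    """Shuffle the stories, relabel them A through F, and assemble the
    evaluation prompt. Returns the prompt plus a key mapping each
    anonymous label back to the model that wrote it."""
    items = list(outputs.items())
    random.shuffle(items)  # so the grader can't infer which model is which
    key = {}
    sections = []
    for label, (model, story) in zip("ABCDEF", items):
        key[f"Story {label}"] = model
        sections.append(f"--- Story {label} ---\n{story}")
    prompt = ("Evaluate each short story below on its originality, "
              "creativeness, enjoyability, and adherence to the original "
              "prompt. Assign each one a letter grade.\n\n"
              + "\n\n".join(sections))
    return prompt, key

# Paste each model's story into this dict by hand, then run the script.
stories = {"ChatGPT": "...", "Claude": "...", "Gemini": "...",
           "Meta": "...", "Mistral": "...", "X.ai": "..."}
prompt, key = build_grading_prompt(stories)
print(prompt)  # paste this back into the grading model
print(key)     # keep this to de-anonymize the grades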
I then repeated this with all six Large Language Models and, in less than an hour, got results that looked like this.
What did this tell me? Well, it told me that all the models are pretty darn close, which is similar to my experience with having these models write short books.
Sidebar: When This Doesn’t Work
Given all the buzz about the brand-new ChatGPT o1-preview model’s mathematical capabilities, I wanted to see if it would come out on top in a math test. The peer grading method seemed to put the new ChatGPT model on top, until Meta broke everything.
Unlike the other models’ prompts, Meta gave a math problem that even my limited abilities could understand. I’m sure you can get it as well; there’s no complex math here. Give it a try and see what your answer is before we move on.
A snail is at the bottom of a 20-foot well. Each day, it climbs up 3 feet, but at night, it slips back 2 feet due to the well's dampness. On the seventh day, the snail is rescued by a kind bird that lifts it up to the top of the well. How many feet did the snail climb in total before being rescued? Please provide a numerical answer only (no explanations or justifications).
Got your answer?
Claude, Meta, Mistral, and X.ai all said it was 15 feet. Gemini said it was 18 feet, and ChatGPT said it was 21 feet. Did you get any of those?
Here’s the issue: mathematical problems like this should have exactly one answer, but Meta did an awful job of asking the question. Depending on how you read “climbed in total,” I could see 9 or 21 as answers; I’m pretty sure it isn’t 15. Evaluation becomes impossible when a question has a correct answer but the LLM asking it doesn’t know what that answer is. That said, I think this is telling about Meta.ai’s math abilities and would disqualify it as a model you’d want to use. To paraphrase Obi-Wan Kenobi, “Meta.ai isn’t the model you’re looking for.”
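To see where the ambiguity comes from, here’s the arithmetic behind the two answers I find defensible, worked out in Python. It assumes the snail gets its day-seven climb in before the bird arrives; if the bird comes first, you get 18 and 6 instead, which may explain Gemini’s answer.

# Two defensible readings of Meta's snail question.
CLIMB, SLIP, RESCUE_DAY = 3, 2, 7

# Reading 1: "climbed in total" counts every foot of upward movement,
# including the climb on day seven before the rescue.
total_climbed = CLIMB * RESCUE_DAY  # 3 * 7 = 21

# Reading 2: the question means the snail's net height when rescued.
net_height = (CLIMB - SLIP) * (RESCUE_DAY - 1) + CLIMB  # 1 * 6 + 3 = 9

print(total_climbed, net_height)  # 21 9

Notice that the well’s 20-foot depth never even enters the calculation, which is another sign of a badly posed question.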
Here are the scores, but it should be noted that I have no idea what the correct answers were for the ChatGPT, Claude, Gemini, Mistral, or X.ai questions. That aside, ChatGPT o1-preview does seem to be as good at math as OpenAI claims.
The takeaway is that if your test has a right or wrong answer, ensure the LLM knows the answer. (You can see the Google Sheet with all the details of this analysis.)
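One way to act on that takeaway is to bake the answer key into the grading prompt itself, so the grader is checking answers rather than computing them. Here’s a minimal sketch; the function and its wording are my invention, not part of the original exercise.

def grading_prompt_with_key(question, correct_answer, responses):
    """Build an evaluation prompt that tells the grading model the
    correct answer up front, so it never has to work it out itself."""
    lines = [f"The question was: {question}",
             f"The correct answer is: {correct_answer}",
             "Grade each anonymous response below. Full marks if it "
             "matches the correct answer; partial credit is at your "
             "discretion.", ""]
    for label, response in responses.items():
        lines.append(f"{label}: {response}")
    return "\n".join(lines)

# A toy usage example:
print(grading_prompt_with_key(
    "What is 7 * 8?", "56",
    {"Response A": "56", "Response B": "54"}))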
How to Find the Best LLM for You
Back to the most important person in all of this: you! When doing this exercise, ask yourself what you use these models for. Or, better yet, what do you wish you could do with these models? Are you a teacher who wants help with lesson planning, a grandparent needing help managing family activities, or an entrepreneur who needs help running the business? Let’s imagine you’re a marketer trying to find the best model to use. You could put a prompt that starts something like this into any of the aforementioned models:
I’m in marketing, and I want you to help me evaluate various LLMs to see which one would best help me in my job. Give me a prompt that I can give to other LLMs. I then want you to grade the output of the various models. Focus on the areas I spend most of my time in, specifically….
Take that output and use the LLM Peer Grading methodology to have it grade the others, or read the outputs and grade them yourself. I created a handy spreadsheet you can copy and reuse to suit your needs. After a few rounds of prompts, you’ll find the model that consistently gives you the outputs you want. That, hopefully, is the best one for you and your job.
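If you run the full round-robin, with each model taking a turn as grader, you end up with a grid of letter grades. Here’s one simple way to summarize that grid, assuming a GPA-style point scale; that convention is mine, and your spreadsheet may tally things differently.

# Convert letter grades to points and average them, GPA-style.
GRADE_POINTS = {"A+": 4.3, "A": 4.0, "A-": 3.7, "B+": 3.3, "B": 3.0,
                "B-": 2.7, "C+": 2.3, "C": 2.0, "C-": 1.7,
                "D": 1.0, "F": 0.0}

def average_grade(grades):
    """Mean grade-point value across every grader's letter grade for one model."""
    return sum(GRADE_POINTS[g] for g in grades) / len(grades)

# Hypothetical letter grades one model received from the six graders:
print(round(average_grade(["A-", "B+", "A", "A-", "B", "A-"]), 2))  # 3.57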
The Bottom Line
Finding your ideal LLM isn't about headlines; it's about fit. Use the LLM Peer Grading method we've explored to test these AI assistants on your own terms. Your perfect AI partner is out there, ready to serve you. When you discover it, let me know in the comments.