Using AI to Make AI: Claude 3.5 vs. Gemini 1.5 vs. ChatGPT 4-o vs. GitHub Copilot


Which LLM Chatbot is best at writing AI code? Surprise!

I did the head-to-head comparison so you don’t have to.

Over the past few weeks I’ve been developing a number of AI projects to train various models (more on the specific models later). These include time-series forecasting, computer vision, and LLM-based applications. I’ve learned some important insights from these experiments, and I’m happy to share them.

In these experiments, I left 98%+ of the AI coding to the LLM. I strategically write prompts, copy-paste code, let it train/predict for a few hours, and copy-paste results (fixing errors in between).

Using AI to create better AI...what could go wrong?

This conclusion surprised even me: Claude 3.5 Sonnet is the clear winner.

Sure, Claude 3.5 advertises slightly higher scores on many coding benchmarks, but having worked in DirectX for many years, I’ve always been skeptical of small differences in benchmark scores. This is a different story. Even when Google recently made its surprise announcement of its all-powerful Gemini 1.5 Pro, claiming it outperformed all other LLMs, it still lost slightly to Claude on coding.

The Final Scorecard

You’ll notice I didn’t include GitHub Copilot in this table… that’s because its performance was so bad in initial tests that I didn’t even try it across further scenarios.

Libraries Used

Here are some of the machine learning libraries and models used throughout these various experiments:

  • YOLOv8
  • H2O
  • ChatGPT 4-o API
  • CLIP
  • TensorFlow
  • Google Cloud Vision
  • Pillow
  • OpenCV
  • Tesseract (OCR)
  • EAST
  • Scikit-learn / KMeans

Some of these libraries (such as OpenCV and Tesseract, as shown below) could be used inside ChatGPT 4-o’s query analyzer, but were unfortunately not sufficiently capable for my needs.

Score Breakdown

Here’s a breakdown of the scores. We’ll go into more detail on each in future blog posts.

Knowledge of AI Algorithms and Libraries

Bringing some of my initial prompts over from ChatGPT 4-o to Gemini, it was clear that Gemini 1.5 could reason far better about which models and algorithms to use and how to combine them. It did a much better job of laying out tradeoffs and expectations for each model.

Claude performed as well as Gemini 1.5 did on planning, and substantially better on coding.

Code Reliability and Quality

Gemini’s codegen left much to be desired: lots of syntax errors, and it would easily fall into traps where a fix for issue #1 would cause issue #2, and the fix for issue #2 would regress issue #1 (I’ll call this an LLM Regression Trap). This happens to ChatGPT all the time.

Across multiple projects, I only ran into an LLM Regression Trap with Claude around 3 times, and it was quickly resolved by starting a new chat with all of my code files.

To date, Claude consistently outperforms (albeit slightly, score-wise) ChatGPT, Llama, and Gemini in coding benchmarks. I have to say, this 2% difference in benchmark score appears to have a substantial impact on overall coding ability.

Ability to Retain Context

This is where ChatGPT 4-o really hurts. It struggles to retain context in a larger-than-tiny codebase. It will quickly forget code and functions that don’t happen to be relevant to the particular discussion over the last few prompts.

Gemini and Claude advertise larger context windows. I’m always skeptical of context-window stats, as LLMs still tend to forget important things as their context window fills. Nonetheless, I found that Claude and Gemini generally performed well at remembering projects with 10-12 code files. Once in a while, Claude would drop a feature we hadn’t used in a while, but you could reliably get it to add the feature back.

Ability to Run Code

ChatGPT is the only one... for now

ChatGPT is the only LLM that offers the ability to run code using its query analyzer. This tool is incredibly powerful for small tasks, and I use it all the time in my day-to-day life (Resize this PDF! Manipulate this image! Graph this or that!). I’ve shared some best practices in the past for how to get the most use out of it.

That being said, its execution environment is limited. This is as far as I could push it when it comes to AI algorithms:

You can run OpenCV for basic contouring


…no, it can’t generate results like this ‘out of the box’. This image was the result of multiple phases of histogramming and clustering (which, to its credit, the query analyzer was able to do).
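For the curious, here’s a minimal sketch of the kind of basic contouring the query analyzer could handle (file names and paths are hypothetical; uploads in ChatGPT’s sandbox land under /mnt/data):

```python
import cv2

# Load the uploaded image in grayscale (hypothetical path --
# files uploaded to ChatGPT land under /mnt/data).
img = cv2.imread("/mnt/data/photo.jpg", cv2.IMREAD_GRAYSCALE)

# Binarize with Otsu's threshold, then extract external contours.
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

# Draw the contours on a color copy and save the result.
out = cv2.cvtColor(img, cv2.COLOR_GRAY2BGR)
cv2.drawContours(out, contours, -1, (0, 255, 0), 2)
cv2.imwrite("/mnt/data/contours.jpg", out)
```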

You can use Tesseract for basic OCR

Unsurprisingly, the results on handwritten text aren’t as good as just asking ChatGPT to OCR it.
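For reference, the whole pass is only a few lines via pytesseract; a minimal sketch (the file path is hypothetical):

```python
import pytesseract
from PIL import Image

# Open the uploaded scan (hypothetical path in ChatGPT's sandbox).
image = Image.open("/mnt/data/scan.png")

# Run Tesseract's default English model over the whole page.
print(pytesseract.image_to_string(image))
```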

On a limited basis, you can actually upload a trained model file for these libraries to use:

ChatGPT doesn’t have access to the internet, but you can download a fairly large model file yourself, upload it to ChatGPT, and it’ll run it!
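As one hypothetical example of this pattern: the EAST text detector from the library list above ships as a frozen TensorFlow graph that OpenCV’s DNN module can load directly, so you can download the weights, upload them to the chat, and run inference in the sandbox. A sketch, assuming the standard published file name:

```python
import cv2

# Load the uploaded EAST text-detector weights via OpenCV's DNN module
# (file name as published; path assumes ChatGPT's /mnt/data upload dir).
net = cv2.dnn.readNet("/mnt/data/frozen_east_text_detection.pb")

# EAST expects input dimensions that are multiples of 32.
img = cv2.imread("/mnt/data/sign.jpg")
blob = cv2.dnn.blobFromImage(img, 1.0, (320, 320),
                             (123.68, 116.78, 103.94), swapRB=True, crop=False)
net.setInput(blob)

# Forward pass: confidence scores and rotated-box geometry maps.
scores, geometry = net.forward(["feature_fusion/Conv_7/Sigmoid",
                                "feature_fusion/concat_3"])
print(scores.shape, geometry.shape)
```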


This was fun to do from my iPhone.

IDE Integration

None of these systems has official IDE integration… yet. So there’s still a great deal of copy-pasting outputs from the LLM into your code files, and then copy-pasting the results back to the LLM. I’ll write an article shortly about the state of the art and some tools I’m building (or rather, having Claude build) to address this in the short term.

Claude has the best “agentic coding” capabilities, and so far the best plugin I’ve found for Visual Studio Code is Claude Dev. It will read the files in your project, and when it proposes changes you can authorize them with one click.

It’s not quite ready for prime time, though. Claude Dev only recently added the ability to re-run a command (before that, a Claude error would cripple the whole chat), and it doesn’t yet have the kind of robust LLM chat management you really need for a meaningful, productive co-development session with your favorite LLM.

Web Interface Performance

Normally, this wouldn’t even be a category worth scoring. Unfortunately, the performance of Claude’s web GUI degrades so badly as conversations get longer that it needs to be called out: past a certain chat length, it becomes totally unusable.

Right now I’m using TypingMind to work around this. I must admit that TypingMind is so good that calling it a ‘workaround’ is a little demeaning. TypingMind is fantastic. You can run it locally, which means storing and backing up your LLM chats.

While TypingMind is the cure for Claude’s terrible web-GUI performance, it doesn’t address the secondary problem with Claude: it sends the full chat history to the API every time, which means you burn through tokens faster and faster as your chats get longer. The $35 chat is the least of your problems here. Paid users will run up against the 2M token limit quickly, and the Anthropic sales team, which has to process your ticket before your tier gets upgraded, is severely backlogged right now.
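To see why the cost snowballs, here’s a back-of-the-envelope sketch (the per-turn token count is an assumption, not Anthropic’s pricing): resending the full history every turn makes cumulative input tokens grow roughly quadratically with conversation length.

```python
# Illustrative only: assumes ~2,000 tokens of new text per turn.
TOKENS_PER_TURN = 2_000

history = 0
total_input = 0
for turn in range(100):
    total_input += history        # full history resent as input
    history += TOKENS_PER_TURN    # this turn's text joins the history

print(f"After 100 turns: ~{total_input:,} input tokens")
# ~9,900,000 tokens -- vs ~200,000 if only each new turn were sent
```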

In Summary

There are a surprising number of tradeoffs when choosing which LLM to co-develop your AI code with. Fortunately, the platforms are evolving quickly, and a rich ecosystem of third-party solutions is rapidly filling the gaps.

In the future, I’ll do a few dedicated posts on IDE tools and the many co-development best practices. The best way to stay up to speed is to follow me on LinkedIn.
