Using AI to Make AI: Claude 3.5 vs. Gemini 1.5 vs. ChatGPT 4-o vs. GitHub Copilot


Which LLM Chatbot is best at writing AI code? Surprise!

I did the head-to-head comparison so you don’t have to.

Over the past few weeks I’ve been developing a number of AI projects to train various models (more on the specific models later). These include time-series forecasting, computer vision, and LLM-based applications. I’ve learned some important insights from these experiments, and I’m happy to share them.

In these experiments, I left 98%+ of the AI coding to the LLM. I strategically write prompts, copy-paste code, let it train/predict for a few hours, and copy-paste results (fixing errors in between).

Using AI to create better AI...what could go wrong?

This conclusion surprised even me: Claude 3.5 Sonnet is the clear winner.

Sure, Claude 3.5 advertises slightly higher scores on many coding benchmarks, but having worked in DirectX for many years, I’ve always been skeptical of small differences in benchmark scores. This is a different story. Even when Google recently made its surprise announcement of its all-powerful Gemini 1.5 Pro, claiming it outperformed all other LLMs, it still lost slightly to Claude on coding.

The Final Scorecard

You’ll notice I didn’t include GitHub Copilot in this table… that’s because its performance was so bad in initial tests that I didn’t even try it across further scenarios.

Libraries Used

Here are some of the machine learning libraries and models used throughout these various experiments:

  • YOLOv8
  • H2O
  • ChatGPT 4-o API
  • CLIP
  • TensorFlow
  • Google Cloud Vision
  • Pillow
  • OpenCV
  • Tesseract (OCR)
  • EAST
  • Scikit-learn / KMeans

Some of these libraries (such as OpenCV and Tesseract, as shown below) could be used inside ChatGPT 4-o’s query analyzer, but were unfortunately not sufficiently capable for my needs.

Score Breakdown

Here’s a breakdown of the scores. We’ll go into more detail on each in future blog posts.

Knowledge of AI Algorithms and Libraries

Bringing some of my initial prompts over from ChatGPT 4-o to Gemini, it was clear that Gemini 1.5 could reason far better about which models and algorithms to use and how to combine them. It did a much better job of laying out tradeoffs and expectations for each model.

Claude performed as well as Gemini 1.5 did on planning, and substantially better on coding.

Code Reliability and Quality

Gemini’s codegen left much to be desired: lots of syntax errors, and it would easily fall into traps where a fix for issue #1 would cause issue #2, and the fix for issue #2 would regress issue #1 (I’ll call this an LLM Regression Trap). This happens to ChatGPT all the time.

Across multiple projects, I only ran into an LLM Regression Trap with Claude around 3 times, and it was quickly resolved by starting a new chat with all of my code files.

To date, Claude consistently outperforms (albeit slightly, score-wise) ChatGPT, Llama, and Gemini in coding benchmarks. I have to say, this 2% difference in benchmark score appears to have a substantial impact on overall coding ability.

Ability to Retain Context

This is where ChatGPT 4-o really hurts. It struggles to retain context in a larger-than-tiny codebase. It will quickly forget code and functions that don’t happen to be relevant to the particular discussion over the last few prompts.

Gemini and Claude advertise larger context windows. I’m always skeptical of context-window stats, as LLMs still tend to forget important things as their context window fills. Nonetheless, I found that Claude and Gemini generally performed well at remembering projects with 10-12 code files. Once in a while, Claude would drop a feature we hadn’t used in a while, but you could reliably get it to add the feature back.

Ability to Run Code

ChatGPT is the only one... for now

ChatGPT is the only LLM that offers the ability to run code using its query analyzer. This tool is incredibly powerful for small tasks, and I use it all the time in my day-to-day life (Resize this PDF! Manipulate this image! Graph this or that!). I’ve shared some best practices in the past for how to get the most use out of it.

That being said, its execution environment is limited. This is as far as I could push it when it comes to AI algorithms:

You can run OpenCV for basic contouring


…no, it can’t generate results like this ‘out of the box’. This image was the result of multiple phases of histogramming and clustering (which, to its credit, the query analyzer was able to do).
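For the curious, here’s a minimal sketch of the kind of basic contouring the query analyzer could handle (file names and paths are hypothetical; uploads in ChatGPT’s sandbox land under /mnt/data):

```python
import cv2

# Load the uploaded image in grayscale (hypothetical path --
# files uploaded to ChatGPT land under /mnt/data).
img = cv2.imread("/mnt/data/photo.jpg", cv2.IMREAD_GRAYSCALE)

# Binarize with Otsu's threshold, then extract external contours.
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

# Draw the contours on a color copy and save the result.
out = cv2.cvtColor(img, cv2.COLOR_GRAY2BGR)
cv2.drawContours(out, contours, -1, (0, 255, 0), 2)
cv2.imwrite("/mnt/data/contours.jpg", out)
```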

You can use Tesseract for basic OCR

Unsurprisingly, the results on handwritten text aren’t as good as just asking ChatGPT to OCR it.
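For reference, the whole pass is only a few lines via pytesseract; a minimal sketch (the file path is hypothetical):

```python
import pytesseract
from PIL import Image

# Open the uploaded scan (hypothetical path in ChatGPT's sandbox).
image = Image.open("/mnt/data/scan.png")

# Run Tesseract's default English model over the whole page.
print(pytesseract.image_to_string(image))
```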

On a limited basis, you can actually upload a trained model file for these libraries to use:

ChatGPT doesn’t have access to the internet, but you can download a fairly large model file yourself, upload it to ChatGPT, and it’ll run it!
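As one hypothetical example of this pattern: the EAST text detector from the library list above ships as a frozen TensorFlow graph that OpenCV’s DNN module can load directly, so you can download the weights, upload them to the chat, and run inference in the sandbox. A sketch, assuming the standard published file name:

```python
import cv2

# Load the uploaded EAST text-detector weights via OpenCV's DNN module
# (file name as published; path assumes ChatGPT's /mnt/data upload dir).
net = cv2.dnn.readNet("/mnt/data/frozen_east_text_detection.pb")

# EAST expects input dimensions that are multiples of 32.
img = cv2.imread("/mnt/data/sign.jpg")
blob = cv2.dnn.blobFromImage(img, 1.0, (320, 320),
                             (123.68, 116.78, 103.94), swapRB=True, crop=False)
net.setInput(blob)

# Forward pass: confidence scores and rotated-box geometry maps.
scores, geometry = net.forward(["feature_fusion/Conv_7/Sigmoid",
                                "feature_fusion/concat_3"])
print(scores.shape, geometry.shape)
```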


This was fun to do from my iPhone.

IDE Integration

None of these systems has official IDE integration… yet. So there’s still a great deal of copy-pasting outputs from the LLM into your code files, and then copy-pasting the results back to the LLM. I’ll write an article shortly about the state of the art and some tools I’m building (or rather, having Claude build) to address this in the short term.

Claude has the best “agentic coding” capabilities, and so far the best plugin I’ve found for Visual Studio Code is Claude Dev. It will read the files in your project, and when it proposes changes you can authorize them with one click.

It’s not quite ready for prime time, though. Claude Dev only recently added the ability to re-run a command (before that, a Claude error would cripple the whole chat), and it doesn’t yet have the kind of robust LLM chat management you really need for a meaningful, productive co-development session with your favorite LLM.

Web Interface Performance

Normally, this wouldn’t even be a category worth scoring. Unfortunately, the performance of Claude’s web GUI degrades so badly as conversations get longer that it needs to be called out: past a certain chat length, it becomes totally unusable.

Right now I’m using TypingMind to work around this. I must admit that TypingMind is so good that calling it a ‘workaround’ is a little demeaning. TypingMind is fantastic. You can run it locally, which means storing and backing up your LLM chats.

While TypingMind is the cure for Claude’s terrible web-GUI performance, it doesn’t address the secondary problem with Claude: it sends the full chat history to the API every time, which means you burn through tokens faster and faster as your chats get longer. The $35 chat is the least of your problems here. Paid users will run up against the 2M token limit quickly, and the Anthropic sales team, which has to process your ticket before your tier gets upgraded, is severely backlogged right now.
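To see why the cost snowballs, here’s a back-of-the-envelope sketch (the per-turn token count is an assumption, not Anthropic’s pricing): resending the full history every turn makes cumulative input tokens grow roughly quadratically with conversation length.

```python
# Illustrative only: assumes ~2,000 tokens of new text per turn.
TOKENS_PER_TURN = 2_000

history = 0
total_input = 0
for turn in range(100):
    total_input += history        # full history resent as input
    history += TOKENS_PER_TURN    # this turn's text joins the history

print(f"After 100 turns: ~{total_input:,} input tokens")
# ~9,900,000 tokens -- vs ~200,000 if only each new turn were sent
```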

In Summary

There are a surprising number of tradeoffs when choosing which LLM to co-develop your AI code with. Fortunately, the platforms are evolving quickly, and a rich ecosystem of third-party solutions is rapidly filling the gaps.

In the future, I’ll do a few dedicated posts on IDE tools and the many co-development best practices. The best way to stay up to speed is to follow me on LinkedIn.
