Benchmarking o1 on Advent of Code 2024
1. Main Aim of the Evaluation
The primary goal of this evaluation is to assess how o1 performs compared to GPT-4-turbo in solving Advent of Code (AoC) challenges, focusing on two key metrics:
- Accuracy: How often the model successfully solves problems on the first attempt.
- Peak Performance: How well the model handles the hardest challenges, including whether it can solve problems after iterative refinement.
For context, my evaluation of GPT-4-turbo in 2023 followed a similar approach. Frequently, I needed to use different prompting strategies, such as Chain of Thought (CoT), or switch the programming language from Golang to Python or JavaScript, to achieve success. The prompt I used for each challenge with o1 was:
Write a Golang program that solves the following coding challenge. The program should read the input from input.txt and print the answer to the console. <CHALLENGE DESCRIPTION>
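For reference, every generated program is expected to follow the same minimal contract: read the puzzle input from input.txt and print the answer to the console. The skeleton below is a hypothetical example of that shape; the placeholder logic is mine, not one of o1's actual solutions.

```go
// Hypothetical skeleton of the kind of program the prompt asks for:
// read puzzle input from input.txt, compute an answer, print it to the console.
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	data, err := os.ReadFile("input.txt")
	if err != nil {
		panic(err)
	}
	lines := strings.Split(strings.TrimSpace(string(data)), "\n")

	// Placeholder "answer": here just the number of input lines.
	// A real solution implements the day's puzzle logic instead.
	answer := len(lines)
	fmt.Println(answer)
}
```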
In this 2024 evaluation, o1 has not been trained on the actual AoC 2024 tasks. However, AoC problems share a high degree of similarity across years, so it is reasonable to assume that the model has been trained on analogous problems.
All evaluations were conducted using Golang, emphasizing both correctness and performance of solutions. All Golang solutions from o1 for 2024 are available in my Hugging Face dataset here. I have also written an article discussing this dataset, which you can read here.
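To give a concrete picture of how correctness and execution time are checked, here is a minimal sketch of a harness that runs a generated solution and records how long it takes. The directory layout (dayNN/solution.go next to dayNN/input.txt) is an assumption for illustration, not necessarily the exact setup behind the dataset.

```go
// Minimal sketch of a harness that runs a generated Golang solution in its own
// directory (where input.txt lives), captures its printed answer, and times it.
package main

import (
	"fmt"
	"os/exec"
	"strings"
	"time"
)

func runSolution(dir string) (answer string, elapsed time.Duration, err error) {
	cmd := exec.Command("go", "run", "solution.go")
	cmd.Dir = dir // the generated solution reads input.txt from its own directory

	start := time.Now()
	out, err := cmd.Output()
	elapsed = time.Since(start)
	if err != nil {
		return "", elapsed, err
	}
	return strings.TrimSpace(string(out)), elapsed, nil
}

func main() {
	// Hypothetical example: run the Day 1 solution and report answer and timing.
	answer, elapsed, err := runSolution("day01")
	if err != nil {
		fmt.Println("run failed:", err)
		return
	}
	fmt.Printf("day01: answer=%s time=%s\n", answer, elapsed)
}
```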
2. Progress Update: Days 1-15
As of writing, I have evaluated o1 on all tasks up to Day 15. Here are the key results:
- Days 1-10 (all parts): Single-shot success.
- Day 11, Part 2: Solved after a few iterations.
- Day 12, Day 14, and Day 15 (Part 2 in each case): The model failed to solve these even with iterative prompting and several fresh attempts, including one in which I asked it to write the solution in Python instead.
This gives o1 a success rate of roughly 90% for tasks up to Day 15, with strong performance on the easier tasks and somewhat weaker results on the more complex ones. The model sometimes struggles to move from a simple solution that works at the small iteration counts of Part 1 to the more involved approach Part 2 requires, and it often has difficulty handling trickier conditions.
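A recurring pattern behind these Part 2 failures is that the straightforward simulation that works at Part 1 scale becomes infeasible once the iteration count explodes, and a counting-based rethink is needed. The toy example below is my own generic illustration of that shift (a made-up rule, not one of the actual 2024 puzzles): it first simulates the full list of elements, then tracks only how many copies of each value exist.

```go
// Toy illustration of the Part 1 -> Part 2 pattern: each step, every digit d is
// replaced by two digits, (d+1)%10 and (d+3)%10, and the answer is how many
// digits exist after a given number of steps. This rule is invented for
// illustration; it is not one of the 2024 puzzles.
package main

import "fmt"

// naive keeps every digit explicitly, as a simple Part 1 solution might:
// the slice doubles in length every step.
func naive(start []int, steps int) int {
	cur := append([]int(nil), start...)
	for i := 0; i < steps; i++ {
		next := make([]int, 0, 2*len(cur))
		for _, d := range cur {
			next = append(next, (d+1)%10, (d+3)%10)
		}
		cur = next
	}
	return len(cur)
}

// counted tracks only how many copies of each digit exist, which is the kind of
// rethink a Part 2 with a huge step count typically demands.
func counted(start []int, steps int) int {
	counts := map[int]int{}
	for _, d := range start {
		counts[d]++
	}
	for i := 0; i < steps; i++ {
		next := map[int]int{}
		for d, n := range counts {
			next[(d+1)%10] += n
			next[(d+3)%10] += n
		}
		counts = next
	}
	total := 0
	for _, n := range counts {
		total += n
	}
	return total
}

func main() {
	start := []int{1, 2, 3}
	fmt.Println(naive(start, 15))   // 3 * 2^15 digits: easy to simulate directly
	fmt.Println(counted(start, 15)) // same answer from the counting approach
	// counted(start, 60) still finishes instantly, while the naive version would
	// need roughly 3.5 quintillion slice elements.
}
```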
3. Single-Shot Accuracy: A Clear Win for o1
When evaluated on single-shot accuracy, o1 significantly outperforms GPT-4-turbo. In a brief comparison with Claude 3.5 Sonnet, o1 also demonstrated far superior performance, especially from Day 5 onward. Even for earlier tasks, o1 consistently produced more efficient solutions, often 10x faster in execution time than those from Claude 3.5 Sonnet. Claude 3.5 Sonnet, moreover, was unable to solve some problems even within the first 10 days.
Based on my tests, o1 is the best-performing model for single-shot Advent of Code tasks that I have evaluated. While I have conducted limited testing with QwQ, another promising model, it still falls short of o1’s accuracy and code quality.
4. Peak Performance: Room for Improvement
Despite its exceptional single-shot accuracy, o1 does not appear to surpass GPT-4-turbo in peak performance. My 2023 tests revealed that GPT-4-turbo was 10-15% better than GPT-4 and successfully solved several challenges that GPT-4 could not, even after multiple iterative attempts and using different prompt strategies and programming languages (e.g., Python and JavaScript instead of Golang). Overall, GPT-4-turbo was able to solve around 85% of Advent of Code 2023 tasks. o1 struggles with similarly challenging tasks, much like GPT-4 did in my earlier evaluations.
Additionally, o1 struggles in scenarios where it has a simple working solution for Part 1 but is unable to adapt its approach for Part 2 when more iterations or a fundamentally different strategy are required. Even when asked to solve Part 2 without relying on Part 1’s solution, it fails to do so.
This suggests that while o1’s training techniques have yielded more reliable and efficient results, they have not significantly improved its ability to tackle the hardest problems.
5. Conclusions
- Reliability and Efficiency: o1 is substantially more reliable than GPT-4-turbo and Claude 3.5 Sonnet. Its solutions are often more efficient.
- Peak Performance: o1’s ability to tackle the hardest coding problems remains comparable to GPT-4-turbo, showing no major leap in capabilities.
Should you use o1 for coding? Absolutely. It likely codes better than you do.
Can o1 help you tackle problems you previously couldn't solve? Unfortunately, no. While it excels at routine and moderately complex tasks, it does not yet represent a breakthrough in handling highly complex, albeit well-known, problems. There is also no indication that it can handle problems that are both complex and novel.