Benchmarking o1 on Advent of Code 2024
1. Main Aim of the Evaluation
The primary goal of this evaluation is to assess how o1 performs compared to GPT-4-turbo in solving Advent of Code (AoC) challenges, focusing on two key metrics:
- Accuracy: How often the model successfully solves problems on the first attempt.
- Peak Performance: How well the model handles the hardest challenges, including whether it can solve problems after iterative refinement.
For context, my evaluation of GPT-4-turbo in 2023 followed a similar approach. Frequently, I needed to use different prompting strategies, such as Chain of Thought (CoT), or switch the programming language from Golang to Python or JavaScript, to achieve success. The prompt I used for each challenge with o1 was:
Write a Golang program that solves the following coding challenge. The program should read the input from input.txt and print the answer to the console. <CHALLENGE DESCRIPTION>
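For reference, every generated program is expected to follow the same minimal contract: read the puzzle input from input.txt and print the answer to the console. The skeleton below is a hypothetical example of that shape; the placeholder logic is mine, not one of o1's actual solutions.

```go
// Hypothetical skeleton of the kind of program the prompt asks for:
// read puzzle input from input.txt, compute an answer, print it to the console.
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	data, err := os.ReadFile("input.txt")
	if err != nil {
		panic(err)
	}
	lines := strings.Split(strings.TrimSpace(string(data)), "\n")

	// Placeholder "answer": here just the number of input lines.
	// A real solution implements the day's puzzle logic instead.
	answer := len(lines)
	fmt.Println(answer)
}
```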
In this 2024 evaluation, o1 has not been trained on the actual AoC 2024 tasks. However, AoC problems share a high degree of similarity across years, so it is reasonable to assume that the model has been trained on analogous problems.
All evaluations were conducted using Golang, emphasizing both correctness and performance of solutions. All Golang solutions from o1 for 2024 are available in my Hugging Face dataset here. I have also written an article discussing this dataset, which you can read here.
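To give a concrete picture of how correctness and execution time are checked, here is a minimal sketch of a harness that runs a generated solution and records how long it takes. The directory layout (dayNN/solution.go next to dayNN/input.txt) is an assumption for illustration, not necessarily the exact setup behind the dataset.

```go
// Minimal sketch of a harness that runs a generated Golang solution in its own
// directory (where input.txt lives), captures its printed answer, and times it.
package main

import (
	"fmt"
	"os/exec"
	"strings"
	"time"
)

func runSolution(dir string) (answer string, elapsed time.Duration, err error) {
	cmd := exec.Command("go", "run", "solution.go")
	cmd.Dir = dir // the generated solution reads input.txt from its own directory

	start := time.Now()
	out, err := cmd.Output()
	elapsed = time.Since(start)
	if err != nil {
		return "", elapsed, err
	}
	return strings.TrimSpace(string(out)), elapsed, nil
}

func main() {
	// Hypothetical example: run the Day 1 solution and report answer and timing.
	answer, elapsed, err := runSolution("day01")
	if err != nil {
		fmt.Println("run failed:", err)
		return
	}
	fmt.Printf("day01: answer=%s time=%s\n", answer, elapsed)
}
```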
2. Progress Update: Days 1-15
As of writing, I have evaluated o1 on all tasks up to Day 15. Here are the key results:
- Days 1-10 (all parts): Single-shot success.
- Day 11, Part 2: Solved after a few iterations.
- Day 12, Day 14, and Day 15 (Part 2 in each case): The model failed to solve these even with iterative prompting and several fresh attempts, including one in which I asked it to write the solution in Python instead.
This gives o1 a success rate of roughly 90% for tasks up to Day 15, with strong performance on the easier tasks and somewhat weaker results on the more complex ones. The model sometimes struggles to move from a simple solution that works at the small iteration counts of Part 1 to the more involved approach Part 2 requires, and it often has difficulty handling trickier conditions.
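A recurring pattern behind these Part 2 failures is that the straightforward simulation that works at Part 1 scale becomes infeasible once the iteration count explodes, and a counting-based rethink is needed. The toy example below is my own generic illustration of that shift (a made-up rule, not one of the actual 2024 puzzles): it first simulates the full list of elements, then tracks only how many copies of each value exist.

```go
// Toy illustration of the Part 1 -> Part 2 pattern: each step, every digit d is
// replaced by two digits, (d+1)%10 and (d+3)%10, and the answer is how many
// digits exist after a given number of steps. This rule is invented for
// illustration; it is not one of the 2024 puzzles.
package main

import "fmt"

// naive keeps every digit explicitly, as a simple Part 1 solution might:
// the slice doubles in length every step.
func naive(start []int, steps int) int {
	cur := append([]int(nil), start...)
	for i := 0; i < steps; i++ {
		next := make([]int, 0, 2*len(cur))
		for _, d := range cur {
			next = append(next, (d+1)%10, (d+3)%10)
		}
		cur = next
	}
	return len(cur)
}

// counted tracks only how many copies of each digit exist, which is the kind of
// rethink a Part 2 with a huge step count typically demands.
func counted(start []int, steps int) int {
	counts := map[int]int{}
	for _, d := range start {
		counts[d]++
	}
	for i := 0; i < steps; i++ {
		next := map[int]int{}
		for d, n := range counts {
			next[(d+1)%10] += n
			next[(d+3)%10] += n
		}
		counts = next
	}
	total := 0
	for _, n := range counts {
		total += n
	}
	return total
}

func main() {
	start := []int{1, 2, 3}
	fmt.Println(naive(start, 15))   // 3 * 2^15 digits: easy to simulate directly
	fmt.Println(counted(start, 15)) // same answer from the counting approach
	// counted(start, 60) still finishes instantly, while the naive version would
	// need roughly 3.5 quintillion slice elements.
}
```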
3. Single-Shot Accuracy: A Clear Win for o1
When evaluated on single-shot accuracy, o1 significantly outperforms GPT-4-turbo. In a brief comparison with Claude 3.5 Sonnet, o1 also demonstrated far superior performance, especially from Day 5 onward. Even for earlier tasks, o1 consistently produced more efficient solutions, often 10x faster in execution time than those from Claude 3.5 Sonnet. Claude 3.5 Sonnet, moreover, was unable to solve some problems even within the first 10 days.
Based on my tests, o1 is the best-performing model for single-shot Advent of Code tasks that I have evaluated. While I have conducted limited testing with QwQ, another promising model, it still falls short of o1’s accuracy and code quality.
4. Peak Performance: Room for Improvement
Despite its exceptional single-shot accuracy, o1 does not appear to surpass GPT-4-turbo in peak performance. My 2023 tests revealed that GPT-4-turbo was 10-15% better than GPT-4 and successfully solved several challenges that GPT-4 could not, even after multiple iterative attempts and using different prompt strategies and programming languages (e.g., Python and JavaScript instead of Golang). Overall, GPT-4-turbo was able to solve around 85% of Advent of Code 2023 tasks. o1 struggles with similarly challenging tasks, much like GPT-4 did in my earlier evaluations.
Additionally, o1 struggles in scenarios where it has a simple working solution for Part 1 but is unable to adapt its approach for Part 2 when more iterations or a fundamentally different strategy are required. Even when asked to solve Part 2 without relying on Part 1’s solution, it fails to do so.
This suggests that while o1’s training techniques have yielded more reliable and efficient results, they have not significantly improved its ability to tackle the hardest problems.
5. Conclusions
- Reliability and Efficiency: o1 is substantially more reliable than GPT-4-turbo and Claude 3.5 Sonnet. Its solutions are often more efficient.
- Peak Performance: o1’s ability to tackle the hardest coding problems remains comparable to GPT-4-turbo, showing no major leap in capabilities.
Should you use o1 for coding? Absolutely. It likely codes better than you do.
Can o1 help you tackle problems you previously couldn't solve? Unfortunately, no. While it excels at routine and moderately complex tasks, it does not yet represent a breakthrough in handling highly complex, albeit well-known, problems. There is also no indication that it can handle problems that are both complex and novel.