Can ChatGPT beat you on Leetcode?
Sudhendra Kambhamettu
MSAI @ Northeastern University | LLMs with Genomics @ CAMP4 | AI Product Developer
I use ChatGPT for almost everything, as most of us do these days. In one of my recent escapades, my friend Yalala Mohit and I wondered how good ChatGPT really is at solving DSA problems, so we took on the leetcode bros of the world. We wanted to answer the question: are coders truly doomed?
Allow me to set the stage and lay out all the parameters used in making this assessment. Buckle up, this is an interesting one.
TL;DR - No. While ChatGPT is quite good, it still has a way to go before it gets things right every time, even when leveraging prompt engineering techniques. But if you want to know some interesting details, keep on reading!
The Problem Statement
An investigation of ChatGPT's performance on standardized data structures and algorithms (DSA) problems, and of the impact of prompt engineering on its effectiveness.
Having said that, I did not build an exhaustive list of prompts, but rather a set of 9 generic prompts (detailed below) passed as context to the model (refer to code block 3); see the illustrative sketch right below for a flavor of what such a set looks like.
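The exact prompts are in the project repo; purely as a hypothetical illustration (these are not the actual prompts used), a set like this might look as follows:

# Hypothetical illustration only -- the actual nine prompts live in the repo.
# Each prompt is a plain string passed as the "system" message (see code block 3).
PROMPTS = [
    "You are an expert Python programmer. Complete the Solution class for the "
    "following leetcode problem. Return only valid Python code, no explanations.",
    "Think through the problem step by step in comments, then implement the "
    "most efficient solution you can. Return only the completed Solution class.",
    # ... seven more variants along these lines
]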
Data & Parameters
NOTE: Only 569 questions were sampled, since the ChatGPT API allowed us only about 569 requests, on average, for a single prompt (context).
Data Collection
All the leetcode questions were fetched along with their question description, examples, constraints, code stubs, and test cases. Refer to the image below to understand what each of these means.
The Leetscrape Python package was used to fetch all of this information programmatically.
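As a rough sketch of how that fetching looks (the class names below follow Leetscrape's documented API, but treat the exact signatures as assumptions, since they can differ between versions):

# A minimal sketch of fetching questions with Leetscrape.
# Exact class names and signatures are assumptions -- check the package docs.
from leetscrape import GenerateCodeStub, GetQuestion, GetQuestionsList

# Scrape the list of all leetcode questions.
questions = GetQuestionsList()
questions.scrape()

# Fetch one question's description, examples, constraints, and code stub.
question = GetQuestion(titleSlug="two-sum").scrape()

# Generate the solution stub (code block 1) and its pytest module (code block 2).
GenerateCodeStub(titleSlug="two-sum").generate(directory="questions/")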
Each question and its related parts are then combined with the code stub to produce a final .py file that looks as follows:
from typing import List


class Solution:
    """Given an array of integers `nums` and an integer `target`, return *indices of the two numbers such that they add up to `target`*.

    You may assume that each input would have ***exactly* one solution**, and you may not use the *same* element twice.

    You can return the answer in any order.

    **Example 1:**

    ```
    Input: nums = [2,7,11,15], target = 9
    Output: [0,1]
    Explanation: Because nums[0] + nums[1] == 9, we return [0, 1].
    ```

    **Example 2:**

    ```
    Input: nums = [3,2,4], target = 6
    Output: [1,2]
    ```

    **Example 3:**

    ```
    Input: nums = [3,3], target = 6
    Output: [0,1]
    ```

    **Constraints:**

    * `2 <= nums.length <= 10^{4}`
    * `-10^{9} <= nums[i] <= 10^{9}`
    * `-10^{9} <= target <= 10^{9}`
    * **Only one valid answer exists.**

    **Follow-up:** Can you come up with an algorithm that is less than `O(n^{2})` time complexity?"""

    def twoSum(self, nums: List[int], target: int) -> List[int]:
        pass
Code block 1: Shows the format each of the cleaned and re-organized questions follows in the whole dataset.
Finally, each of the fetched test cases was turned into a pytest .py file for easier testing loops. For example:
import pytest
from typing import List

from q_0001_twoSum import Solution


@pytest.mark.parametrize(
    "nums, target, output",
    [([2, 7, 11, 15], 9, [0, 1]), ([3, 2, 4], 6, [1, 2]), ([3, 3], 6, [0, 1])],
)
class TestSolution:
    def test_twoSum(self, nums: List[int], target: int, output: List[int]):
        sc = Solution()
        assert sc.twoSum(nums, target) == output
Code block 2: This block shows the end result of each test case modified into a pytest module.
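If you want to sanity-check a single question locally, you can run its test module straight from Python; the file name below is just illustrative of the naming scheme above:

import pytest

# Exit code 0 means every parametrized case passed, i.e. the solution is Accepted.
exit_code = pytest.main(["-q", "q_0001_twoSum_test.py"])
print("Accepted" if exit_code == 0 else "Not accepted")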
Phew, we’re done with the stuffy data collection. Onto the good parts!!!
ChatGPT Testing
Each question was passed in for every single prompt, i.e., all 569 questions were run against each of the 9 prompts. The responses, after text cleaning, were recorded in corresponding Python files, and the test results were recorded in .csv files for easy visualization and fun data analytics :).
ChatGPT API call -
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


class ResponseError(Exception):
    """Raised to stop the run when the API errors out."""


def gptResponse(sample_question, prompt):
    try:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": prompt},
                {"role": "user", "content": sample_question},
            ],
        )
        return response.choices[0].message.content
    except Exception as e:
        print(f"Error generating response: {e}")
        raise ResponseError(f"Stopping due to error with prompt: {prompt}")
Code block 3: This is the function for making an API call to fetch responses from ChatGPT, with the "system" role carrying the context prompt (one of the set of 9 seen earlier) and the "user" role carrying the actual question.
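Putting the pieces together, the outer evaluation loop looks roughly like the minimal sketch below. The names here are illustrative assumptions (clean_response and the question dict layout are not from the actual code base, which lives in the repo):

import csv

import pytest


def run_experiment(prompts, questions, results_path="results.csv"):
    """For each (prompt, question) pair: fetch a solution from ChatGPT,
    write it next to its pre-generated pytest module, run the tests,
    and log whether the solution was Accepted."""
    with open(results_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["prompt_id", "question", "accepted"])
        for p_id, prompt in enumerate(prompts):
            for q in questions:  # q: {"name", "stub", "solution_file", "test_file"}
                code = gptResponse(q["stub"], prompt)  # code block 3
                with open(q["solution_file"], "w") as out:
                    out.write(clean_response(code))  # hypothetical: strips markdown fences
                # pytest exit code 0 means every parametrized test case passed.
                accepted = pytest.main(["-q", q["test_file"]]) == 0
                writer.writerow([p_id, q["name"], accepted])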
The Results
There are a lot of results and graphs, so let's go through them one at a time before we decide whether we can pass the crown to ChatGPT!!!
All the results were evaluated on the following metrics:
Pass - No. of test cases passed.
Fail - No. of test cases failed.
Error - No. of errors that occurred.
Accepted - Whether a solution passes all test cases without failing and without any errors.
The key metric we use to evaluate ChatGPT's performance is the Acceptance Rate, i.e., the fraction of questions whose solutions were Accepted.
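For reference, here's a minimal sketch of how acceptance rates can be computed from the result CSVs; the file and column names are assumptions, not necessarily what the repo uses:

import pandas as pd

# Hypothetical file/column names; one row per question for a given prompt.
df = pd.read_csv("results_prompt_6.csv")  # columns: question, topic, failed, errors

# Accepted = all test cases passed, with no failures and no errors.
df["accepted"] = (df["failed"] == 0) & (df["errors"] == 0)

# Acceptance rate per topic = fraction of that topic's questions accepted.
print(df.groupby("topic")["accepted"].mean().sort_values(ascending=False))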
Prompt-wise results
For brevity, I will show the results of just the best-performing prompt (prompt no. 6) (see Fig. 6).
Topic-wise performance using the best-performing prompt
This figure is interesting for several reasons. Take a look at the panels titled combinatorics, string-matching, rolling-hash, probability-and-statistics, and suffix-array. See anything interesting?
ChatGPT, with the given prompt 6 as context, solves them with a 100% acceptance rate, indicating that it's very, VERY good at these topics, surpassing humans.
Similarly, take a look at the panels titled line-sweep, linked-list, and interactive. See where I'm going with this? ChatGPT has a 0% acceptance rate on these topics: it seems to fail completely, every time, at every difficulty level.
This implies that ChatGPT fundamentally doesn't work for these categories of problem-solving, which is very interesting to say the least.
Overall Acceptance rate for all topics compared to humans
This analysis compares the average acceptance rate of ChatGPT-produced solutions against the average acceptance rate of humans who submit solutions for these topics on leetcode. This is an especially interesting one, as it builds on what we saw in the earlier figure (Fig. 4), but averaged over all the context prompts.
The figure below reveals the fundamental weaknesses and absolute strengths of ChatGPT 3.5 turbo, opening up new verticals for further research and improvement.
This graph raises a serious question -
Why is ChatGPT fundamentally bad at interactive, line-sweep and linked-list topics & what does that say about its capabilities and architecture?
That’s for all of you readers to ponder and research about.
Analogously, one could also ask -
Why is ChatGPT sweepingly good at combinatorics, probability-and-statistics, and string-matching? Does this say something about its structure and internal mechanics (deep learning is, after all, fundamentally probabilistic)?
Much more research needs to be done before any of these questions can be answered with empirical evidence.
Overall Prompt performance
Alright, the moment of truth is here. Does prompt engineering vary the performance elicited by ChatGPT? The short answer is yes, but there's more to discuss. Let's look at the figure first.
As Fig. 6 shows, we can directly infer two main points.
I wish I had more time for further analysis into what exactly about prompt 6 elicits better performance, and how one could replicate these results on a wider variety of problems beyond algorithmic programming.
Reflections and Conclusion
There are several interesting things left to touch upon in this study, and I leave most of them to your imagination and research rigor. The fundamental takeaway for me from this study, however, is a bunch more questions rather than concrete answers.
However, here’s a brief summary of empirical findings:
I urge everyone who reads this to the end to use it as a starting point, and in the spirit of open source, I am making the code base for this study public on my GitHub.
Some future work I’d like to build on this project -
I would also love it if any of you readers built on it; do shoot me a message when you do!
Alright, the finale we've all been waiting for - can ChatGPT beat you on leetcode?
I think it's safe to say that ChatGPT 3.5 turbo certainly CAN NOT beat you on leetcode.
All of you leetcode grinders out there, you're safe… for now.
Cheers!