Can ChatGPT beat you on Leetcode?
Image generated using DALL·E 3


In one of my recent escapades (I use ChatGPT for almost everything, as almost all of us do), my friend Yalala Mohit and I wondered how good ChatGPT really is at solving DSA, so we took it on against leetcode bros around the world. We wanted to answer the question: are coders truly doomed?

Allow me to set the stage and lay out all the parameters used in making this assessment. Buckle up, this is an interesting one.

TL;DR - No. While ChatGPT is quite good, there is still a way to go before it is right all the time, even when leveraging prompt-engineering techniques. But if you want the interesting details, keep reading!

The Problem Statement

An investigation of the performance of ChatGPT, and of the impact of prompt engineering on its effectiveness, in solving standardized data structures and algorithms (DSA) problems.

Having said that, I did not build an exhaustive list of prompts, but rather a set of 9 generic prompts (detailed below) passed as context to the model (refer to code block 3).

Data & Parameters

  1. Model - ChatGPT 3.5 Turbo was used to minimize the cost of the test and provide a rudimentary benchmark for this study.
  2. Prompt set - the 9 generic prompts passed as context to the model, described above.
  3. Leetcode Questions - 569 questions sampled from leetcode.

NOTE: Only 569 questions were sampled since the ChatGPT API allows only 569 requests on average for a single prompt (context).

Data Collection

All the leetcode questions were fetched with the question description, examples, constraints, code stubs and test cases. Refer to the image below to understand what each of them means.

The Leetscrape Python package was used to fetch all this information programmatically.

Fig. 2: This figure shows the various components of the leetcode question interface fetched in the data collection phase.
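If you want to reproduce the fetching step, here is a rough sketch of how Leetscrape can be driven. The class names and signatures below are my assumption based on the package's documentation and may differ between versions, so treat this as illustrative rather than the exact code used in this study.

# Illustrative only: Leetscrape class names/signatures may vary by version.
from leetscrape import GetQuestion, GenerateCodeStub

# Fetch one question's description, examples, constraints and code stub.
question = GetQuestion(titleSlug="two-sum").scrape()

# Write a solution stub (e.g. q_0001_twoSum.py) and a matching pytest file to disk.
GenerateCodeStub(titleSlug="two-sum").generate(directory="questions/")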

Each question and its related parts are then combined with the code stub into a final .py file, which looks as follows:

from typing import List


class Solution:
    """Given an array of integers `nums`?and an integer `target`, return *indices of the two numbers such that they add up to `target`*.

    You may assume that each input would have ***exactly* one solution**, and you may not use the *same* element twice.

    You can return the answer in any order.



    **Example 1:**

    ```
    Input: nums = [2,7,11,15], target = 9
    Output: [0,1]
    Explanation: Because nums[0] + nums[1] == 9, we return [0, 1].
    ```
    **Example 2:**

    ```
    Input: nums = [3,2,4], target = 6
    Output: [1,2]
    ```
    **Example 3:**

    ```
    Input: nums = [3,3], target = 6
    Output: [0,1]
    ```


    **Constraints:**

    * `2 <= nums.length <= 10^{4}`
    * `-10^{9} <= nums[i] <= 10^{9}`
    * `-10^{9} <= target <= 10^{9}`
    * **Only one valid answer exists.**



    **Follow-up:** Can you come up with an algorithm that is less than `O(n^{2})` time complexity?"""

    def twoSum(self, nums: List[int], target: int) -> List[int]:
        pass        

Code block 1: Shows the format each of the cleaned and re-organized questions follow in the whole dataset.

Finally, each of the fetched test cases was turned into a pytest .py file for easier testing loops. For example:

import pytest
from typing import List

from q_0001_twoSum import Solution


@pytest.mark.parametrize(
    "nums, target, output",
    [([2, 7, 11, 15], 9, [0, 1]), ([3, 2, 4], 6, [1, 2]), ([3, 3], 6, [0, 1])],
)
class TestSolution:
    def test_twoSum(self, nums: List[int], target: int, output: List[int]):
        sc = Solution()
        assert (
            sc.twoSum(
                nums,
                target,
            )
            == output
        )        

Code block 2: This block shows the end result of each test case modified into a pytest module.

Phew, we’re done with the stuffy data collection. Onto the good parts!!!

ChatGPT Testing

Each question is passed for every single prompt, i.e., all 569 questions were run against each of the 9 prompts. The corresponding responses, after text cleaning, were recorded in Python files, and the test results were recorded in .csv files for easy visualization & fun data analytics :).

ChatGPT API call -

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


class ResponseError(Exception):
    """Raised to stop the run when a prompt/question pair errors out."""


def gptResponse(sample_question, prompt):
    try:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                # The context prompt (one of the 9) is the system message,
                # the actual leetcode question is the user message.
                {"role": "system", "content": prompt},
                {"role": "user", "content": sample_question}
            ]
        )
        return response.choices[0].message.content
    except Exception as e:
        print(f"Error generating response: {e}")
        raise ResponseError(f"Stopping due to error with prompt: {prompt}")
        

Code block 3: This is the function for making an API call to fetch responses from ChatGPT with the role of “system”: prompt being the context (the set of 9 seen earlier) and the role of “user” prompt being the actual question.
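To tie code blocks 1-3 together, here is a minimal sketch of the outer testing loop described above. The `clean_response` helper and the directory layout are hypothetical stand-ins, not the exact code used in the study.

import csv
import subprocess
from pathlib import Path


def run_experiment(prompts: list[str], questions: dict[str, str]) -> None:
    """Run every (prompt, question) pair and record pytest outcomes in a CSV."""
    with open("results.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["prompt_id", "question", "accepted"])
        for p_idx, prompt in enumerate(prompts):
            out_dir = Path(f"solutions/prompt_{p_idx}")
            out_dir.mkdir(parents=True, exist_ok=True)
            for slug, question_text in questions.items():
                # Code block 3: system = context prompt, user = question text.
                raw = gptResponse(question_text, prompt)
                code = clean_response(raw)  # hypothetical: strips markdown fences etc.
                (out_dir / f"q_{slug}.py").write_text(code)
                # Run the pre-generated pytest module (code block 2);
                # exit code 0 means every test case passed, i.e. "Accepted".
                result = subprocess.run(
                    ["pytest", f"tests/test_q_{slug}.py"],
                    capture_output=True,
                )
                writer.writerow([p_idx, slug, result.returncode == 0])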

The Results

There are a lot of results and graphs, so let's go through them one at a time before we decide whether we can pass the crown to ChatGPT!!!

All the results were evaluated on the following metrics:

Pass - No. of test cases passed.

Fail - No. of test cases failed.

Error - No. of test cases that raised an error.

Accepted - Whether a solution passes all test cases with no failures and no errors.

The key metric we use to evaluate the performance of ChatGPT is the Acceptance Rate.
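As a concrete illustration, here is how the acceptance rate could be computed from the results CSVs with pandas. The file name and column names are illustrative assumptions, not the exact schema used in the study.

import pandas as pd

# Assumed columns (illustrative): question, difficulty, topic, passed, failed, errors
df = pd.read_csv("results_prompt_6.csv")

# Accepted = every test case passed, with no failures and no errors.
df["accepted"] = (df["failed"] == 0) & (df["errors"] == 0)

# Overall acceptance rate, plus the per-difficulty breakdown behind Fig. 3.
print(f"Acceptance rate: {df['accepted'].mean():.1%}")
print(df.groupby("difficulty")["accepted"].mean())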

Prompt wise results

For brevity, I will show the results of just the best-performing prompt (prompt no. 6) (see Fig. 6).

Fig. 3: This figure shows the distribution of acceptance rate, non-acceptance rate and error rate of the easy, medium and hard categories of questions as performed by ChatGPT 3.5 turbo.

Topic wise performance using the best performing prompt

Fig. 4: This figure shows the acceptance vs non-acceptance rate for the solutions generated by ChatGPT 3.5 turbo categorized by the topic name.

This figure is interesting for several reasons. Take a look at the figures titled combinatorics, string matching, rolling hash, probability-and-statistics, suffix-array. See anything interesting?

ChatGPT, with the given prompt 6 as context, solves them with a 100% acceptance rate, indicating that it is very, VERY good at these topics, surpassing humans.

Similarly, take a look at the figures titled line-sweep, linked-list and interactive. See where I'm going with this? ChatGPT has a 0% acceptance rate on these topics; it seems to fail completely, every time, at every difficulty level.

This implies that ChatGPT fundamentally doesn't work for these categories of problem-solving, which is very interesting to say the least.

Overall Acceptance rate for all topics compared to humans

This analysis compares the average acceptance rate of ChatGPT-produced solutions against the average acceptance rate of humans who submit solutions for these topics on leetcode. It is an especially interesting one, as it builds on what we saw in the earlier figure (Fig. 4) but is averaged over all the context prompts.

The figure below reveals the fundamental weaknesses and absolute strengths of ChatGPT 3.5 turbo, opening up new verticals for further research and improvement.

Fig. 5: Topic wise comparison of average acceptance rate of ChatGPT and humans.

This graph seriously raises the question -

Why is ChatGPT fundamentally bad at interactive, line-sweep and linked-list topics & what does that say about its capabilities and architecture?

That’s for all of you readers to ponder and research about.

Analogously one could also ask the question -

Why is ChatGPT sweepingly good at combinatorics, probability-and-statistics and string-matching? Does this say something about its structure and internal mechanics (as DL fundamentally plays with probability etc.)?

More and more research still needs to be done to answer any of these questions with empirical evidence.

Overall Prompt performance

Alright, the moment of truth is here. Does prompt engineering vary the performance elicited from ChatGPT? The short answer is yes, but there's more to discuss. Let's look at the figure first.

Fig. 6: Total number of solutions that are accepted per prompt, as a direct result of the prompt (context) alone.

As Fig. 6 shows, we can directly infer two main points.

  1. Prompt engineering definitely elicits a wide variety of performance from ChatGPT on DSA problem solving.
  2. Prompt 6 is clearly the best prompt to use, and prompt 2 the worst.

I wish I had more time to investigate what exactly about prompt 6 elicits better performance, and how one could replicate these results on a wider variety of problems beyond algorithmic programming.

Reflections and Conclusion

There are several interesting things in this study yet to touch upon, and I leave most of them to your imagination and research rigor. The fundamental takeaway for me from this study, however, is a bunch more questions rather than concrete answers.

However, here’s a brief summary of empirical findings:

  1. Prompt engineering works (to a certain degree in eliciting various levels of performance).
  2. ChatGPT 3.5 turbo is exceptionally good at solving problems that require strong mathematical reasoning and advanced pattern recognition.
  3. It is unreasonably bad at problems that demand a deep understanding of complex data structures and the manipulation of spatial relationships.

I urge everyone who reads this to the end to use it as a starting point, and in the spirit of open source I am making the code base for this study public on my GitHub.

Some future work I’d like to build on this project -

  1. Evaluation of a diverse set of context prompts (~50) to get a comprehensive analysis of the effect of prompt engineering on ChatGPT.
  2. Testing multiple LLMs like Llama 3, Claude, GPT-4 etc. to assess the problem-solving capabilities of these language models in a standardized coding setting.
  3. Identifying the specific categories of errors to assess the failure points of these models, which could help improve them.

I would also love it if any of you readers built on it, and do shoot me a message when you do!

Alright, for the finale we've all been waiting for - can ChatGPT beat you on leetcode?

I think it's safe to say that ChatGPT 3.5 turbo certainly CAN NOT beat you on leetcode.

All of you leetcode grinders out there, you're safe… for now.

Cheers!






