I tried using ChatGPT (GPT-4) to grade coding assignments and this is how it went

GPT-4 was released just a few days ago, and like many others, I watched the demo and was very excited about its new capabilities and potential applications.

Then, I had the opportunity to experiment with grading coding assignments using ChatGPT, powered by GPT-4. I was hyped about the idea and had high expectations that it could help me save some time.

Spoiler alert: I decided to ditch it after a few attempts. Goodbye high-tech, hello analog.

The process

  • I signed up for a paid ChatGPT account, which gave me access to GPT-4.
  • I created a text-only version of the grading rubric, which included eight components summing to 100 points in total. To check whether ChatGPT could follow the rubric format, I first tested with a smaller subset of components and then expanded it to include all of them.
  • Then I wrote the initial prompt, which asked it to grade the HTML code against the rubric, and pasted the code at the end of the prompt. I wanted to see the thought process, so I asked it to assign a score to each section and briefly explain why, and then to sum up the total score. (A rough sketch of this prompt structure appears after this list.)
  • I spent 4+ hours and made 20+ revisions/refinements of the prompt to improve the results. My approaches to iterating on the prompts ranged from asking it to redo the process, to directly calling out the error, to asking it to act as an adversary and identify errors in its previous response.
  • I manually inspected the answers to see if ChatGPT's scoring was correct.
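For anyone curious about the general shape of the prompt, here is a minimal sketch in Python of how a rubric and the student code could be assembled into a single grading prompt. The component names and point values below are made up for illustration (the actual rubric had eight components totaling 100 points), and in practice the prompt was pasted into the ChatGPT interface rather than generated by a script; the snippet only illustrates the structure.

```python
# A sketch of the prompt structure described above.
# The rubric components and point values are hypothetical placeholders;
# the actual rubric had eight components summing to 100 points.
rubric = {
    "Document structure (doctype, head, body)": 15,
    "Headings and paragraphs": 10,
    "Images": 20,
    "Links": 10,
    "Lists": 10,
    "Tables": 10,
    "Semantic tags": 10,
    "Formatting and indentation": 15,
}

def build_grading_prompt(rubric: dict, html_code: str) -> str:
    """Assemble the grading prompt: rubric first, then instructions, then the student code."""
    rubric_lines = "\n".join(f"- {name}: {points} points" for name, points in rubric.items())
    return (
        "Grade the following HTML assignment using the rubric below.\n\n"
        f"Rubric (maximum total: {sum(rubric.values())} points):\n{rubric_lines}\n\n"
        "Assign a score to each rubric component with a brief explanation of why, "
        "then report the total score out of the maximum possible score.\n\n"
        f"Student code:\n{html_code}"
    )

print(build_grading_prompt(rubric, "<html>...</html>"))
```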

The results? Well, here are a few things that I learned…

Scoring was inconsistent even with the same prompt

I ran the same prompt with the same code several times to see if the scoring was consistent. As it turned out, the scores seemed to change every time! So I confronted it:

Screenshot of ChatGPT interface. Prompt is asking why it scored higher for the same code, and why it gave an incorrect score. ChatGPT makes an apology and updates the answer.

ChatGPT apologized and redid the scoring, basically swapping the incorrect answer for my “feedback,” but it failed to answer WHY it made the mistake, which is what I wanted to get at.

It takes in your feedback…too well

In one of the assignments, the image was inserted in the code but was not rendering correctly, which should have been reflected in the scoring. ChatGPT initially gave a perfect score for the image portion (20/20), so I wanted to probe further:

Screenshot of ChatGPT interface. Prompt asks to double check the scoring of the image and update the score as needed.

Without any questioning, ChatGPT simply accepted my “feedback” and changed its scoring, but it came up with a strange reason (“the space in the URL”) that wasn’t the actual issue.

I challenged its answer by asking it to indicate exactly where the problem it mentioned was located:

Screenshot of ChatGPT interface

It simply took back the reason it had proposed and reverted to the score it had assigned in the beginning, which, by the way, was still incorrect. No logical explanation, no standing its ground based on its own observations, just wishy-washy answers shaped by my comments. At this point, I had lost some faith in ChatGPT’s ability to be a competent collaborator on this task.

It can give you feedback on how to ask better questions

Then I took a more reflective approach: I asked WHY it made the mistake and asked it to shed some light on how I could revise the prompts to reduce errors:

Screenshot of ChatGPT's response explaining the mistake and suggesting how to revise the prompt

Its apologies were now starting to sound like a broken record to me, which made it seem (even) less human. Putting that aside, the feedback it gave regarding the prompt was quite insightful. Based on that feedback, I updated my prompt to include the sentence “Make sure to double-check each section of the rubric against the code and assign a score for each component.”

And everything seemed to get better…

…except it didn’t. While it sounded good in theory, the results did not improve dramatically. It continued to produce errors that didn’t seem to have an obvious pattern, even with the exact same prompt.

(Could I have been able to see a pattern if I observed it long enough? Maybe, but most people don’t have the time and patience.)

It’s surprisingly bad at basic math

Another unexpected challenge was summing up partial scores to get the final score.

The rubric contained eight sections with partial scores that summed to 100 points in total. Yet, even though the prompt asked for “the total score out of the total maximum possible score,” it somehow decided to exclude an entire section for no apparent reason…

Screenshot of ChatGPT giving the total score of 80 out of 90, excluding an entire section but not explaining why

So, I updated the prompt to explicitly state that the total maximum possible score is 100, which seemed to do the trick…

…at least for a bit.

A new problem emerged: while the equation was set up correctly, the sum was blatantly wrong (the total should add up to 78, not 68; it almost fooled me). I asked it to double-check its answer:

Screenshot of ChatGPT

It redid the summation and still gave the same wrong answer (with no variation in the wording, by the way).

I challenged its answer rather bluntly:

Screenshot of ChatGPT. Prompt says, "but your math is wrong"

Same apology, same mistake…ChatGPT/GPT-4, you are disappointing me.

Screenshot of ChatGPT. Prompt says, "The total is 78!"

What use is good ole ChatGPT if I need to correct its errors every single time?

As a user, I wanted it to help me save time…but after 4+ hours of wrestling with the prompts and correcting inconsistent errors, it was clear that I would be spending more time per assignment, since I would have to compare ChatGPT’s scoring against my own for every single case. At this point, I gave up trying to use ChatGPT for this task.
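As an aside, the arithmetic step, at least, is trivial to take out of the model’s hands. Below is a minimal sketch of an independent check on the reported total; the per-section scores are hypothetical placeholders that simply echo the 78-versus-68 situation above.

```python
# Independent sanity check on a model-reported total.
# The per-section scores are hypothetical placeholders; they sum to 78,
# mirroring the case above where ChatGPT reported 68 instead.
section_scores = {
    "Section 1": 10,
    "Section 2": 8,
    "Section 3": 10,
    "Section 4": 12,
    "Section 5": 10,
    "Section 6": 8,
    "Section 7": 10,
    "Section 8": 10,
}
reported_total = 68  # the total ChatGPT claimed in its response

actual_total = sum(section_scores.values())
if actual_total != reported_total:
    print(f"Mismatch: model reported {reported_total}, but the sections sum to {actual_total}")
```

Of course, a check like this only catches addition errors, not a wrong per-section score, so it would not have removed the need to review the grading itself.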

Is GPT-4 better than GPT-3.5?

Yes, it is (although noticeably slower). I tried the same prompts with GPT-3.5, which is the current default for ChatGPT. The error rate was much higher, hallucinations were much more frequent (wordy but incorrect explanations), and the responses were much less detailed than GPT-4’s.

Also, GPT-4 was able to identify a small mistake in the code that I had failed to catch (e.g. omission of a tag).

One visible difference between the answers from GPT-3.5 and GPT-4 was that the latter was a lot more polite (almost too much so) and apologized far too often. Could this be something to improve upon?

Last screenshot of ChatGPT. Prompt says, "Thanks ChatGPT, for helping me out"

What do you think? Could I have approached this differently to improve the effectiveness of ChatGPT? Would love to hear any thoughts.


Disclaimer: This post is based on personal experience around a specific use case, and it is by no means attempting to make any scholarly or technical claims on the performance of ChatGPT.

Comments

Seth Haberman, CEO at Sense Education (1 month ago)

It would be interesting to see how much it’s improved. For a number of things, not too much.

Saima Tariq Khan, Futurist IT enthusiast with a passion for AI, NLP, and QC innovations (1 year ago)

I came across your article whilst searching for how others have fared in this area. Thanks for sharing! I’ll put together my thoughts too.

Christian Kaas, Co-Founder @ Vectice | Putting Model Documentation on Steroids (1 year ago)

Very interesting read. Fun exercise! Surprising that the model remains bad at math, given it’s one of the earliest abilities to emerge in complex LLMs. But also not surprising - the core of the model is still statistical token prediction. Without a “source of truth” for mathematics, I doubt LLMs will rival calculators any time soon.
