I tried using ChatGPT (GPT-4) to grade coding assignments and this is how it went
GPT-4 was released just a few days ago, and like many others, I watched the demo and was very excited about its new potential capabilities and applications.
Then, I had the opportunity to experiment with grading coding assignments using ChatGPT, powered by GPT-4. I was hyped about the idea and had high expectations that it could help me save some time.
Spoiler alert: I decided to ditch it after a few attempts. Goodbye high-tech, hello analog.
The process
The results? Well, here are a few things that I learned…
Scoring was inconsistent even with the same prompt
I ran the same prompt with the same code several times to see if the scoring was consistent. In fact, the scores seemed to change every time! So I confronted it:
ChatGPT apologized and redid the scoring, basically swapping the incorrect answer for my “feedback,” but it failed to answer WHY it made the mistake, which is what I wanted to get at.
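For anyone who wants to run this kind of consistency check more systematically than by pasting into ChatGPT, a rough sketch against the API could look like the following (this assumes the pre-1.0 openai Python package; the prompt below is a placeholder, not my actual rubric or a student’s code):

```python
# Re-run the same grading prompt several times and compare the extracted totals.
# Assumes: `pip install openai` (pre-1.0 client) and OPENAI_API_KEY set in the environment.
import os
import re
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

# Placeholder prompt -- the real one contained the full rubric and the student's code.
GRADING_PROMPT = """You are grading a student coding assignment.
Rubric: <rubric text here>
Student code: <student code here>
Give a score for each rubric section and a total score out of 100."""

totals = []
for run in range(5):
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": GRADING_PROMPT}],
        temperature=0,  # even at temperature 0, responses are not guaranteed to be identical
    )
    text = response["choices"][0]["message"]["content"]
    match = re.search(r"(\d+)\s*/\s*100", text)  # naive extraction of the reported total
    totals.append(int(match.group(1)) if match else None)

print("Totals across runs:", totals)  # differing numbers across runs would reflect the same inconsistency
```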
It takes in your feedback…too well
In one of the assignments, an image was inserted in the code but was not rendering correctly, which should have been reflected in the scoring. ChatGPT initially gave the image portion a perfect score (20/20), so I wanted to probe further:
Without any questioning, ChatGPT simply accepted my “feedback” and changed its scoring, but it came up with a strange reason (“the space in the URL”), which wasn’t the actual issue.
I challenged its answer by asking it to indicate exactly where the problem it mentioned was located:
It simply retracted the reason it had proposed and reverted to the score it had assigned in the beginning, which, by the way, was still incorrect. No logical explanation, no standing its ground based on its own observations…just being wishy-washy based on my comments. At this point, I had lost some faith in ChatGPT’s ability to be a competent collaborator on this task.
It can give you feedback on how to ask better questions
Then I took a more reflective approach: I asked WHY it made the mistake and asked it to shed some light on how I could revise the prompts to reduce errors:
Its apologies now started sounding like a broken record to me and made it seem (even) less human. Putting that aside, the feedback it gave regarding the prompt was quite insightful. Based on that feedback, I updated my prompt to include the sentence “Make sure to double-check each section of the rubric against the code and assign a score for each component.”
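For the curious, here is roughly the shape my prompt had taken at this point. This is only a sketch with placeholder variables for the rubric and the submission; the only sentence taken verbatim from my actual prompt, besides the total-score request mentioned later, is the double-check instruction quoted above:

```python
# Sketch of a grading-prompt builder. `rubric_text` and `student_code` are placeholders.
def build_grading_prompt(rubric_text: str, student_code: str) -> str:
    return (
        "You are grading a student coding assignment against the rubric below.\n\n"
        f"Rubric:\n{rubric_text}\n\n"
        f"Student code:\n{student_code}\n\n"
        "Report a score for each rubric section and the total score out of the "
        "total maximum possible score.\n"
        # The sentence added after ChatGPT's own feedback on my prompt:
        "Make sure to double-check each section of the rubric against the code "
        "and assign a score for each component."
    )
```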
And everything seemed to get better…
…except it didn’t. While it sounded good in theory, the results did not improve dramatically. It continued to produce errors that didn’t seem to have an obvious pattern, even with the exact same prompt.
(Could I have spotted a pattern if I had observed it for long enough? Maybe, but most people don’t have the time or patience.)
It’s surprisingly bad at basic math
Another unexpected challenge was summing up partial scores to get the final score.
The rubric contained 8 sections whose partial scores added up to 100 points in total. Yet, even though the prompt asked for “the total score out of the total maximum possible score,” it somehow decided to exclude an entire section for no reason…
So, I updated the prompt to explicitly state that the total maximum possible score is 100, which seemed to do the trick…
…at least for a bit.
A new problem emerged: while the equation was correct, the sum was blatantly wrong (the total should add up to 78, not 68; it almost fooled me). I made it double-check its answer:
It redid the summation and still gave the same wrong answer (with no variation in the wording, by the way).
I challenged its answer rather bluntly:
Same apology, same mistake…ChatGPT/GPT-4, you are disappointing me.
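In hindsight, this particular failure is easy to guard against outside the model: ask for per-section scores and add them up yourself instead of trusting the reported total. A tiny sketch of that check, with made-up section names and scores (chosen to add up to 78, mirroring the mismatch above):

```python
# Recompute the total from per-section scores instead of trusting the model's arithmetic.
# The section names and values below are invented for illustration only.
section_scores = {
    "structure": 10,
    "images": 15,
    "links": 8,
    "styling": 12,
    "content": 14,
    "accessibility": 7,
    "code quality": 6,
    "submission": 6,
}

reported_total = 68  # the (wrong) total the model kept insisting on
computed_total = sum(section_scores.values())  # 78

if computed_total != reported_total:
    print(f"Mismatch: model reported {reported_total}, sections add up to {computed_total}")
```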
As a user, I wanted it to help me save time... but after 4+ hours of wrestling with the prompts and correcting inconsistent errors, it was clear that I would be spending more time per assignment, since I would have to compare ChatGPT’s scoring against my own for every single case. At this point, I gave up trying to use ChatGPT for this task.
Is GPT-4 better than GPT-3.5?
Yes, it is (although it is slower). I tried the same prompts with GPT-3.5, which is the current default for ChatGPT. The error rate was way higher, hallucinations were much more frequent (wordy but incorrect explanations), and the responses were much less detailed than GPT-4’s.
Also, GPT-4 was able to identify a small mistake in the code that I had failed to catch (e.g. omission of a tag).
One of the things that was visibly different between the answers given by GPT-3.5 and GPT-4 was that the latter was a lot more polite (almost too much so) and apologized way too much. Could this be something to be improved upon?
What do you think? Could I have approached this differently to improve the effectiveness of ChatGPT? Would love to hear any thoughts.
Disclaimer: This post is based on personal experience around a specific use case, and it is by no means attempting to make any scholarly or technical claims on the performance of ChatGPT.