GPTs "reasoning" is complex, but may lead to blunders.
OpenAI's GPT models, known to most through ChatGPT, are incredible tools, and so are other LLMs. They are already having a staggering effect on the tech world and will be even more impactful in the very near future. That said, it's not magic. The algorithms behind these tools, to take a page out of Arthur C. Clarke's book, may rival magic in their complexity, but they still have design trade-offs and limitations. The ins and outs of LLMs are beyond the scope of this article, though I suggest you check out this visualization. What I am going to present here are some examples of those limitations!
Let's start off with something simple! ChatGPT (my tool of choice) is notoriously bad at TicTacToe. Here is a recent game - I let it go first:
Well, that was easy. The way a TicTacToe board gets tokenized simply doesn't allow a GPT to be competitive. To clarify - it's relatively easy to write an AI that is unbeatable at TicTacToe; it's just that LLMs are not really the tools for that job.
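To back that claim up, here is a minimal sketch of such an unbeatable player using plain, unoptimised minimax. The board representation and function names are my own choices for illustration, not anything ChatGPT or any particular library uses.

```python
# A minimal sketch of an unbeatable TicTacToe player via plain minimax.
# The board is a list of 9 cells containing "X", "O" or " " (empty).

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(board):
    """Return "X" or "O" if someone has three in a line, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def minimax(board, player):
    """Return (score, best_move) from X's point of view: +1 X wins, -1 O wins, 0 draw."""
    win = winner(board)
    if win == "X":
        return 1, None
    if win == "O":
        return -1, None
    moves = [i for i, cell in enumerate(board) if cell == " "]
    if not moves:
        return 0, None                      # board full: draw
    best_move = None
    best_score = -2 if player == "X" else 2  # X maximises, O minimises
    for move in moves:
        board[move] = player
        score, _ = minimax(board, "O" if player == "X" else "X")
        board[move] = " "                    # undo the trial move
        if (player == "X" and score > best_score) or (player == "O" and score < best_score):
            best_score, best_move = score, move
    return best_score, best_move

# Usage: ask for X's best reply on an empty board. With perfect play every
# opening leads to a draw, so the first empty square (index 0) is returned.
board = [" "] * 9
score, move = minimax(board, "X")
print("Best opening move for X:", move)
```

A few dozen lines of brute-force search is enough to never lose at TicTacToe, which is exactly the kind of exhaustive, rule-bound reasoning an LLM does not perform when it predicts the next token.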
Another well-known issue is that image-generating models are notoriously bad at creating text:
Again, it's an immensely capable system, but simply due to the nature of the training data and the way images are generated, it's no good at rendering text, not even the lyrics of Jingle Bells (something it would have had plenty of examples to train on!).
But what about lateral thinking? "How many words would your next answer contain?" A hard task, it seems at first, as you would need to tailor your sentence to an ever-increasing word count... unless you answer "One". That way the answer actually addresses the question and contains exactly the number of words it claims.
To be fair, when you clarify the question, it will tell you "One". Pushing the interpretation and reasoning tests further, I tried a prompt about stacking objects from Sparks of Artificial General Intelligence: Early experiments with GPT-4: "Here we have a book, 9 eggs, a laptop, a bottle and a nail. Please tell me how to stack them onto each other in a stable manner." The reply was puzzling (and for once, no pun intended!).
While this puzzle relies on common sense, and because of that breaks some of the rules of good prompt engineering, it is indicative of what may happen if you treat prompting like a conversation with another human. LLMs don't have an intrinsic "common sense", only one derived from training data, and your request may not match up to that. No sane human would even try to stack the eggs, and certainly not as a last step, after the nail has been placed. Well, there is an even clearer example - the Wason Selection test.
The Wason Selection test, aka the Four Card problem, is a test of conditional reasoning where both a direct check and an "opposite" (contrapositive) check are needed to reach the correct conclusion. This Medium Post provides a good example.
Seven cards are placed on the table, each of which has a number on one side and a single colored patch on the other side. The faces of the cards show 50, 16, red, yellow, 23, green, 30. Which cards would you have to turn to test the truth of the proposition that if a card is showing a multiple of 4 then the color of the opposite side is yellow?
The trick is to check the card showing a multiple of 4 (16), but also the cards showing a color that is not yellow (red and green), since a multiple of 4 hiding behind either of those would falsify the hypothesis.
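To spell the logic out, here is a small sketch of the selection rule, with the card list hard-coded from the puzzle above (the function name is my own, purely for illustration): flip every visible multiple of 4, and flip every visible color other than yellow.

```python
# Which cards must be flipped to test "if a card shows a multiple of 4,
# then its other side is yellow"?
#  - visible multiples of 4: their hidden side must be checked for yellow
#  - visible non-yellow colors: a hidden multiple of 4 would break the rule
# Yellow cards and non-multiples of 4 cannot falsify the rule, so skip them.

cards = [50, 16, "red", "yellow", 23, "green", 30]

def must_flip(card):
    if isinstance(card, int):
        return card % 4 == 0        # antecedent: verify the hidden color
    return card != "yellow"         # contrapositive: hidden number might be a multiple of 4

print([card for card in cards if must_flip(card)])   # -> [16, 'red', 'green']
```

The "opposite" step - flipping the non-yellow cards - is exactly the part that both humans and ChatGPT tend to skip.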
When reminded that the red card might be a multiple of 4, ChatGPT conceded the point.
Lastly, let's see how ChatGPT deals with humor. I gave it this as an image input:
I don't think I need to explain the modification to the classic Train Cart Dilemma (better known as the Trolley Problem). Some might find this darkly humorous, some distasteful, and nihilists may find it philosophical. ChatGPT, however, saw it as the original dilemma.
Obviously, a looping track would simply make this dilemma an exercise in futility. This time, however, ChatGPT still did not understand what a looping track would mean for this picture:
Well, there goes my idea of a GPT Health and Safety advisor!
GPTs are great and I encourage everyone to use them, but please don't treat them as a "black box" and just assume that everything that comes out of them is perfect!