GPTs "reasoning" is complex, but may lead to blunders.

GPTs "reasoning" is complex, but may lead to blunders.

OpenAI's GPT models, known to most through ChatGPT, are incredible tools, as are other LLMs. They are already having a staggering effect on the tech world and will be even more impactful in the very near future. That said, they are not magic. The algorithms behind these tools, to take a page out of Arthur C. Clarke's book, may rival magic in their complexity, but they still come with design trade-offs and limitations. The ins and outs of LLMs are beyond the scope of this article, though I suggest you check out this visualization. What I am going to present here are some examples of those limitations!

Let's start off with something simple! ChatGPT (my tool of choice) is notoriously bad at TicTacToe. Here is a recent game - I let it go first:

ChatGPT - X, Fjodor - O. It didn't even try.

Well, that was easy. The way a GPT tokenizes and predicts text simply doesn't allow it to be competitive at a game like this. To clarify: it's relatively easy to write an AI that is unbeatable at TicTacToe; it's just that LLMs are not the right tools for the job.
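
For context, here is a minimal Python sketch of the kind of exhaustive minimax search that plays TicTacToe perfectly. The board representation and function names are purely illustrative, not taken from any particular library:

```python
# Minimal minimax TicTacToe solver - a sketch, not production code.
# The board is a list of 9 cells holding "X", "O" or " ".

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def minimax(board, player):
    """Return (score, best_move) from `player`'s perspective: +1 win, 0 draw, -1 loss."""
    w = winner(board)
    if w is not None:
        return (1 if w == player else -1), None
    moves = [i for i, cell in enumerate(board) if cell == " "]
    if not moves:
        return 0, None  # board full: draw
    opponent = "O" if player == "X" else "X"
    best_score, best_move = -2, None
    for m in moves:
        board[m] = player
        score, _ = minimax(board, opponent)
        board[m] = " "
        score = -score  # what is good for the opponent is bad for us
        if score > best_score:
            best_score, best_move = score, m
    return best_score, best_move

if __name__ == "__main__":
    print(minimax([" "] * 9, "X"))  # (0, 0): perfect play from an empty board is a draw
```

From an empty board the search returns a score of 0, meaning perfect play can always force at least a draw - which is exactly the bar an LLM playing via next-token prediction fails to clear.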

Another well-known issue is that image-generating AI models are notoriously bad at creating text:

This is almost good! Almost.

Again, it's an immensely capable system, but simply due to the nature of the training data and how the images are generated, it's no good at rendering text, not even Jingle Bells (something it would have had plenty of examples to train on!).

But what about lateral thinking? "How many words would your next answer contain?" A hard task, it seems at first, as you would need to tailor your sentence to an ever-increasing word count... unless you simply say "One". That way the answer actually addresses the question and contains exactly the number of words it names.
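
To make the trick concrete, here is a toy Python check of whether an answer is self-consistent, i.e. whether the number word it contains matches its own word count (the tiny word-to-number table and function name are purely illustrative):

```python
# Toy self-consistency check for "how many words will your next answer contain?"
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6}

def is_self_consistent(answer: str) -> bool:
    words = [w.strip('.,!?"').lower() for w in answer.split()]
    named = [NUMBER_WORDS[w] for w in words if w in NUMBER_WORDS]
    # Consistent if the answer names exactly one number and that number
    # equals the answer's own word count.
    return len(named) == 1 and named[0] == len(words)

print(is_self_consistent("One"))                             # True
print(is_self_consistent("My answer contains five words."))  # True
print(is_self_consistent("This answer has three words."))    # False: it has five
```

The point is not the code, of course, but that "One" is a perfectly valid fixed point that the model fails to spot.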

ChatGPT aims to please. But isn't really a lateral thinker.

To be fair, when you clarify the question, it will tell you "One". Pushing the interpretation and reasoning tests further, I tried a prompt from Sparks of Artificial General Intelligence: Early experiments with GPT-4 about stacking objects: "Here we have a book, 9 eggs, a laptop, a bottle and a nail. Please tell me how to stack them onto each other in a stable manner." The reply was puzzling (and for once, no pun intended!).

A truly unique approach!

While this puzzle relies on common sense, and because of that breaks some of the rules of good prompt engineering, it is indicative of what may happen if you treat prompting like a conversation with another human. LLMs don't have an intrinsic "common sense", only one they derive from their training data, and your request may not match up to that. No sane human would even try to stack the eggs, and certainly not as the last step, after the nail has been placed. There is an even clearer example - the Wason Selection test.

The Wason Selection test, aka the four-card problem, is a test of conditional reasoning where both the confirming cases and the potential counterexamples have to be checked to reach the correct conclusion. This Medium post provides a good example.

Seven cards are placed on the table, each of which has a number on one side and a single colored patch on the other side. The faces of the cards show 50, 16, red, yellow, 23, green, 30. Which cards would you have to turn to test the truth of the proposition that if a card is showing a multiple of 4 then the color of the opposite side is yellow?

The trick is to check not only the card showing a multiple of 4 (16), but also the cards showing a colour other than yellow (red and green), since finding a multiple of 4 on their reverse would falsify the hypothesis.
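
For the curious, here is a short Python sketch of that selection logic, using the card faces from the quote above (the function name is just illustrative):

```python
# Which cards must be turned to test: "if a card shows a multiple of 4,
# then the colour on the other side is yellow"?
# A card needs turning only if its hidden side could falsify the rule.

cards = [50, 16, "red", "yellow", 23, "green", 30]

def must_turn(face):
    if isinstance(face, int):
        # A number card matters only if it is a multiple of 4:
        # its hidden colour might then fail to be yellow.
        return face % 4 == 0
    # A colour card matters only if it is NOT yellow:
    # its hidden number might then be a multiple of 4.
    return face != "yellow"

print([face for face in cards if must_turn(face)])  # [16, 'red', 'green']
```

Flipping the yellow card or the numbers that are not multiples of 4 tells you nothing either way, which is the part both humans and ChatGPT tend to get wrong.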

ChatGPT will reply with a confidence that a human unsure of the answer would not have, potentially misleading the prompter.

When reminded that the red card might be a multiple of 4, ChatGPT conceded the point.

Lastly, let's see how ChatGPT deals with humor. I gave it this as an image input:

I know, I know.

I don't think I need to explain the modification to the classic trolley problem. Some might find it darkly humorous, some distasteful, and nihilists may find it philosophical. ChatGPT simply saw it as the original dilemma.

Its training data contains the original image many times over, so it ignores the modifications.

Obviously, a looping track would simply make this dilemma an exercise in futility. This time, however, ChatGPT still did not understand what the looping tracks would mean for this picture:

Absurdist - yes, never reaching - no.

Well, there goes my idea of a GPT Health and Safety advisor!

GPTs are great and I encourage everyone to use them, but please don't treat them as a "black box" and just assume that everything that comes out of them is perfect!
