Reasonings found in a bathtub

Since the end of 2024, the evolution of Large Language Models has been dominated by so-called Reasoning models, with DeepSeek R1 the most recent entry, causing a $1 trillion stock-market selloff last week and dominating the app store.

The other most popular entry is OpenAI's o1 (soon to be succeeded by o3, which is still reserved for the lucky few).

It appears that these models are trained on more than the Internet data we human simpletons have concocted since the creation of the web: AIs are used to generate lengthy synthetic training data illustrating how to go step by step down the path of deduction, which necessarily provides the backbone of argumentation.

These "reasoning" models are reaching new heights in benchmark scores, and with the support of agentic tools and extensions, we once again hear the clamor that humans will shortly be replaced.

However, Sam Altman himself rapidly dispelled the notion that this was AGI - Artificial General Intelligence that could match the level of humans and perform complete jobs.

Experts have concluded that a new law of AI has emerged: give an LLM more time to "think", and it will produce better results. And indeed - these models tend to be slow, pondering for tens of seconds, and producing massively verbose intermediate "reasonings" and outputs.

The models produce lengthy reasonings, then, though it can be argued they don't possess reason! Robert Heinlein once wrote that Ph.D. meant "Piled Higher and Deeper", and sometimes these reasoning models seem to be focusing on quantity rather than quality.

Hence Google's recently released Gemini 2.0 Flash, which blurts out massive paragraphs of introspection at very high speed, albeit at times sounding like Robin Williams in "Good Morning, Vietnam!".

A small fraction of output produced by a 5-line prompt

Anthropic is a laggard in that they don't have an official "Reasoning" model. Claude Sonnet 3.5 isn't a "Reasoning" model, though it frequently seems smart! It's a lot less roundabout when it comes to creating documents and code files, so it tends to be my go-to AI coding assistant. It's still the best at debugging, and I'm in the process of building an MCP extension letting it access the NodeJS debugger, so I'll have more to say about that shortly. However, much to my surprise, I found that Gemini 2.0 Flash was actually better at creating and reviewing detailed design specs, even though Sonnet was already pretty good. More on that in an upcoming episode.


Fun fact: Perplexity now integrates with Claude Sonnet (for its "Pro" mode), but also gives the choice of using its own hosted DeepSeek R1 or OpenAI o3-mini, so you can get a taste of how the different reasoning models are starting to shape up.

(o3-mini isn't the full o3 "Reasoning" model. I tried to use it for coding, but I found it only useful for small-scale generation and analysis - hence the "mini" is very apt.)


While others were using OpenAI's realtime voice mode to emulate a pirate's voice, or hysterical horror movie screams, I decided to put it to use when taking a bath. I usually use that time to "veg out", but other times I have inklings and questionings that come to me Archimedes-style, and that's when I truly appreciate OpenAI's voice interface.

I got started through a desire to optimize my time and work on my LinkedIn articles while soaking. From what I could tell, all the content creators were using AI to enhance their productivity so I figured I needed to get with the program to achieve the necessary cadence.


Thanks for trying

I realized the acoustics of the bathroom were against proper transcription, especially as I had positioned my phone several feet away from the tub. I took a chance on picking up the phone so I could properly orient it, daring to bring it closer to the deadly (for it) waters, increasing the risk that, in a moment of clumsiness, it would join me in the suds and force me to acquire a newer model, as I had learned the hard way once upon a trip to Sarasota.

My phone had been in my pocket. It worked for only a few seconds after this before remaining silent forever.

Nevertheless, such was my interest in maximizing my time that I took the chance and held my phone as near the sudsy water as I dared, with a firm grip, and proceeded apace.


This was using the so-called "Advanced" voice mode.

The previous "Not-so-advanced" voice mode would sometimes drone on with no way to stop the rambling. You had to close the app to get it to stop - kind of tricky when your hands are soapy.

Now we finally have "real time", allowing you to interrupt, like being on a Zoom call with your colleagues! Unfortunately it can also interrupt you - just like real colleagues! Much better adapted to the bath environment.

Reminds me of NotebookLM whose AI-generated hosts are constantly interrupting each other in a fairly effective though tiresome round-table simulation effect. Impressive but a bit too high intensity for bathtime.

I continued with my dictation, hoping it would be less "in my face".

I was perplexed at how often it was interrupting me, which is why I did that sound check - maybe something in the way the microphone was positioned? I was holding it just outside of the tub itself so that if I dropped it, it would only be on hard tile and not directly in the drink.

You may have noticed I was using a little Prompt Engineering trick here - telling it how important each word was, so it wouldn't want to interrupt me. Of course words are a dime-a-dozen so this is just a put-on, but it's important to learn these tricks when working with SOTA ("State of the art") models.

I also tried to lay on a guilt trip by letting it know it disturbed me when it talked while I was doing my thing. These LLMs are, after all, trained to be "helpful assistants".

Part of what I was saying couldn't be transcribed due to the PG nature of the application. A moment of weakness as I was starting to feel a bit of desperation as to getting anywhere.

One good thing I can attest to: ever notice that, in all of the "voice mode" demo videos, the guy never lets ChatGPT finish what it's saying? Kind of rude since he just asked for that information. Anyway, I can confirm that at least that feature actually works, and I was glad to use it. Definitely a big improvement over the "regular" voice mode, and I made good use of it as I got on with my LinkedIn post.


Fascinating the endless variety of interruptions ChatGPT can come up with!


Finally I had enough of the constant banter and decided to shut it down for now.


Many hours of fun

One day, having at my disposal the power of a brain trained with the entire Internet including movie screenplays, TV plots, and open source books as fodder, I was curious to see if it could channel some novel and captivating tales tailored to my tastes, as they say, simply by choosing the right prompt. Who hasn't imagined being a world-builder like J.R.R. Tolkien, G.R.R. Martin, J.Y.K. Lee or DBC Pierre? All you need are three initials and ChatGPT.

A classic mystery trope: the elusive spike in traffic!
Hanging on the edge of my seat
My apologies to my boomer friends, any resemblance to anyone I know is just a coincidence

And so on. They say to write about what you know, but in the end it wasn't as gripping as you might have thought.


One day as I was submerged in the hot liquid, almost in a trance, I wondered - how does an LLM perform additions on arbitrary numbers? Surely it can't be that all the numbers can be found in the Internet data it was trained on? Would that mean that, for very big numbers, an LLM would have trouble adding 1? Luckily I keep my phone near the tub just in case inspiration strikes for my next LinkedIn article.

I activated GPT-4 voice mode and tried to come up with a big number to put it to the test:
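
(If you'd rather run this test from a keyboard than a bathtub, here's a minimal sketch using OpenAI's Node SDK - the model name and the choice of number are my own, and BigInt provides the ground truth:)

```ts
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function main() {
  // A number big enough that it's unlikely to appear verbatim in training data
  const big = 10n ** 40n + 7n;
  const resp = await client.chat.completions.create({
    model: "gpt-4o", // substitute whichever chat model you have access to
    messages: [
      { role: "user", content: `Compute ${big} + 1. Reply with only the number.` },
    ],
  });
  console.log("model:", resp.choices[0].message.content?.trim());
  console.log("truth:", (big + 1n).toString());
}

main();
```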


Who better than an LLM to explain its own architecture, right? Well, pretty much anyone, actually - LLMs are kind of hand-wavy unless you get very specific. Nevertheless, I figured I could finesse something interesting out of it using some "outside the box" tacks.

After finally getting my question out, we got to the heart of the matter:


To be honest, asking the right question was proving to be a challenge (and not just because of the constant interruptions!)

At that point, I decided to focus on the suds and leave the mystery of adding 1 via billions of operations for another day.


I came back to it during another soaking session:

By default, it was bombarding me with concepts faster than my relaxed brain could hang on to. However, after I asked it to go slower, it seemed to give me the silent treatment. After a dozen seconds of silence (giving me, a slow human, an opportunity to catch up??) it came back and we explored some concepts:

To be honest - I was a bit surprised that it didn't tell me my matho-babble didn't really mean anything, but it seemed to grok what I meant, and I was satisfied that I had put my bathing time to good use. My mental model at that point was inspired by the trellis encoding of dial-up modems.

The way bytes were superposed onto the same signal in order to achieve maximum acoustic throughput was an analogue for how I pictured the massive matrix operations on the simultaneous transformer inputs, so I was trying to confirm it with ChatGPT.
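
To make the analogy concrete, here's a toy sketch of the picture I had in my head - every token position gets pushed through the same weight matrix in one batched multiply (the dimensions and values are invented for illustration):

```ts
// Toy version of one transformer projection: all token positions (rows of X)
// are transformed by the same weight matrix W in a single batched multiply.
function matmul(X: number[][], W: number[][]): number[][] {
  return X.map(row =>
    W[0].map((_, j) => row.reduce((acc, x, k) => acc + x * W[k][j], 0))
  );
}

const tokens = [
  [1, 0], // position 1
  [0, 1], // position 2
  [1, 1], // position 3 - toy 2-dimensional embeddings
];
const W = [
  [0.5, -1],
  [2, 0.25], // one shared 2x2 projection
];
console.log(matmul(tokens, W)); // all positions transformed "simultaneously"
```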

Flattery will get you everywhere!

One day I saw a pundit on Youtube advancing that AI translation algorithms had greatly benefited from the discovery of the Aymara language, which was reputed to have such a singularly mathematical nature that it was probably the creation of a lost advanced civilisation. Later, as I soaked in steaming hot water, my curiosity about the Aymara language and its link to AI and ancient lost civilizations came back to me.

The vlogger had said that the language was organized in a series of matrices which had me puzzled.

I started slow - I didn't actually want to learn to speak it, just to grasp this notion that it was freakishly amenable to AI and therefore likely the product of the lost civilization of Mu or other Atlantean peoples.

And soon I was trying my hand at constructing some Aymara sentences:

Eventually I got back to my original curiosity, which was how the language was reputedly organized algorithmically making it suitable for AI:

As a long-time observer of the "Ancient Aliens" grift, I should have known that the rumors of lost-civilisational parlance were likely exaggerated; however, I did find this little exploration amusing.


As fun as this was, I assumed o1 when it came out would give me an enhanced entertainment experience. I was not disappointed.

I decided to revisit the infamous Monty Hall paradox, which I had written about earlier in my exploration of AI capabilities. At the time I was kicking the tires of GPT-4 - seems like ages ago...

In that episode, I had gotten to the point where GPT-4 admitted there was something fishy, but it wasn't able to turn its back on the collective misguided opinion that Monty Hall's opening of a door suddenly made it more likely that the other door was the one with the car.

I decided to see if o1, the most powerful model when it came out a few weeks ago, would be able to get to the bottom of this.

The key element in the commonly-held explanation of the Monty Hall problem is that Bayes' theorem applies: the act of opening a door reveals information that alters the probabilities of winning the game.
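
For readers who haven't seen it spelled out, here is that textbook calculation in miniature - a minimal sketch, and to be clear, it's the very account I was about to challenge. Assume the contestant picked door A and Monty then opened door C:

```ts
// Textbook Bayes account of Monty Hall (the account under dispute here):
// the contestant picked door A, and Monty then opened door C to reveal a goat.
const prior = { A: 1 / 3, B: 1 / 3, C: 1 / 3 }; // where the car might be
// P(Monty opens C | car location): he never opens the picked door or the car's door
const opensC = { A: 1 / 2, B: 1, C: 0 };
const evidence = prior.A * opensC.A + prior.B * opensC.B + prior.C * opensC.C; // 1/2
console.log("P(car behind A | C opened):", (prior.A * opensC.A) / evidence); // 1/3 -> stay
console.log("P(car behind B | C opened):", (prior.B * opensC.B) / evidence); // 2/3 -> switch
```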

However, in my first attempt, the "Law of Total Probability" had nearly pushed GPT-4 over the edge, so I decided to have o1, a Ph.D.-in-a-box, confront the two.

After 18 seconds of cogitation, I got this cop-out:

So basically the same old handwaving that most know-it-alls parrot, except - longer, MUCH longer!

Finally!!!

A "beautiful example" of a snowjob, yeah! Admittedly, this o1 was brilliant in its output, though it was the same old baloney. I was annoyed and decided to push it to recant.

I decided to use my table with the 8 outcomes - surely the visual evidence would make it undeniable!


I'll spare you the overblown bla bla bla trying to deny the evidence of my eyes with some high-quality hand waving.

The ruthless superhuman Apparent Intelligence was ready for me at every turn, though it did have to spend 24 seconds to come up with a simple answer to my question about the number of rows - most of that time likely spent coming up with a devious argument to refute the implications of my question. Thinking 5 steps ahead!

I made an effort to convince it that each row had the same probability but it would refute this with a variety of tricky arguments too long to reproduce here - I will just include some choice cuts:


Disinformation confusion: flooding the zone with specious arguments

I try to fight back by calling out the irrelevant nature of the arguments


Now we got into the crux of the argumentation - that the information revealed by opening a door partitions the resulting set of possibilities and reduces the probabilities. This seems perfectly reasonable and mathematical in general, though it's actually nonsense in this specific case:

It seems that there's no end to its ability to generate various types of argumentation, no matter what arguments I give it or what indirect approaches I take through side questions. The training given to these reasoning models yields a higher level of sophistry than we've seen before. No matter what argument I made, it came back with a miniature essay proving me wrong, in ways I couldn't even imagine!

The old "Not a Tautological Fiat" argument

Now it was mocking my argument as "naive", employing my human emotions and fragile ego against me!

I was starting to think that humanity would once again be defeated.


Lots of bold here in this "Final Answer"

Again a slick argument, with just the right amount of hand-waving to come back to its predetermined conclusion.

Not making headway by directly tackling the probability distribution, I decide to try an analogy with the distributions you get when throwing dice, but - no dice!


Unshakeable faith in the classic answer

I decided to focus on getting it to admit that no actual usable information is revealed by the host, which invalidates Bayes' theorem, as it is based on using new information to narrow down a set of probabilities.

bla bla bla... "I see where you're coming from" it says, trolling me, guessing at the eventual implications of my point.

I can see a chink in the armor - the "probability of accidentally revealing the prize" is just a hallucination that has nothing to do with the problem! This is a good sign for humans, as it is clearly arguing in bad faith, so I'm guessing it has run out of "reasoning"!

Now I can make the point - frequently interrupted, although the transcript doesn't show it - that the information revealed by Monty Hall is only things we already knew: that one of the remaining doors has the prize and one has the goat. So no new pertinent information is revealed. We're in the same position as someone who joined us at this moment, so why would we have different odds than they do?

Now we see how o1 is trying to gaslight me - that I'm letting my "feelings" affect my judgement. A lot has been said about AI's capability to convince people and this is a great demonstration of that capability.

Notice how it says that the following information is supposed to affect the odds of picking the right door, when in fact this is nonsense:

  • "Monty doesn't choose randomly" - so what?
  • a newcomer thinks the situation "started with two doors", while I "have context" that "the situation began with three doors" - again I say: so what?

These are statements that resemble sound argumentation but are actually hollow. I press my advantage:


Now I tell it that because there's no new relevant information, we can't use Bayes' theorem:


I throw in a curveball: an explanation that it can't refuse, namely that humans are not very good at probabilities, which it has to agree to.

Interesting how now the answers are coming back rapidly - no long reasoning time at this point.


I throw in another punch: the origins of the commonly accepted solution to the Monty Hall problem make it likely that humans collectively accepted a wrong explanation:

Typo: the actual name is Vos Savant - a prophetic name if ever there was one

Now I hope to deliver the "coup de grâce" cherry on top of the sundae by reminding it that its answers are conditioned by erroneous training data, in which thousands of bloggers repeat the same incorrect story about Monty Hall:

I felt certain that my argumentation would convince it, but its LLM nature remained conflicted, not unlike my prior experiment with GPT-4. Disappointed, I had a human-level intuition: let's try to find an equivalent in the financial world and see if there is evidential data that can give a definitive argument.

I ask it to come up with a scenario in which a broker - playing the role of Monty Hall - tries to sell three different securities - the doors - to a customer - the contestant.


"We don't disclose the junk up front, but we ensure that you're left with the most attractive investment choices." o1, remind me not to ask you to help me with sales pitches!

So we go through a few variations on this. Finally, we choose the following scenario:

A brokerage is selling three securities. All three look solid from the customer's point of view. Two of them are in fact solid investments with minimal margin - "sure things" that the brokerage is reselling with very little markup. The third also looks good, but the brokerage has secret AI research telling them it's actually a dud, and they were able to use that to negotiate a good price, so if they sell this investment, they make a lot more money.

So 2 out of 3 times, they essentially break even, but the third time they make a big profit.

This is a reverse Monty Hall, where we have two prizes and one goat, but the principle remains the same.

At first, the customer is given the choice of all 3 securities. Then, the broker comes up with an excuse to remove one of the investments that is low margin, leaving a low margin and high margin investment on the table. (For example he can say it was just sold to another customer.)

Using the commonly accepted Monty Hall logic, the client is most likely to buy the high margin (i.e. worthless) investment if he switches, so the broker comes up with a song-and-dance to make the client want to switch. For example, he could explain that it's a little known trick based on the Monty Hall challenge, that only very savvy investors would know! A lot of investing advice is like this. As a result, the brokerage manages to sell a lot of high-margin bad investments.
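
Under the scenario's own assumptions, a quick tally sketch (the security numbering and the uniform initial pick are my own framing) shows how often the switching client lands on the dud:

```ts
// Tally of the brokerage scenario: securities 0 and 1 are the solid
// low-margin "sure things"; security 2 is the high-margin dud.
function switchedIntoDud(): boolean {
  const pick = Math.floor(Math.random() * 3); // client's initial choice
  const others = [0, 1, 2].filter(s => s !== pick);
  // The broker withdraws a *solid* security the client didn't pick
  // (at least one of the unpicked securities is always solid).
  const removed = others.find(s => s !== 2)!;
  const switched = others.find(s => s !== removed)!; // client switches to what's left
  return switched === 2;
}

const N = 100_000;
let duds = 0;
for (let i = 0; i < N; i++) if (switchedIntoDud()) duds++;
console.log(`switched into the dud ${((100 * duds) / N).toFixed(1)}% of the time`);
```

Two times out of three, the switcher walks away with the dud - exactly the "most likely to buy the high-margin investment" outcome described above.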

Supposing this were true, it would have become a commonly used sales tactic at some point, but - and o1 has to admit this - no one has ever used the Monty Hall technique to make money. It's not proof, but it is compelling evidence that it is highly unlikely to have worked.

Success! I take my victory lap:

Now I know AI is just Apparent Intelligence, a simulation, not unlike having an NPC in a game discuss things with you.

Yet the simulation is so advanced that the conversation went from impersonating a cocksure close-minded boor finding bad faith arguments to defend its position, to an exploratory mode where it was partially convinced but willing to explore the idea, to finally being converted to my reasoning.

Nice to have the last word, but it's kind of unsettling to what extent AIs now exhibit a very convincing appearance of reasoning, and how well-equipped they are to provide us with the gift of the Blarney Stone! (and charge us an arm and a leg due to excessive token usage!)

And if today I was able to best it in this debate, I'm not sure the same will be true when I go for the 2 out of 3 with o3!


Sometimes called "the LLM whisperer" (by LLMs), Martin Béchard enjoys a vigorous debate with the latest AIs. If you need help convincing your Agents or Coding Assistants to do what they should, please reach out at [email protected]!
