Reasoning AI Coding Bakeoff - Part 1 of 3
The other day I was asked "Hey good lookin', what's cookin'?", something I haven't heard as frequently as I used to. (Perhaps time for a haircut? And in case you're wondering, the AI-generated image of a programmer at the top is always bald because that's what you get most of the time when you say the programmer is over 50!)
In case you've been wondering, I've been a bit too busy to write much as I've been heads-down on a project featuring AI Coding, turning to - you guessed it - Claude Sonnet to help me with my development. I've been working on MCP extensions for Claude that put it in the driver's seat and are quite a game changer. But since it's still a work in progress, that story will be for another time.
I was therefore using the Cline extension in Visual Studio Code with the Claude APIs. I had a few days where the totals started really adding up, pushing me to consider alternatives.
This led me to experiment with the latest slew of "Reasoning" AIs including Google Gemini 2 Flash, OpenAI o1 (release edition), o1 mini (I heard it was pretty good though not technically a reasoner), and DeepSeek Chat and R1. Then, as I was writing this article, we got Claude 3.7 Sonnet, o3 mini "high", and a new Mistral. To be honest, not all of these new models are "reasoning" models, but they approach it.
I gave them all the same challenge: generate some code from specs, generate test plans and unit tests, run the tests, and fix the bugs.
Considering that I was racking up a decent bill with Claude, I decided to create the specs themselves with Google Gemini 2 Flash as I could use it for the low low price of $0! That's right - gratis! Free! There's a usage limit but I hardly ever hit it. And though Claude is still the most coherent and astute Coding assistant, Gemini did a fine job with the specs.
There was a lot to consider and learn from this experiment, in which I really went through the full cycle.
Your mission, should you choose to accept it
The challenge I chose for my AI friends was to create a two-way serialization module from JSON objects to and from Markdown files.
This module was not so big that it would break the bank as I went repeatedly through the full development cycle with each model, yet it was sufficiently complex that bugs could easily creep in, especially as it had to be flexible enough to handle diverse input structures.
This is the kind of project that sounds simple enough that it's tempting to just kick it off with a prompt, then keep adding prompts until you should have the whole thing done. Instead, you end up with thousands of lines of useless "AI Spaghetti Syndrome" code (commonly known as "A.S.S. Code" among human coders).
Guardrails-Driven Development
This is where the Guardrails come into play:
The AI will perform the following tasks:
For each step and each model, we'll look at the cost, code quality, and ability to resolve problems in the code.
Build one (or two, or three...) to throw away
I decided to start with Gemini 2 Flash as it was the only free reasoning model - it would make the repeated attempts less painful. Plus I was curious to experiment with its somewhat quirky experimental reasoning. Unlike its rivals, Gemini Flash is an open book, sending back a torrent of "thinking" to keep me entertained while waiting for the query to finish. (I tried the non-Flash Gemini, a.k.a. Gemini "Slow" - it didn't do the work better than Flash, and it was slow!)
I had a mega prompt, developed over previous projects, that described how to create a spec. I would start by describing what I wanted it to do, then the AI would generate various artifacts such as class diagrams, sequence diagrams, and processing flowcharts. Then I would get it to generate a list of files and attribute the functions from the diagrams to the files in the list. Finally I would generate a spec for each file.
However...
I decided to basically let it loose (known as "vibe coding") considering its propensity to be verbose and fill in the gaps. I just gave it a number of examples to illustrate the transformations that it needed to implement, in both directions.
I was pleasantly surprised at how good Gemini was at intuiting the necessary steps to accomplish the examples, without the diagrams and descriptions that I had used previously. I introduced a requirement to systematically explain each business rule and design decision with a "5 WHYs" approach. I changed it to "5 BECAUSE" so it would read better.
IF markdownText is empty THEN
    RETURN empty array
        BECAUSE we need to handle the case where the input is empty
END IF
Initialize sectionStackManager to new SectionStackManager
    BECAUSE we need to manage the section stack
Create inputLines object with lines from markdownText and consumed = 0
    BECAUSE we need to track the consumed lines
WHILE inputLines.consumed < inputLines.lines.length
    Call parseSection with inputLines
        BECAUSE we need to parse the section
    Set newSection to the result of parseSection
        BECAUSE we need to store the parsed section
    Call sectionStackManager.addSection(newSection)
        BECAUSE we need to add the section to the stack
END WHILE
RETURN sectionStackManager.getTopLevelSections()
    BECAUSE we need to return the top-level sections
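To make this concrete, here is a minimal TypeScript sketch of how that pseudo-code might translate into actual code. The Section and InputLines shapes, and the SectionStackManager and parseSection stand-ins, are simplified guesses at what the real design describes, not the generated code itself.

```typescript
// Hypothetical shapes inferred from the pseudo-code; the real design had more detail.
interface Section { title: string; children: Section[]; }
interface InputLines { lines: string[]; consumed: number; }

// Minimal stand-in for the SectionStackManager the spec refers to.
class SectionStackManager {
  private topLevel: Section[] = [];
  addSection(section: Section): void {
    // The real logic nests sections by heading level; here we just collect them.
    this.topLevel.push(section);
  }
  getTopLevelSections(): Section[] {
    return this.topLevel;
  }
}

// Placeholder parseSection: consumes one line and wraps it as a section.
function parseSection(input: InputLines): Section {
  const line = input.lines[input.consumed];
  input.consumed += 1; // always advance, so the WHILE loop terminates
  return { title: line, children: [] };
}

export function parseMarkdown(markdownText: string): Section[] {
  // BECAUSE we need to handle the case where the input is empty
  if (markdownText.length === 0) return [];
  // BECAUSE we need to manage the section stack
  const stack = new SectionStackManager();
  // BECAUSE we need to track the consumed lines
  const inputLines: InputLines = { lines: markdownText.split("\n"), consumed: 0 };
  while (inputLines.consumed < inputLines.lines.length) {
    // BECAUSE we need to parse each section and add it to the stack
    stack.addSection(parseSection(inputLines));
  }
  // BECAUSE we need to return the top-level sections
  return stack.getTopLevelSections();
}
```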
Having created the design, I used previously developed prompts that can take the specs and create code from them. Then I used another prompt to create tests.
Overall, the different parts of the code were coming together. Much to my chagrin, after applying a few code fixes for failing tests, Gemini started getting very confused and the code became a mess.
This is when I decided to retrofit any fixes back into the design specs and use them as the one source of truth.
Always be updating (the specs)
Humans usually write specs with a higher level of abstraction in order to capture the essence of what the code needs to do without getting bogged down with details.
And when the code is written, "No plan survives first contact with the enemy", so the design doc becomes obsolete and is never updated.
I realized that for LLMs this is a risky optimization: it is cheaper to adjust the specs as much as possible rather than do a lot of code refactoring and pay for the same functions multiple times. This assumes that generating the functions, test plans and tests, as well as any troubleshooting, is more expensive than creating the specs.
After my first large-scale failure, I went back to my guiding examples. At first I just had a direct transformation from JSON to markdown (shown previously). I decided to make it easier by introducing an intermediate step which was simpler to target from either end of the transformation. This proved to be the right level of complexity for Gemini to create a design which worked with minimal design and code defects.
With my new examples in hand, I DELETED ALL MY CODE! AND ALL MY DESIGNS!! (I had a backup just in case!)
I didn't have to worry, because Gemini went through the whole process like a champ; I just had to be patient because it did take time to recreate all the artifacts from scratch!
Reducing scope
Since this was free with Gemini, I repeated the cycle several times until I was happy with the results. I also introduced a design validation phase (described later) to catch conceptual bugs before having created any code. However, I decided that all that design generation would add up pretty fast on my credit card with the other "paid" LLMs, so I decided to just use the designs created by Gemini for all of them and stick to coding for the comparison.
Perhaps in the future I might do another bakeoff for designs, but I see them as a means to an end. I can always turn to Gemini for this part, then hand off the coding to the LLM du jour that does the best implementation and troubleshooting job.
In any event this still let me compare the coding, testing and troubleshooting capabilities from the same requirements, so I think it makes for a more deterministic experiment, since designs could otherwise vary a lot. Kind of like how Olympic competitors have to perform the same imposed exercise, which gives the judges a chance to compare the same moves for each participant.
Gemini's Design
I include here one of the main processes designed by Gemini as an example.
Gemini generated it in one shot without missing a beat.
There's a lot of text, but I was pleasantly surprised that most of the LLMs seemed capable of following all of the details, and quoting them when needed e.g. when troubleshooting and proposing a fix.
There were three other specs to create, but I'm just going to focus on this one for the purpose of this already very long article.
Design Validation
With the spec in hand, I then asked Gemini to trace through the pseudo-code for each of the original examples to see if indeed the proposed algorithm would generate the desired outcome.
For each of the guiding examples, Gemini went line-by-line through the design spec and explained what data changes happened at each step. When done, it would compare the final expected result with what the example required. I ended up with 30 different analysis files, using about 20 million tokens! But with Gemini, 20 million X $0 = $0! I just hope Google doesn't raise the price at some point, because this would become prohibitively expensive.
This line-by-line analysis is a way to detect any conceptual defects in the current design, and it's cheaper to do this than to generate the code and have the LLM try to debug it. It's well known in software development that the later the stage at which you fix a defect, the more expensive it is, and that's certainly true of AI coding, as the AI struggles to correct the right thing. Plus you may have specs to update, tests to change or create, etc.
The AI finds discrepancies between what the pseudo-code would produce and what the original examples said should be produced.
It then cycles through each discrepancy and fixes the design. And then does the validation again. And again. And it has to regression-test the other examples. In short - this is very time consuming, but it effectively avoids gaps in the design.
Gemini got so taken by the role playing that it started suggesting possible code fixes and investigation tactics, including logs from an imaginary run! Reminder that at this point, we're only conducting a validation of specs - no code at all!! It even suggested there might be a possible CSS issue!
I had to remind it that we were only working on the design, and it apologized for the confusion.
Nevertheless, though conceptual, a problem was indeed detected, allowing us to fix the spec up front before having any actual code.
In another case, it incorrectly followed the execution flow, but in reviewing the spec, it was able to discover the discrepancy and correct the trace.
Once the code is finally generated and tested, it's always possible, and in fact desirable, to change things in the specification and update the other artifacts accordingly, so even if not all problems are found, that's not a big deal. However, this verification helps reduce the odds of any major issues which would require a lot of $ to fix when using the paid models.
This verification requires a lot of tokens too but I can do it with Gemini so it's free! And the paid models benefit from a cost reduction as they start with carefully reviewed specs.
In the case of an Agentic developer application, we should expect this kind of validation to be built in, and in truth, humans wouldn't have the patience and stamina to go through all the functions with all the guiding examples to see if their designs are correct. This is definitely an AI forte.
- Human: big-picture intuition; applies common or novel design patterns rapidly according to the context of the problem and the desired outcome.
- AI Assistant: painstaking and detail-oriented; capable of applying typical design patterns and performing in-depth line-by-line verifications across multiple documents.
The combination of both seems like it would improve the final quality.
Time to Code!
Happy with the verification, it was time for code generation, for which I used the following prompt:
CREATE CODE
Read the document design/parse-process.md.
Create a file called src/markdown-serialization/parse/types.ts and put in the base types from the file “design/markdown-serialization-examples.md”, with detailed JSDoc comments for each type.
Create a file called index.ts that will export various files and types from the other files in the folder, but not contain any types or functions itself.
Generate code in src/markdown-serialization/parse for any missing function or class ONLY
Create a separate file for each major function or class described in the document ONLY if the function or class doesn't exist
USE ESM imports meaning imports of a file require ".js" at the end
USE the '@/' alias instead of ".." in import paths.
Add an import { MarkdownSerialization } from "@/markdown-serialization/types.js";
DON’T stop to fix imports because multiple files need to be created to avoid typescript errors
import { Tracer } from "@/lib/tracer.js";
For each function, include input examples with corresponding outputs in the JSDoc
For each statement of the function, explain what design rules apply
For each rule coming from the design document, add a call to Tracer.log that starts with `RULE: <hierarchical rule number>: <summary> : value1=`, value1, ` value2=`, value2, where a small summary is provided, as well as the values used in applying the rule. Base the numbering on code block nesting. DO NOT USE separate constants for the messages to keep the code compact.
IMPORTANT: Do not use JSON.stringify on objects because there could be circular references, let Tracer.log do the serialization or call Tracer.stringify as it can handle circular references.
IMPORTANT: NEVER use `replace_in_file`, it doesn't work. Always use `write_to_file`
IMPORTANT: CHECK FIRST if a file exists before creating it. SKIP IT if it already exists.
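To give a sense of what these directives produce, here is a hand-written sketch in the style the prompt asks for. The function, the rule numbers and the JSDoc examples are illustrative rather than actual generated output; only the Tracer import path comes from the prompt above.

```typescript
import { Tracer } from "@/lib/tracer.js";

/**
 * Splits a markdown heading line into its level and title.
 *
 * @example
 * parseHeadingLine("## Results")  // => { level: 2, title: "Results" }
 * parseHeadingLine("Plain text")  // => null
 */
export function parseHeadingLine(line: string): { level: number; title: string } | null {
  // Rule 1: a heading starts with one or more '#' characters followed by a space
  const match = /^(#+)\s+(.*)$/.exec(line);
  Tracer.log(`RULE 1: match heading prefix : line=`, line, ` matched=`, match !== null);
  if (match === null) {
    // Rule 1.1: non-heading lines do not start a section
    Tracer.log(`RULE 1.1: not a heading, returning null : line=`, line);
    return null;
  }
  // Rule 2: the heading level is the number of '#' characters
  const level = match[1].length;
  // Rule 3: the title is the text after the '#' prefix, trimmed
  const title = match[2].trim();
  Tracer.log(`RULE 2: level and title extracted : level=`, level, ` title=`, title);
  return { level, title };
}
```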
One good thing about these latest AIs: I don't have to explain any typical coding constructs because the models are good enough to figure out most of it.
Single File Single Responsibility
The prompt includes instructions to put each function into its own file. There are several benefits to this:
It's also the S in S.O.L.I.D.! So all best practices.
Coding With Examples and Explanations
When initially troubleshooting the AI-generated code, I would sometimes ask myself: "What the heck is this? Why is this function being called and what is it supposed to return?" And I was none too pleased as you can imagine.
A lot of developers have complained that the AI generated code 1) has bugs, 2) is not always comprehensible, and 3) is a pain to troubleshoot. To help with comprehension, we can ask the AI to generate comments, but often those comments explain very little.
// Add 1 to the counter
counter += 1;
// Call the fooBar function
fooBar(counter);
I've been experimenting with the "5 WHYs" approach which, if you're fortunate, you got trained on when your 5-year-old started imagining that you know everything there is to know. Each answer brings another question. I use this approach in the generated design and in the generated code.
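For contrast, here is roughly what that same kind of vacuous snippet reads like once a BECAUSE chain is required (illustrative only; the surrounding declarations are just there to make the snippet stand alone):

```typescript
// Hypothetical context so the snippet compiles on its own.
let counter = 0;
const fooBar = (count: number): void => {
  console.log(`flushing after ${count} lines`);
};

// Add 1 to the counter
// BECAUSE each parsed line must be counted exactly once
// BECAUSE the caller relies on the count to know when the input is exhausted
counter += 1;

// Call fooBar
// BECAUSE fooBar flushes the accumulated lines into the current section
fooBar(counter);
```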
The other thing I found annoying is that the AI often gave non-committal function names such as "handleLine", "processSection", etc. These names don't really say what the result of calling the function is so much as when it gets called. This can be argued to be a question of style, but when drilling down into a failing AI-generated function, it can lead to head-scratching as you wonder what the heck the AI was trying to do.
For this reason, the prompt instructions include generating one or more examples of inputs and corresponding outputs to make it clear at a glance what ought to be happening.
Tracing
One of the directives tells the AI to trace each line of each function to compensate for not having a debugger. This way as a test is executed, a detailed debugging file is prepared that the AI can analyze as part of troubleshooting. I found that this was an effective way to avoid random guesses as LLMs are wont to do.
The code is not what a human would write and could potentially be a performance problem. But it would be easy enough to disable these functions in a production build, or just strip them out with a simple `sed` script. The result produces a detailed step-by-step record of what happened during the execution, as a way to compensate for the lack of a debugger.
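The Tracer utility itself isn't shown in the article; a minimal sketch of what such a helper might look like, assuming it simply appends serialized arguments to trace.log and tolerates circular references, is:

```typescript
import { appendFileSync } from "node:fs";

/**
 * Minimal tracing helper: serializes its arguments (tolerating circular
 * references) and appends them to trace.log so that a failing test leaves
 * a step-by-step record the AI can analyze afterwards.
 */
export class Tracer {
  /** JSON.stringify wrapper that replaces repeated object references with "[Circular]". */
  static stringify(value: unknown): string {
    const seen = new WeakSet<object>();
    return JSON.stringify(value, (_key, val) => {
      if (typeof val === "object" && val !== null) {
        if (seen.has(val)) return "[Circular]";
        seen.add(val);
      }
      return val;
    });
  }

  /** Serialize every argument and append one line to the trace file. */
  static log(...parts: unknown[]): void {
    const line = parts
      .map((part) => (typeof part === "string" ? part : Tracer.stringify(part)))
      .join("");
    appendFileSync("trace.log", line + "\n");
  }
}
```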
I found that all of the AIs had, to some degree, the ability to correlate the traces with the test output and the source code, although Claude Sonnet and o1 were better at deducing the error. I also experimented with having the AIs step backwards through the trace to find a point where the expected intermediate values differ from the actual values. I made this part of the "Root Cause Analysis" prompt:
Prior to this step, the AIs would habitually get stuck in a loop adding then removing the same fixes over and over again. It was basically Fake It 'Till You Make It and not that useful.
This approach leverages the LLMs' pattern matching to actually zero in on issues.
This shows promise though it's still slower than a human by a long shot. This is why I'm working on a Node debugger for MCP that will allow Claude to access code directly. Not sure if it will be better but there's potential! More on this in a future episode.
Test plans
AIs can be made to scan code files and perform extractions of useful information - essentially a transformation of the code into, for example, explanations or even tests.
Indeed, one of the simplest ways to get a test suite is just to ask for one and the AIs usually have enough training data with unit tests that they spit out something that looks like a test suite.
However, these suites tend to degrade over time as modifications, additions or corrections are done by AI, which defeats the purpose.
Enter the test plan!
AIs are good at following instructions to create specific sections of documents and populate them with useful information from the provided input document (i.e. a code file in this case).
I came up with the following prompt after a few iterations to create the test plan based on code analysis:
CREATE the folder test/markdown-serialization/parse if it doesn’t exist
In the folder test/markdown-serialization/parse, CREATE a document called <module name>-scenarios.md for each file in src/markdown-serialization/parse SKIPPING types.ts or index.ts, and SKIPPING any existing scenarios document
GO through the lines of code of the functions in the file and identify the input values needed to exercise all of the code paths.
FOR EACH conditional statement, explain what input values can directly or indirectly determine which branch is taken
FOR EACH loop, explain what input values can directly or indirectly determine how many iterations should be done
FOR EACH external call, identify the values that need to be mocked to exercise all of the paths after the external call.
ADD a final section of scenarios consisting of:
- Synopsis
- Purpose of test
- Input values
- Mock values
- Outcome expectations
First-order correctness
From my days of C programming I've learned (from the immortal P. J. Plauger) to create unit tests that at least provide first-order correctness. This is the principle that you should at least have enough tests to exercise each line of code at least once. In fact, this is only a small subset of all possible paths through the code, so I also add additional negative tests, edge cases etc. but it's a good organizing start. Better than "Vibe coding" the tests.
For this purpose, the prompt to create test plans has the LLM first go through the code, identify these critical code paths, and identify the values or classes of values needed to cover the full function. Then I have it generate test cases that will ensure these values are covered. It won't generate all possible combinations - just enough to be able to say it ran through the full function.
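As a toy illustration of first-order correctness: a function with a single conditional needs at least two test cases, one per branch, before we even start thinking about edge cases. (The function below is hypothetical and written vitest-style, like the rest of the project.)

```typescript
import { describe, expect, it } from "vitest";

// Hypothetical function under test: escape '#' so a title can't be mistaken for a heading.
function escapeTitle(title: string): string {
  if (title.includes("#")) {
    return title.replace(/#/g, "\\#"); // branch 1: escaping needed
  }
  return title; // branch 2: nothing to escape
}

describe("escapeTitle first-order coverage", () => {
  it("escapes '#' characters (covers the true branch)", () => {
    expect(escapeTitle("Issue #42")).toBe("Issue \\#42");
  });

  it("returns the title unchanged when there is nothing to escape (false branch)", () => {
    expect(escapeTitle("Plain title")).toBe("Plain title");
  });
});
```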
I then review the plan to ensure the whole approach makes sense, suggesting amendments as required for completeness or correction.
This approach is particularly useful for saving on tokens, because these adjustments to the plan are in fact a lot cheaper than applying changes to the test suites themselves.
To save money, you need to spend money
Modern development teams apply the principle of "shift left" (i.e. building in quality as early as possible in the lifecycle) as a kind of faith, as there isn't usually a metric that can prove the savings - everyone being too busy getting things done to conduct productivity experiments.
However, there is an intuition that the bug fixing that happens while developing is a lot more efficient than letting the bugs go through QA, then having to investigate possibly obscure symptoms, create tickets, jump back into code we thought was done, etc.
With AI this becomes even more obvious, and indeed painful when paying for each token needed to diagnose and correct the issue! So getting sufficient coverage defined in the test plan pays dividends when we get to running and troubleshooting the actual tests.
Confessions of a Mock Miscreant
Full disclosure: as a proud GenXer, I've been known to have witty repartee and stinging one-liners, as growing up in the shadow of the boomers tended to cultivate; yet when it comes to coding, I've never been much of a mocker. (Womp! Womp!)
Mocking, in the modern coding parlance, is the practice of creating fake objects in order to better control the unit test execution with minimal or no dependencies.
In the olden days, programmers would create "stubs" which were pieces of fake code that would simulate the execution of other parts that weren't yet available.
With Test-Driven Development, we need to create unit tests simultaneously with the coding, to capture an external view of the code being developed more specifically than the specifications can. This is coding, this looks like QA, but it's actually detailed design because it defines the box in which the code must fit.
One problem though is that the "unit" of code under development may need to use other modules which don't exist yet, or even when they do, may be difficult to control or have bugs, complicating the validation of the "unit" of code under test.
It's no coincidence that these are known as "unit" tests: they are supposed to validate the single piece of code we are working on in isolation, which creates a tension when that code can't actually complete its mission without other parts of the system.
This is why the practice of "mocking" these external parts became popular, and indeed became unavoidable among Agile TDD enthusiasts. Specialized libraries were created to replace the real behavior with a simpler simulation whose outputs can be controlled in the context of the test.
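Here is a representative sketch of such a mocked unit test, written vitest-style; the module paths and the fake parseSection behavior are hypothetical, not the project's actual generated code.

```typescript
import { describe, expect, it, vi } from "vitest";

// Set up the mock before importing the module under test, so the fake takes effect.
// (Module paths and return shapes are hypothetical, for illustration only.)
vi.mock("@/markdown-serialization/parse/parse-section.js", () => ({
  parseSection: vi.fn((input: { lines: string[]; consumed: number }) => {
    input.consumed = input.lines.length; // consume all lines so the caller's loop terminates
    return { title: "Fake section", children: [] };
  }),
}));

const { parseMarkdown } = await import(
  "@/markdown-serialization/parse/parse-markdown.js"
);

describe("parseMarkdown (unit test with a mocked parseSection)", () => {
  it("returns whatever sections the mocked parseSection produces", () => {
    const result = parseMarkdown("# Anything\nsome text");
    expect(result).toEqual([{ title: "Fake section", children: [] }]);
  });
});
```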
However, in my experience, the mocking could add significant time and complexity to the test development and create technical debt over time as other parts of the system evolved. I was therefore guilty of not having much faith in mock objects.
In fact, some of the least productive developers I've known had gone down the rabbit hole of creating simulations of complex aspects of the system including the ASP.Net runtime and other large-scale libraries, to the point that the mocks rivalled the real code in size and complexity. Inevitably these mocks would become untended and rot away until one day, someone removed them from the automated build. Sad!
The Need for Mock in AI Coding
Working with LLMs, though, systematic unit tests and mocking have become must-haves for me: they act as automated guardrails and detailed specifications of what the AI has already done, so that, as the project evolves, deviations introduced by the AI can be caught as soon as possible, and in most cases even fixed by the AI.
Indeed, developers hate fixing things that the AI "assistant" breaks; there's something particularly irritating about the vendors' claims of Ph.D.-level coding when the LLM runs amok like a wild Elon Musk with a chainsaw! For this reason, I understand and sympathize with developers who say this generation of AI is just a time waster.
The creation of these tests, then, seems to represent overhead and delay, which are the same reasons human developers often don't create automated tests or maintain them as the code evolves. Plus we have to pay tokens for these extra artifacts. However, I contend that, for non-trivial development with AI, they are a must.
It starts innocently enough: the training data of LLMs naturally makes them want to create unit tests. Should you be in a hurry to "just get the code done" and skip the test suites, the LLM will suggest creating them when you least expect it, as the training data always includes them. Then, when you aren't paying attention, it will blurt out some test suites with various random tests that resemble what tests should look like. You might take a look, but because you're in a hurry, you won't go through the dozens (or soon - hundreds) of test cases that are cryptic and tedious to look at.
Once you have a test suite, you'll get frequent suggestions to mock things that would otherwise require integrated testing because of dependencies on other code. If you let the LLM do its thing on its own, chances are the mocks won't work, as there are various subtleties to getting them right. Basically you're fixing bugs in the mocks, and paying for every token!
I've had the displeasure of discovering abominations growing in obscure parts of the code over time as the AI decided it had better solutions to try. Because I hadn't bothered to systematically create the unit tests and mocks, it could be too late to recover from the mess of spaghetti code, and the only thing to do was to delete it!
So I was going to develop serious unit tests and mocks for this project, proactively and according to test plans that would serve as a reminder of what to do if the test suite got corrupted by the AI.
Putting it all together
The full test plan proceeded from an analysis of significant statements that affected execution flow, as well as identification of all external calls that would need to be mocked in the unit tests (but not in integration tests).
From the coverage requirements, the AI has to come up with input values, mock values (if pertinent) when external functions are called, and output expectations that will be converted into test assertions.
The test plan can be reviewed and adjusted either by directing the AI to make changes, or by directly editing.
This plan can be consumed by AI and translated into functions that are easy to relate back to the specs.
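For instance, a scenario from the plan such as "empty markdown input produces no sections" (which the design's empty-input rule implies) might translate into a test whose comments map straight back to the plan's Input values, Mock values and Outcome expectations. The import path below is hypothetical.

```typescript
import { describe, expect, it } from "vitest";
// Hypothetical module path, following the project layout described earlier.
import { parseMarkdown } from "@/markdown-serialization/parse/parse-markdown.js";

describe("parseMarkdown scenarios", () => {
  // Scenario: empty input (from the <module name>-scenarios.md document)
  it("returns an empty array when the markdown text is empty", () => {
    // Input values: an empty string
    const markdownText = "";
    // Mock values: none needed, no external calls on this path
    // Outcome expectations: an empty array of sections
    expect(parseMarkdown(markdownText)).toEqual([]);
  });
});
```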
Hunting Bugs
Once the code and tests are generated, it's time to run them and fix any problems. I used the following procedure (after some trial and error!):
RUN TESTS for the `<process name>` process
TEST dependencies:
In the folder test/markdown-serialization/<process name>, CREATE a document called test-file-dependencies.md.
Create a table in it with the unit test files in that folder, sorted in order of least dependencies to most dependencies.
IMPORTANT: perform the following instructions and DO NOT STOP unless the instruction says to WAIT for approval.
DO NOT use the replace_in_file tool, it frequently fails.
READ the file test-file-dependencies.md
ITERATE through each unit test file DOING:
RUN each test script one at a time, e.g.
RUN pnpm run test <module name>.test.ts
Find the first failed unit test suite.
DELETE the content of the document called <test unit file name>-rca.md. Example: src/myModule/file1.ts => test/myModule/file1.test.ts => test/myModule/file1-rca.md. SKIP THIS IF the document called <test unit file name>-rca.md doesn't exist.
SELECT one (1) test failure ONLY - ignore the other failures for now, we will come back to them later. We need to proceed step-by-step so NO SKIPPING AHEAD!
1. CREATE a Root Cause Analysis with the following:
WRITE the failure information from the test output to the RCA document
READ the FAILED test case from the test file and WRITE its code to the RCA document
READ the Scenario document <module name>-scenarios.md document, FIND the scenario for the FAILED test, and WRITE the scenario into the RCA
EVALUATE the test validity:
- WHAT Input values are used in the test? Does this match the scenario? If not, what needs to be changed?
- WHAT mock values are used in the test? Does this match the scenario? If not, what needs to be changed?
- WHAT sort of assertions are implemented in the test? Are the assertions as coded in line with the scenario's expectations?
DETERMINE if overall, the test is correctly implemented; WRITE the result into the RCA
2. IF there is anything that needs to be changed in the test, then CORRECT the test code and REPEAT THE FULL PROCESS FROM THE START - DO NOT trace the code.
3. IF the test DOES NOT NEED CHANGES, continue with the CODE FIXING PROCESS
CODE FIXING PROCESS: go through the following steps adding each step’s findings to the RCA document
1. Look at the test output especially the Expected and Received values to determine the external description of the problem to know what you are looking for.
2. READ the <root>/trace.log, FIND the logs related to the FAILED TEST ONLY, and WRITE them to the RCA document. ALSO WRITE the line numbers at which the test logs start and the test logs end.
3. Go through the execution line-by-line from the end of the traces to the beginning, comparing the actual values with the ones that are desired. Find the point of divergence where the execution started to diverge from the desired execution flow.
4. Look at the code to see if there is any obvious failure.
5. If there is a call to an external function, then it is normally mocked so look to see if the mock is being done properly and can explain the divergence
6. If it is a mock, make sure the import comes after the mocking setup block at the beginning otherwise the mocking won’t take effect. Make sure to verify that the mock values are correct.
7. If there were already attempts at fixing this error at the same point of divergence earlier in this task, then list the attempts and what each failure tells us
8. Create a list of up to 5 likely reasons that explain the difference between expected and actual values at the point of divergence and why. Which is the simplest explanation and why? List the fix needed.
8.1 Did we try this fix before? If it was already tried, then move on to the next simplest explanation.
8.2 Does the proposed solution conform to the design in design/<process name>-process.md or the examples in markdown-serialization-examples.md? If it doesn't, move on to the next simplest explanation.
9. Make a recommendation and WAIT for my approval
10. If necessary, update the design document and WAIT for my approval
11. Update the code and run the tests again.
12. REPEAT until all tests pass
Originally I used a "vibe testing" approach where I just let the AI decide what order to test in. Basically that meant running all tests and fixing bugs randomly. This felt like it wasn't structured enough, so I started with an analysis of the dependencies, so that bugs could be fixed in the more isolated and foundational functions first, staying on the same file until all its tests passed. This orderly approach is essentially how I would do it myself.
After running the test, the procedure instructs the AI on steps to diagnose the problem and try to find a fix. Again, "vibe debugging" by just asking it to fix the bugs is an easy trap to fall into, but except for the simplest and most obvious of bugs, the LLM "au naturel" doesn't actually have a methodology and seems to be guessing most of the time, like a lazy intern.
First thing - I have it start an RCA file, as a means of breaking down the investigation into steps it can follow. Once the file is complete, the AI can then "reason" over the full set of findings and not just the last part.
We start with the test failure from the test output on the command line:
The test case is retrieved from the test suite code and added:
Based on the test, we retrieve the test scenario from the test plan for any additional information:
Sometimes tests are incorrectly generated or altered by the AI, so before going too far, key parts of the test are validated against the test plan:
If the test is invalid, a fix is proposed for the test suite and the investigation ends.
If valid, we need to assume there's something wrong with the code, so we review the available information to identify what's going on. The log file is inspected, and given the wealth of information about what steps were executed and what the intermediate values were, this is often sufficient to identify the nature of the problem.
Based on this analysis, it's time to look at the code and try to locate the defect. The AI is instructed to go through the code line-by-line and identify the point where the expected log differs from the actual log (the "point of divergence"). Of course the AI is free to use its "vibe debugging" as well, but I found that by going through these steps it tends to zero in on what's wrong a majority of the time.
Having found where the problem occurs, the AI has to determine why it is failing. Often, the reason provided is ludicrous, as the AI tends to repeat things it has already seen that are not completely impossible, but highly unlikely.
This gave me the idea of asking it to come up with a list of multiple possible explanations and sorting them by likelihood. I found that often, amid the silly, obviously improbable (to a human) causes, was the real reason, and this sorting allowed it to pick a good solution most of the time.
The AI has a lot of inputs to assimilate for the diagnostics task:
In the following example, the AI comes up with a variety of explanations (including an unlikely one that speculates the test reflects an older version of the code!). It correctly identifies that the test plan was wrong because it doesn't match the design:
Without this process, as this last case shows, the AI might go off in the wrong direction, and soon the tests and code would no longer reflect the intended design.
On the contrary, when a human who knows programming gives detailed instructions, or when the AI is given a process to come up with something clear and pertinent, it is quite capable of creating an appropriate fix.
Squashing the Bug
Having found the problem and identified a solution, the AI applies a fix for it. This can be a code change, a test change, or even a test plan or design change.
It's necessary to pay attention to what is being proposed - even though it can be tempting to just accept changes blindly. There are a lot of reports that AI is reducing the IQ of programmers; however, I can confirm it's just a sign that AI reveals the laziness of programmers - which is why they got into programming in the first place: they hate doing the same thing over and over again, so they write programs so they don't have to! But it can be tempting to watch some Netflix while just pressing "OK" from time to time without really checking what the AI is doing.
For My Last Trick
Once all unit tests pass, that means each individual function works as defined. However, the mocking ensured that we didn't get tripped up by issues in other places. Therefore we don't really know if the code works for real even if all the unit tests pass.
What we need now is to run integration tests, in which we do no mocking and verify that the whole module actually does what it should - for real!
I use the following prompt to create the integration tests:
# Integration Test Creation Task
Create integration tests for a given process that verify the same test cases as the unit tests, but without any mocking
of the project functions, only mock system packages e.g. file i/o, networking etc.
## Steps
1. List Existing Tests
```bash
# List all unit test files in the process directory
ls test/markdown-serialization/<process name>/*.test.ts
```
2. Create Integration Tests
- For each unit test file found:
```
test/markdown-serialization/<process name>/integration/<component>.test.ts
```
3. Test Structure
```typescript
import { describe, expect } from "vitest";
import { it } from "@test/test-utils.js";
import { functionName } from "@/markdown-serialization/<process name>/<component>.js";
describe("<component> Integration", () => {
// Copy test cases from unit test
// Remove all mocking code for any project functions but keep mocking of system functions
// Use real function implementations
});
```
## Key Requirements
1. Test Coverage
- Review each unit test file
- Identify all test scenarios
- Create corresponding integration tests
- Maintain same coverage without mocks
2. Real Implementation
- No vi.mock() calls
- No mock functions
- No overrideFn() usage
- Use actual function implementations
3. Test Cases
- Same input data as unit tests
- Same scenarios as unit tests
- Updated expectations for real output
- Full error case coverage
4. Dependencies
- Use real component interactions
- Test complete processing chain
- Verify actual error handling
- Test edge cases with real functions
The goal is to verify that components work together correctly in a production-like environment while maintaining the same level of test coverage as the unit tests.
I had a more primitive version to start with, then asked Claude to generate an improved prompt because I was getting some unwanted deviations!
The main idea is that, now that the unit tests are passing, we can assume the design and test plans are good, so we leverage the test cases we had previously created but just eliminate the mocking so that all the real code gets run. In some cases it can be impossible to run a particular scenario due to lack of control over external services that are not part of the project, and in that case we can bend the rules and mock those external services, e.g. an external API call or a database call.
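As a sketch of what that bent rule can look like in practice: only the system dependency (file I/O here) is mocked, while every project function runs for real. The entry point and module paths are hypothetical.

```typescript
import { describe, expect, it, vi } from "vitest";

// Only the Node file system module is mocked; all project code runs unmocked.
vi.mock("node:fs/promises", () => ({
  readFile: vi.fn(async () => "# Title\n\nSome content\n"),
}));

// Hypothetical entry point that reads a file and runs the real parsing chain.
const { parseMarkdownFile } = await import(
  "@/markdown-serialization/parse/parse-markdown-file.js"
);

describe("parse process (integration)", () => {
  it("parses a markdown file end-to-end using the real parsing functions", async () => {
    const sections = await parseMarkdownFile("/any/path.md");
    // Outcome checked against real behavior rather than mock values.
    expect(sections.length).toBeGreaterThan(0);
  });
});
```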
Once generated, another prompt is used to run them, investigate problems with the RCA procedure, and generate fixes.
Next time
In the next episode I'll describe how this process went with various models - stay tuned!
Martin Béchard enjoys cooking up new projects with AI. If you need to spice up your development with some AI Coding, please reach out at [email protected]!