Notes on "A Formal Analysis of Iterated TDD"

In recent conversations about Test-Driven Development (TDD), I was shown a preprint paper that has been submitted to IEEE Transactions on Software Engineering. One of the paper's authors encouraged me to post my thoughts, which I've done here.

Depending on who you ask, the paper's conclusion either aims to refute any usefulness of TDD or, quite the opposite, to confirm proponents' longstanding advice to use the TDD cycle as design feedback.

While I found the arguments unconvincing, I support attempts to better codify our understanding of Software Engineering and hope my explanations will help these authors and others in that pursuit.

Disclosing my biases

I've been a professional software developer for 17 years and for the first 7 years worked on teams strictly practicing Extreme Programming (XP). This included TDD and pairing, and I consider that to be the most consistently productive and high-quality environment I've seen. I am thus naturally predisposed towards TDD's effectiveness.

However, I am also aware of the difficulties of driving adoption in the variety of contexts I'm interested in. In my understanding, the academic literature on TDD as a stand-alone practice shows positive but inconsistent results, not what we'd expect given the dramatic results of some case studies of full XP. I'd be open to the hypothesis that TDD proponents' positive experience is mostly due to some mix of Continuous Integration, design training, rigorous automated testing, and disciplined refactoring.

In short, I suspect the inconsistent "hard data" on TDD is attributable to one of these and I'm not sure which:

  • The inherent difficulty of isolating Software Engineering practices in empirical studies
  • TDD's association with numerous related behaviors that bring value on their own

The paper

This PDF resembles an academic paper but lacks some of the crucial substance of one. I expect that if the submitted journal accepts it, there will need to be extensive changes.

The most immediately glaring issue is the failure to position the work in the context of the literature.

  • Nearly every source cited is a Wikipedia entry
  • Areas with extensive existing research are discussed without acknowledgment of that work
  • On at least one occasion an absence of existing literature is falsely asserted

In a research paper, the majority of citations are typically primary research that supports what's being said. A glance through some recent papers from the submitted journal shows examples of that. Wikipedia itself has an article called Citing Wikipedia, which says:

Normal academic usage of Wikipedia is for getting the general facts of a problem and to gather keywords, references and bibliographical pointers, but not as a source in itself
[Image: the paper's reference list, page 1 of 2]

Of the 39 citations, only 2 were research papers about software, and their findings were glossed over in an aside section, "4.7. A Perspective on Popular 'Business Specification based' TDD".

To be fair, those seem like solid citations; they are literature reviews on the central topic of the paper. An easy point of improvement would be to discuss them more thoroughly, probably in a literature review in the introduction, as recommended by IEEE's How To Structure Your Paper.

A random message board vendetta

Bizarrely, despite the brief inclusion of prior work on TDD effectiveness, the absence of any such investigation is confidently proclaimed in the next section.

The proponents of TDD, or “industry best practices” stopped asking “is this effective or provable” a long time ago.

How can it be that people "stopped asking" a long time ago when a citation just given on it was from 2020? That's fairly recent and of course wasn't the last. Or do they mean that specifically "TDD proponents" don't engage with what the researchers put out? Fair to say devs generally aren't connected enough with active research, but I'm making an effort and I'm really not that unique. Self-described "TDD person" Hillel Wayne even gave a talk called What We Know We Don't Know: Empirical Software Engineering. So maybe not everyone stopped asking.

The evidence given that they stopped is a quote from an unreferenced social media comment from an unnamed person.

[quoting] "You want to debate seriously? Then you have to drop the ridiculous sense that 'Good Practices' require scientific evidence before they can be realized to work - which would disprove much of the 'Good Practices' which are 'successfully used' in the industry."
Even if we ignore the irony of the previous quote, one but just wonder if evidently Software had become entirely cargo cult, the above quote proves it beyond doubt.

While out of context it does appear to be an unhelpful comment, it's not appropriate to vent about a social media argument in a journal paper, let alone represent it as the position of all TDD proponents.

I cannot stress enough how much it undermines the effort the authors have spent on their mathematical modeling to frame it with such flippant treatment of sources.

Methodology and definitions

I was intrigued by the peculiar style of mathematical modeling in the paper; I'm not aware of similar methods being used to draw these kinds of conclusions about software practices. If that kind of modeling could draw conclusions without studying any codebases, let alone teams with their psychological and sociological behaviors over time, it would be incredibly useful. Unfortunately, no past research along those lines was cited. If the method is original and it works, perhaps it should be the subject of its own paper so it can be properly defended and compared to other modeling techniques.

There is no code in this paper

There appears to be only one actual program even mentioned (cat.c), and it isn't related to TDD. Example code for the various scenarios discussed would have been very helpful, either inline or as a companion resource. One of the paper's authors advised exactly that in an interesting piece called Re-Thinking Software As Engineering, saying:

If some manual says : 'this is recommended practice' - source code needs to be seen to find out if the practice makes any sense or it is a clever ploy to discourage people into finding issues in the software construction itself

Redefining the goal

The goals of TDD are stated to include:

The system is provably complete and correct, by construction [...] This is the real superpower of TDD, formal verification baked into development

That is a mischaracterization. TDD is not formal verification. It aims for high confidence, which may sound similar in casual speech, but formally speaking these are miles apart. This likely undermines some of the derivations that follow.

A convenient explanation of the goals of TDD is the Preface and Introduction of "Test-Driven Development by Example" by Kent Beck, which are visible in Amazon's preview.

Tests with 100% branch coverage (as careful TDD might produce) give high confidence but do not assure a complete absence of bugs. Those remaining gaps are a key use case for Mutation Testing and Formal Verification.
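
As a rough illustration (hypothetical code of my own, not from the paper or from Beck), here are two tests that achieve 100% branch coverage on a tiny function yet miss a boundary bug; a mutation tool that flips the comparison operator would report a surviving mutant and expose the gap.

    # Hypothetical example: full branch coverage without full correctness.
    def free_shipping(total):
        # Intended rule: orders of 50 or more ship free, but the
        # implementation mistakenly uses a strict comparison.
        if total > 50:
            return True
        return False

    def test_large_order_ships_free():
        assert free_shipping(100) is True   # covers the True branch

    def test_small_order_pays_shipping():
        assert free_shipping(10) is False   # covers the False branch

    # Both branches are covered and both tests pass, yet free_shipping(50)
    # is wrong. A mutation testing tool that mutates `>` to `>=` would see
    # both tests still pass (a surviving mutant), flagging the weak spot.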

Non-standard notion of coupling

Coupling is central to the argument and is defined by shared execution path between equivalence classes of a single module. "Equivalence classes" here roughly refers to the set of test cases required to be an adequate substitute for testing all possible inputs.

coupling said to exists between EQCPs [equivalence classes] Ex with path Px and Ey with path Py if and only if Px ∩ Py ≠ ∅.

The measure is then given as the Jaccard coefficient, the ratio of the size of intersection of the execution path to the size of the union.

| Px ∩ Py | / | Px ∪ Py |
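
To make the measure concrete, here is a minimal sketch (my own hypothetical code, not from the paper) that treats each path as a set of covered branch identifiers and computes that ratio:

    # Hypothetical sketch of the paper's overlap measure: each path is
    # modeled as the set of branches an equivalence class's test exercises.
    def path_overlap(px, py):
        px, py = set(px), set(py)
        union = px | py
        if not union:
            return 0.0
        return len(px & py) / len(union)  # |Px ∩ Py| / |Px ∪ Py|

    # Paths that mostly overlap score near 1; disjoint paths score 0.
    print(path_overlap({"b1", "b2", "b3"}, {"b2", "b3", "b4"}))  # 0.5
    print(path_overlap({"b1"}, {"b2"}))                          # 0.0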

That's not what Coupling is. Sure, it's an idea of two tests being "coupled" in the sense of having overlapping execution paths, but when we talk about Coupling as a design principle, it means interdependence between modules, not between tests of the same module. These two meanings seem to be used interchangeably in the paper.

(Rodolfo Hansen has offered an interpretation of the correspondence, which I've included in the comments.)

Clearer terms that come to mind are Intertest Coupling or Test Execution Overlap, but a literature survey might find something more standard. For the paper to cite the Wikipedia entry on Coupling while stating a different definition by the same name is confusing and likely to cause people to misunderstand the claims.

Assuming a single module

Dynamics of exponential growth as program size increases are repeatedly discussed, for example:

a simple program unix cat has more than 60 branches. The equivalent class specification of this program is bounded by 2^60 and the total stars in the universe are estimated to be 2 × 10^24 for comparison

It's true that the number of paths through a program grows exponentially with branches. Metrics like Cyclomatic Complexity (Thomas McCabe) have been available to track this for nearly 50 years and the impact on testing is well known.

TDD practitioners address these sorts of issues in the usual way: through modularization. Test small pieces. This continually limits the complexity of the units under test and avoids the predicted effects of branch growth.
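
As a back-of-the-envelope sketch of why that works (my own numbers, not the paper's): if a routine composes k independent two-way decisions, the number of end-to-end paths is 2^k, but once each decision is extracted into its own small unit, branch coverage needs only on the order of two tests per unit.

    # Hypothetical illustration: end-to-end paths grow exponentially with
    # independent two-way decisions, while per-unit branch-coverage tests
    # grow linearly once the decisions live in separate small units.
    def path_counts(k):
        end_to_end_paths = 2 ** k   # testing the composed routine as one piece
        per_unit_tests = 2 * k      # roughly 2 branch tests per extracted unit
        return end_to_end_paths, per_unit_tests

    for k in (3, 10, 60):
        paths, unit_tests = path_counts(k)
        print(f"{k} decisions: {paths} end-to-end paths vs ~{unit_tests} unit tests")
    # 60 decisions: 1152921504606846976 end-to-end paths vs ~120 unit tests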

Generally, it's hard to evaluate the modeled predictions when they don't seem to correspond to TDD as practiced and when they hold it to a criterion (Formal Verification) that neither TDD nor the mentioned alternatives claim to meet.

Other Oddities

Laws of Software Engineering

It seems in the presence of coupling, we can either choose formal correctness or choose code churn stability, not both. This insight is unheard of, but the theory points us in this direction. If the chaotic thesis is correct, this is to be taken as a foundational law of Software Engineering [emphasis mine]
While this demonstrates why coupling is a problem, however, this is [a] much stronger thesis, this [is] tantamount to any shared code is a problem if the code supposed to change later.

Are we just going to drop a "foundational law of software engineering" as a small side-note here? What does that mean? What are the other laws?

If it's meant to be that important maybe it should be the focus of the whole paper. Instead of "Formal Analysis of Iterated TDD" it could be something like "Towards the Formalization of Software Engineering Using Chaos Theory". And that paper could outline what it means to have a formal theory of Software Engineering, the goals of doing so, and how this compares to other approaches.

Incompleteness Theorem

The correctness of TDD for a practical application hinges on the following: (1) Is the specification complete enough (to take care of all the equivalent classes)? (2) Is the specification non-contradictory ?
That it is impossible to get (1,2) done together follows from Gödel's Incompleteness theorems [...]

Gödel's Incompleteness Theorems concern limits of formal axiomatic systems, not software test cases. It doesn't seem to apply here, and it's not explained why it would. The result then isn't used because it's not specific to TDD, so it's not clear why the section is included.

Some fair points

There are also some ideas mentioned that do seem reasonable to discuss as TDD critique and caveats.

Design sold-separately

Note that the methodology does not specify how to implement the paths of each equivalent classes in the code. Hence evidently there is no way it can ever improve on the 'non correct aspect of quality' of software, one of them would be to lower coupling.

This is mentioned in a few different ways. It's true that the TDD process says to refactor but doesn't specifically say how. The process of building usage examples as you go exerts design pressure, but I'd agree you need design skill too.

Rework

There is a heavy focus on "churn", code modified after it's first written. It's even said to be subject to a combinatorial explosion, based on equations of theoretical behavior. A claim that significant ought to be grounded in past observations, so concrete examples would have helped. It's even asserted that "This is also seen in reality", with no citation.

Still, most people would likely agree churn does create some amount of annoyance in practice. In TDD, you do have to update old tests as the design changes. The idea of strategically minimizing churn is worthwhile, though not new. For anyone struggling with this problem, I'd suggest trying IDE-assisted refactoring, Kent Beck's Test Desiderata (especially Structure-insensitive), and Justin Searls' talk How To Stop Hating Your Tests.
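
To illustrate what structure-insensitive means here (a hypothetical sketch of my own, not from Beck or Searls): the first test below pins how the code works internally and churns whenever that changes; the second pins only the observable result and survives refactoring.

    # Hypothetical contrast between a structure-sensitive test and a
    # structure-insensitive one for the same behavior.
    from unittest.mock import MagicMock

    def apply_discount(price_cents, is_member,
                       lookup_rate=lambda member: 10 if member else 0):
        rate = lookup_rate(is_member)
        return price_cents - price_cents * rate // 100

    def test_structure_sensitive():
        lookup = MagicMock(return_value=10)
        apply_discount(2000, True, lookup)
        # Pins *how* the result is computed; churns if the lookup is
        # inlined, renamed, or replaced.
        lookup.assert_called_once_with(True)

    def test_structure_insensitive():
        # Pins *what* the code does; survives internal refactoring.
        assert apply_discount(2000, True) == 1800
        assert apply_discount(2000, False) == 2000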

Only for unit tests?

Hence we propose iterated TDD is to be done at the Unit Testing level only

That could make sense. For what I'd call the unit level, that's roughly what I do, not to speak for others. The TDD flow leaves a foundation of unit tests, and gaps are filled separately, perhaps with a smaller number of wider-scoped tests forming a Test Pyramid.

To flesh that out: what exactly does that look like? With no examples of what they consider the unit level, it's unclear what to make of the proposal.

Pushing for empiricism

It's true that Software Engineering (not exclusively TDD) would be well served by bringing more disciplined observation into our decision making. We often end up led by experienced people presenting their gut feel as fact. That's a great thing to want to help fix, and we must be careful not to fall victim to it in the process.


Ray Myers

Tech Lead | Mender | Untangler

3 months

Hi Hemil Ruparel and Nabarun Mondal. I think that after you hear back from the referees these notes might be useful for you in revisions or future work. I also had a good conversation with Rodolfo Hansen and was intrigued that he largely agrees with your findings (or his interpretation of them) yet describes the implications very differently than how I've seen Hemil summarize them. I have an idea of what might be going on there: your notion of "Guided TDD" could correspond to what some practitioners call "design pressure" or "synergy between testability and good design", where code becoming more difficult to test is used as a signal to inform iterative design. At first glance that could put a different narrative to the same mathematics; I'm sure we'd both be able to provide more information if it seemed interesting to explore. Best of luck.

Ray Myers

Tech Lead | Mender | Untangler

3 months

Kicking myself for not calling this "An Informal Analysis of 'A Formal Analysis of ...'"

Ray Myers

Tech Lead | Mender | Untangler

3 months

Rodolfo Hansen has offered this interpretation of the Coupling definition. Thanks! "The paths Px and Py are paths in the production code. So, they attempt to rigorously define inter module dependencies. as in: Pi calls foo() in module A, from bar() in module B. Pj calls foo() in module A, from baz() in module C. Thus Pi and Pj are coupled. This attempts to fit the standard definition of module coupling. This is a valid definition of coupling in assembly language, or even earlier versions of the Go programming language. But fails to capture the loose coupling achieved by techniques towards solving the expression problem in more 'expressive' languages."

Dr. Michael Köpf

We teach your team DevOps, stress-free and scientifically grounded | 2-day DevOps workshop for stable releases and efficient workflows | Proven with 50+ teams

3 months

For a more serious scientific investigation of TDD I recommend the papers by Davide Fucci - Warning: They are certainly not a glorification of TDD, but I wouldn’t expect that from any serious research. They also show the difference between textbook TDD (red-green-refactor) and what people actually do.
