Avoiding technical debt disasters
Technical debt is not just duplicated code, large methods and other "code smells" that static code analysis tools can find. It also includes things like bad architecture, unsupported technologies, lack of automated testing, and performance and scalability problems: basically anything that should be fixed now or will likely have to be fixed later, often at a higher cost. The more technical debt a system carries, the more costly it becomes to change, and the more likely you are to see quality issues such as poor reliability, security and performance.
The debt metaphor is used to show that some technical debt can be intentional: you can take a shortcut to gain speed in the short term, but over time you will have to repay that debt or deal with the consequences. However, most technical debt, by far, is unintentional. Usually it comes from things like bad prioritization (“we don’t have time to do this properly”), bad design or implementation (“I don’t know how, or can’t be bothered, to do this properly”) or circumstances that can't be prevented (e.g. new business requirements or technology changes). Because of this, it is tempting to just talk about technical risk rather than technical debt. However, the debt metaphor is useful to underline that the problem will get worse over time if not managed properly.
So how do you avoid development grinding to a halt, or worse, actual disasters due to technical debt? (See Chernobyl and Boeing 737 Max.)
First, a development team can prevent technical debt by making good design choices in terms of technology and architecture, discussing and mitigating risks before implementation, following a common coding standard, and performing pair programming or code reviews.
For technical debt that is not prevented, the development team has to be able to identify it: automatically, using static code analysis tools, and manually, during pair programming or code reviews and as part of other processes like retrospectives and incident reviews.
When technical debt has been identified but not resolved immediately, the development team has to document it in the same backlog as everything else, so that it becomes part of the normal prioritization and planning process. Once documented, the technical debt should be analysed and given a risk score: how likely is this to cause trouble, and what is the impact? (If multiple teams use the same scale, this is a good way to get an overview of the most important technical debt across an organization.)
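The risk-scoring idea above can be sketched in code. This is a minimal illustration in Python; the item fields, the 1–5 scales and the classic likelihood × impact formula are assumptions for the example, not something the article prescribes:

```python
from dataclasses import dataclass

@dataclass
class TechDebtItem:
    """A documented technical-debt item in the backlog (illustrative model)."""
    title: str
    likelihood: int  # 1 (unlikely to cause trouble) .. 5 (almost certain)
    impact: int      # 1 (minor annoyance) .. 5 (severe outage)

    @property
    def risk_score(self) -> int:
        # Risk-matrix style scoring: likelihood times impact.
        return self.likelihood * self.impact

backlog = [
    TechDebtItem("No automated tests around billing", likelihood=4, impact=5),
    TechDebtItem("Deprecated framework version", likelihood=2, impact=3),
]

# Highest risk first: these are the candidates to repay next.
for item in sorted(backlog, key=lambda i: i.risk_score, reverse=True):
    print(f"{item.risk_score:3d}  {item.title}")
```

If multiple teams agree on the same scales, scores like these become comparable across an organization, which is what makes the cross-team overview possible.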
To keep technical debt from spiraling out of control, the development team has to monitor the amount of it over time. They can do that by looking at the sum of risk scores over time (see bottom image), and by looking at metrics in static code analysis tools like SonarQube.
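Monitoring the sum of risk scores over time can be as simple as keeping periodic snapshots and watching the trend. A small sketch, where the snapshot cadence and the numbers are made up for illustration:

```python
from datetime import date

# Snapshots of the summed risk score across the backlog, e.g. taken each sprint.
history = [
    (date(2024, 1, 15), 42),
    (date(2024, 1, 29), 55),
    (date(2024, 2, 12), 61),
]

def trend(history):
    """Change in total risk score between the last two snapshots."""
    if len(history) < 2:
        return 0
    return history[-1][1] - history[-2][1]

if trend(history) > 0:
    print("Total technical-debt risk is growing; consider scheduling more repayment work.")
```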
Significant technical debt should be made visible to stakeholders, especially if it helps ensure time or resources to repay it. The risk scores should be used to determine what should be paid down next.
A common rule of thumb in our industry is to dedicate 20% of the team’s capacity to paying down technical debt, or more if needed. In addition, development teams and their stakeholders should consider agreeing on a technical debt cap: a threshold that, if crossed, means the team will not spend time on anything other than repaying technical debt.
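The 20% baseline and the cap can be combined into a simple planning rule. A sketch under assumed numbers (the cap value, the capacity units and the function name are all hypothetical):

```python
DEBT_CAPACITY_SHARE = 0.20  # the 20% baseline discussed above
DEBT_CAP = 100              # agreed total-risk-score cap for this team (assumed value)

def plan_debt_work(total_risk_score: int, sprint_capacity_points: int) -> int:
    """How many capacity points to reserve for repaying technical debt."""
    if total_risk_score > DEBT_CAP:
        # Cap crossed: the whole sprint goes to repaying technical debt.
        return sprint_capacity_points
    return round(sprint_capacity_points * DEBT_CAPACITY_SHARE)

print(plan_debt_work(total_risk_score=61, sprint_capacity_points=40))   # below cap: 20% of capacity
print(plan_debt_work(total_risk_score=120, sprint_capacity_points=40))  # cap crossed: all capacity
```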
I hope that was useful. Leave your thoughts in the comments.
CTO & Co-founder at Sorsera
During VCDM on-boarding I actually wondered: do you have a real-life example of a reasonably large project (with considerable history) where the team's decisions were actually based on the risk scores? To me, this still sounds a lot like "in theory, ...".

Let's, for instance, take switching to VCDM from an enterprise-centralized model (not sure our previous model is public). The various advantages may be clear, but how would you classify them? How could you compare it to, let's say, a UI or BE technology/language change, or a full API format change? For each of them, you can have one, tens or even hundreds of issues. If they are documented by different people, there's not much usable risk data. The automatic tools are often even worse. The score depends too much on who is creating the tickets... and in what mood he/she/it is.

I prefer a more agile approach, focusing on a single question: what should we do now about tech debt? Instead of monitoring how much debt you have, focus on how much extra work it actually causes, both in the past and in the planned future. Now choose a starting number, say, the same 20% for tech debt, and adjust it regularly based on how much extra work tech debt causes. In addition to reduced management/planning costs, remember the impact of bureaucracy on developer well-being and performance.

BTW, I prefer to describe the single biggest cause of tech debt as "reasonable design... at the time" or simply "historical reasons". Design decisions are affected by everything (i.e. circumstances), but I like this description as it hints at both the cause and the solution. Everything (around and within the product) changes. You either adapt or fail.
Agree with the suggested approach. However, keep in mind that there are many good reasons to take on debt (time to market, testing product-market fit). Technical debt control/management, instead of prevention, would perhaps be a better term.