Unable To Reproduce
“Seeing a spider is not a problem. But it becomes a problem when it disappears.”
“I can only fix what I can reproduce, and I should be able to reproduce any real issue.” That is the mindset many engineers adopt when they triage bug reports. On one hand, it’s hard to fault the approach: ruthless prioritization favors issues that affect more users and are easier to reproduce. On the other hand, it creates a false equivalence between ease of reproduction and severity, which sometimes leads to important bugs being ill-advisedly deprioritized.
Show Empathy. A product leader at LinkedIn recently confided in me how helpless it made him feel to report a seemingly severe bug only to see it shelved just because the on-call engineer could not reproduce it. I can easily empathize with him because I would be equally frustrated if I kept experiencing the same issue with no remedy in sight. To begin with, it is always important to openly acknowledge that the reporter is getting a less-than-ideal experience because of a bug we introduced, albeit one that has been difficult to reproduce... so far. An intermittent issue does not make the reporter any less credible or the engineers any less culpable. As the builders and owners of the experience, we are completely at fault here and should take full responsibility for this suboptimal experience, which is only compounded by our inability to offer the reporter immediate relief or, at the very least, reason for hope.
Be Transparent. As part of our triage, we should overcommunicate every step taken to reproduce the bug. This serves two purposes. First, it gives reporters confidence that their report was taken seriously and that sufficient effort went into addressing their concerns. Second, it leaves a valuable audit trail of the work that has already gone into reproducing the bug, so we can be methodical in our triage should the bug reappear or should we decide to come back to it later. Ideally, this documentation also includes any observations from metrics and other signals that help capture the severity and impact of the observed behavior.
Be Systematic. As product owners, we should be systematic in our reproduction efforts because a random check to find a random bug is only going to lead to more random outcomes. For instance, have multiple people try to reproduce the bug, because one person’s biases and usage patterns may keep the bug from surfacing. Or narrow down the possibilities through a process of elimination instead of randomly trying different paths to the faulty behavior.
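The process of elimination mentioned above can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the factor names and the `bug_appears` check are hypothetical, exactly one factor is assumed to trigger the bug, and the check is assumed to give the same answer every time. Truly intermittent bugs break that last assumption, so each check would need to be repeated several times in practice.

```python
from typing import Callable, List, Sequence


def narrow_down(
    factors: Sequence[str],
    bug_appears: Callable[[List[str]], bool],
) -> str:
    """Halve the list of suspect factors until a single one remains.

    Assumes exactly one factor triggers the bug and that bug_appears()
    is deterministic (both are simplifying assumptions).
    """
    suspects = list(factors)
    while len(suspects) > 1:
        first_half = suspects[: len(suspects) // 2]
        if bug_appears(first_half):
            # The culprit is among the factors we just enabled.
            suspects = first_half
        else:
            # The culprit must be in the half we left out.
            suspects = suspects[len(first_half):]
    return suspects[0]


# Hypothetical factors collected from a bug report.
factors = ["dark mode", "slow network", "non-English locale",
           "ad blocker", "stale cache"]

# Stand-in for a manual or automated reproduction attempt.
culprit = narrow_down(factors, lambda enabled: "stale cache" in enabled)
print(culprit)
```

Halving the suspects takes roughly log2(n) reproduction attempts instead of n, which matters when each attempt is slow or manual.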
Look for Correlation. Lastly, we should keep a careful eye on the issues that we are unable to reproduce and look for any suspicious patterns. In isolation, these incidents may seem like unusual corner cases, but taken together, they may reveal something far more troubling about how the system was architected. Often, issues that are difficult to reproduce are manifestations of erratic side effects born of the software’s failure to behave systematically. And much like the sneaky spider, these bugs can continue to wreak havoc under the cover of non-determinism.
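One way to look for such patterns is to count which attributes recur across the reports that could not be reproduced. The sketch below is a minimal illustration, not a prescribed method: the report fields (`platform`, `locale`, `network`) and the 75% threshold are made-up assumptions, and a shared attribute is only a hint worth investigating, not proof of a cause.

```python
from collections import Counter
from typing import Dict, List, Tuple


def shared_attributes(
    reports: List[Dict[str, str]],
    min_share: float = 0.5,
) -> List[Tuple[str, str]]:
    """Return (attribute, value) pairs seen in at least min_share of reports."""
    counts: Counter = Counter()
    for report in reports:
        # Each report contributes every (attribute, value) pair once.
        counts.update(set(report.items()))
    threshold = min_share * len(reports)
    return sorted(pair for pair, n in counts.items() if n >= threshold)


# Hypothetical "unable to reproduce" reports.
reports = [
    {"platform": "android", "locale": "de", "network": "wifi"},
    {"platform": "ios", "locale": "de", "network": "cellular"},
    {"platform": "android", "locale": "de", "network": "cellular"},
    {"platform": "web", "locale": "en", "network": "wifi"},
]

common = shared_attributes(reports, min_share=0.75)
print(common)
```

Here only the German locale shows up in at least three of the four reports, which would make locale handling a reasonable first hypothesis to test.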
These issues can often be irritating for both reporter and owner, and they are just as inevitable as they are unwanted. So how do you and your organization strike the right balance between responsible product ownership, ruthless prioritization and empathy for your bug reporters, especially when these issues persist?
I want to thank Pete Davies for providing the inspiration for this topic, being generous with his time and giving me helpful editorial feedback. I also want to thank Lee Mallabone for his help in completing this piece.
Expert at helping you turn your ideas into innovative, reliable, profitable & manufacturable electronics and software products
4 years ago: I’ve seen a situation where, even given the exact steps to reproduce a bug, there was huge reluctance to reproduce the issue. In this case there was a strong “them vs. us” culture, with the product designed at one site and tested at another.
Developer
5 years ago: "Unable to reproduce" is fine as a status for a bug so long as it doesn't end up in the pile with resolved reports. At sites I belong to where members are stakeholders, the bug tracking system is publicly visible, so your marking a bug as unable to reproduce does not prevent me or another user from trying to reproduce it. More info is good. I am just another average person at many sites, and the dev crews make up not even 1% of their total users, so it's more likely that someone else will see that [unable to reproduce] label and has either already come across the bug or can reproduce it. That may bring forward a viable solution from a dev who otherwise can't solve a problem they can't identify. It works for them because that place cares more about finding and fixing all the bugs than about people knowing how many bugs something shipped with. We as the community around these places benefit from it, and we like them that much more for it. We spend our time helping improve their product and increasing the value of their company, and in direct return we get a better product and a more stable ecosystem. It's symbiotic, and a public/UG bug tracker mostly builds trust and brings faster results in a two-way fashion.
Staff Engineer at LinkedIn
5 years ago: Great article, Bef. Nowadays software is so complex that there are millions of use cases that need to be tested. Some issues are highly reproducible, but for others the probability is so low that they seem like ghosts, and reproducing them requires many attempts. This reminds me of the experiments Henry Cavendish performed to find the value of the universal gravitational constant. Reproducing a low-probability issue is an example of great craftsmanship, and every failed attempt gives the developer a lot of experience. So you definitely made a great point.
Engineering Leadership @ LinkedIn
5 years ago: It strikes me that bugs are often hard to reproduce because of the complexity of state within the system. Somewhere, someone will make an argument for purely functional programming.
SVP of Engineering @ Yahoo
5 years ago: Great article, Bef. I’d like to add one more thought to the conversation. A critical skill to develop and practice is forming and testing hypotheses about what’s causing the bug. I think a common fallacy is to focus too much on reproducing from the outside in. As engineers, we shouldn’t treat the software as a black box. We know how it is built, and if we ask ourselves what might cause the bug, we can often get to the root cause without the benefit of perfect repro steps.