The Worst or Most Difficult Bug I’ve Encountered


Someone asked me what the worst bug I have encountered and solved was.

In a widely used application I worked on, we found that 3 out of every 1000 transactions (0.3% of the time) were recorded with another customer's name. Our team suspected it had something to do with concurrency, but the curious thing was that it had nothing to do with:

  • The time.
  • The region of the country.
  • Matching data between customers.
  • Common data among users registering the information: office, manager, etc.
  • Simultaneous or nearly simultaneous transactions: the affected records could have been entered hours apart.


Due to the low frequency, only a few hours were allocated for investigation, during which no solution was found. For almost a year, we only occasionally debated the strange phenomenon and its possible causes until one day, I found the solution while trying to fix something else.

The error was caused by a combination of anti-patterns (the opposite of design patterns) and other human errors during development in a class of the project that:

  • Violated SOLID's Single Responsibility Principle (SRP), summarized as "one class, one task." In this case, the class exposed two services simultaneously.
  • Exhibited the "Variable Shadowing" code smell (reusing variable names in different scopes), which generally only confuses developers reading the code. Combined with the previous item, it produced the bug. See "Do not reuse variable names in sub-scopes."
  • Had more than 1000 lines, which made the problem difficult to spot and made the class an excellent example of the "God Class" anti-pattern. However, static code analysis with SonarQube had reported the code smell as soon as it was introduced, and everyone on the team (including myself) treated it as an aesthetic suggestion.
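The article does not show the original code or name its language, so the following is only a minimal Python sketch of how a shadowed name can leave stale data behind. The names `customer_name`, `register`, and `record` are hypothetical and chosen for illustration.

```python
# Hypothetical illustration: a module-level ("global") variable that two
# functions are meant to share.
customer_name = "Alice"


def register(new_name):
    # This assignment creates a NEW local variable that shadows the global
    # of the same name; the global itself is never updated.
    customer_name = new_name
    return customer_name


def record():
    # Reads the global, which still holds the previous customer.
    return customer_name


registered = register("Bob")  # the local copy holds "Bob"
recorded = record()           # the global still holds "Alice"
```

Here `registered` and `recorded` disagree even though the code reads as if both refer to the same variable, which is the kind of confusion a shadowing warning is meant to flag.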


Explanation


After finding the problem and its solution, we could understand exactly why the situation occurred. It was a combination of system behavior and an aspect of user behavior that we had been unaware of:

  • To ensure application availability, a horizontal scalability strategy is applied. The system generally has one instance and three threads, increasing to nine during peak hours.
  • Users usually processed a client quickly through the entire application flow. Thus, when the flow called the second service of the God Class, the same instance that had executed the first service responded.
  • However, the business dynamics sometimes caused a pause in the application flow while the user and client negotiated or clarified terms. When the flow resumed, it could be handled by a different thread with another client's data in the global variable.
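The behavior described above can be reproduced deterministically without real threads. The sketch below assumes each worker keeps the customer in a field between the two service calls; the class `Worker`, its method names, and the sample data are hypothetical and not taken from the original application.

```python
class Worker:
    """Hypothetical stand-in for one thread serving the two-service class."""

    def __init__(self):
        # State kept between the first and the second service call.
        self.current_customer = None

    def first_service(self, customer):
        self.current_customer = customer

    def second_service(self, amount):
        # Trusts whatever this worker happens to remember.
        return {"customer": self.current_customer, "amount": amount}


worker_a, worker_b = Worker(), Worker()

# Fast path: both calls land on the same worker, so the record is correct.
worker_a.first_service("Alice")
fast = worker_a.second_service(100)

# Slow path: Bob was served by worker_b at some earlier point. Alice starts
# on worker_a, the flow pauses, and the resumed call is routed to worker_b.
worker_b.first_service("Bob")
worker_a.first_service("Alice")
slow = worker_b.second_service(100)  # Alice's transaction is recorded as Bob's
```

In this model the gap between Bob's and Alice's entries can be arbitrarily long, which is consistent with the observation that the crossed records were not simultaneous.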


Recognizing that the class exposed two services simultaneously, and trying to intervene as little as possible while providing a quick solution, I eliminated the class's global variables and modified all involved methods and objects to communicate values via parameters.
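A minimal sketch of this kind of fix, assuming the values that used to live in shared variables are bundled into an object the caller carries between the two services. `FlowContext` and `StatelessWorker` are hypothetical names, not the original classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowContext:
    """Hypothetical value object carrying what the shared variables held."""
    customer: str


class StatelessWorker:
    """Keeps no state between calls; everything arrives via parameters."""

    def first_service(self, customer: str) -> FlowContext:
        return FlowContext(customer=customer)

    def second_service(self, context: FlowContext, amount: int) -> dict:
        return {"customer": context.customer, "amount": amount}


worker_a, worker_b = StatelessWorker(), StatelessWorker()

alice = worker_a.first_service("Alice")
bob = worker_b.first_service("Bob")

# Even if the resumed flow is routed to a different worker, the record is
# built from the context the caller passes in, not from worker memory.
record = worker_b.second_service(alice, 100)
```

Because the workers hold nothing between calls, it no longer matters which one handles the second step.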


Broken Windows Theory?


Shortly before finding the solution, since it was impossible to reproduce the situation in a test environment, I decided to place logging points where I suspected data crossing could occur and to throw an exception if swapped data arrived at the end of the flow. At that point, we realized that the case did not happen 3 times per 1000 transactions (0.3% of the time) but about 30 times per 200 (15%). It was quite common, and users had simply learned to live with the problem: they quickly passed the client through the flow again so that the entire flow was handled by the same execution thread.
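A guard of this kind might look like the sketch below. It assumes the flow can compare the customer identifier it started with against the one about to be recorded; the function `assert_same_customer`, the exception `CustomerMismatchError`, and the logger name are hypothetical.

```python
import logging

logger = logging.getLogger("flow_guard")  # hypothetical logger name


class CustomerMismatchError(RuntimeError):
    """Raised when the end of the flow sees a different customer."""


def assert_same_customer(expected_id, actual_id):
    # Log the crossing so its real frequency becomes visible, then refuse
    # to record the transaction under the wrong customer.
    if expected_id != actual_id:
        logger.error(
            "Customer data crossed: expected %s, got %s", expected_id, actual_id
        )
        raise CustomerMismatchError(
            f"expected {expected_id!r}, got {actual_id!r}"
        )
```

Failing loudly at the end of the flow turns a silent data error into a counted, visible event, which is what exposed the real frequency in this case.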


This situation reflects the Broken Windows Theory in the context of software products: not fixing defects (broken windows) quickly leads to undesired behaviors in society (users), such as vandalism and apathy, with the latter being the most detrimental to software. Our users did not report this very common situation because they did not believe we would take them seriously (apathy); we never prioritized it because we thought the situation occurred very few times a month.


Conclusion

In retrospect, two actions can be taken in this situation, one preventive and the other corrective:

  • Take static code analysis recommendations very seriously, implementing a culture of 100% indicators within the definition of done for each feature.
  • Prioritize defects as soon as they are reported, regardless of their frequency. If the solution is not found, consider setting up error logging points or even throwing exceptions to prevent the situation from occurring entirely.

#bug #code-smells #concurrencia #scalability #experience #thread #instance #issue #multi-thread #design-pattern #anti-pattern #services #solid #srp

More articles by Alex Andrade, M.Eng., Master QA Automation Engineer
