Debugging
(C) https://www.publicdomainpictures.net/pictures/210000/velka/bug-eyes.jpg

This is typically how I go about debugging a piece of code:

  1. What is wrong?
  2. Reproducing the error
  3. Finding the source of the bug
  4. Showing that the fix works - adding a test

What is wrong?

In modern systems, working out what is wrong might take some work. Any single end-to-end computation might involve multiple services and calls to several storage systems. An error anywhere along this path may be the source of your bug.

A clear bug report can help enormously, as someone will likely state that System A is not working correctly, narrowing your search. However, it's still possible that System B is causing the error. For example, a bug in System B may cause erroneous data to be written to a data store used by System A. This is one reason why debugging multi-service environments is a challenge: the bug might not be in the code on System A's path at all; it might lie in incorrect data that System A reads while executing that path.

You need to be crystal clear on what should be happening and what is happening. What should be happening is the desired system behaviour. What is currently happening is the bug.

Reproducing the error

It's crucial for you to reproduce the error, because:

  1. You need to know that the version of the system you have access to has the reported issue
  2. You will be experimenting with this system to find the source of the bug

If you cannot reproduce the error, the version of the system in which the issue was reported may be older than the version you are using. In this case, one answer may be for the user to upgrade their system. If they are reluctant to upgrade, you may need to experiment with their version of the system.

Finding the source of the bug

You now understand what is wrong and the version of the system that exhibits the error.

Because you understand the difference between what should be happening and what is happening, you know the behavioural gap that describes your bug. To address that bug, you need to find where the incorrect behaviour occurs, so you now need to search for that location.

As we all know, searching for something takes time, because the search is typically a process of informed trial and error.

Trying things and seeing what happens is required because, in systems of any size or age, the person who wrote the code is usually not the person investigating the bug. Even if the investigator is the code's author, they will likely have forgotten how the code works.

In modern, multi-service systems, the code was written not by a single person but by a whole team of people, probably over several years, some of whom no longer work at your organisation. This contributes to the challenge at hand.

Some errors are more obvious, e.g., your user interface says Usernme instead of Username. To find this error, you can simply search your codebase for Usernme, safe in the knowledge that anything you find can be easily corrected. This is at the trivial end of the search spectrum, but not all bugs are created equal.
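For a trivial case like this, the search itself can be a few lines of script. The following is only a sketch: the function name `find_literal` and the list of file suffixes are assumptions made for illustration, and in practice a tool such as grep or your editor's search does the same job.

```python
from pathlib import Path

def find_literal(root, needle, suffixes=(".py", ".html", ".js", ".ts")):
    """Yield (path, line_number, line) for each line under root containing needle."""
    for path in Path(root).rglob("*"):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                yield path, number, line.strip()

# Example usage (the directory name is hypothetical):
# for path, number, line in find_literal("src", "Usernme"):
#     print(f"{path}:{number}: {line}")
```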

Others are more difficult to locate, such as a system displaying only five transactions when you know that 14 should be shown. The cause of this error might be anywhere in your system, the result of executing any end-to-end computation path.

In this case, one place to start would be to look at the user interface code. Is all the transaction data it receives successfully displayed? If so, you might look at the code that returns the list of transactions to confirm that this component of your end-to-end path is working correctly. If it is, you move to the code that generates the list of transactions to check that its logic is correct, and to the code that writes the list to wherever it is returned from. If these are correct, you then move to the code describing each transaction's contents to confirm that 14 should be generated under these circumstances. Maybe 14 is wrong. Is five correct in this case?

If 14 is correct, there is still a bug somewhere: five transactions are being displayed, but the correct number is 14. Something else must account for the error. You then look at other areas, such as how data is exchanged between the user interface code and the service returning the data. This interaction may be terminating too early, as you see fewer results than expected.

As you can see, finding a bug requires asking a series of questions, each of which is a guess at where the bug might be. You investigate to confirm or rule out each guess. If you confirm it, you have found the bug. If you rule it out, you move on to the next question.
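The series of questions above can be sketched as a list of checks, one per stage of the end-to-end path. This is a minimal illustration, not a real tool: the stage names and the probe functions are hypothetical stand-ins for whatever queries or log inspections you would run against each stage, and here the walk goes from the data store towards the user interface, stopping at the first stage where the count is wrong.

```python
from typing import Callable, List, Optional, Tuple

# Each check pairs a question with a probe. A probe returns the number of
# transactions visible at that stage of the end-to-end path.
Check = Tuple[str, Callable[[], int]]

def first_failing_stage(checks: List[Check], expected: int) -> Optional[str]:
    """Return the first stage whose count differs from expected, or None."""
    for question, probe in checks:
        observed = probe()
        print(f"{question}: expected {expected}, observed {observed}")
        if observed != expected:
            return question
    return None

# Hypothetical probes; real ones would query each stage of the system.
checks = [
    ("rows in the data store", lambda: 14),
    ("transactions returned by the service", lambda: 14),
    ("transactions parsed by the user interface", lambda: 5),
]

culprit = first_failing_stage(checks, expected=14)
print(culprit)  # transactions parsed by the user interface
```

Writing the questions down like this also gives you a record of what has already been ruled out, which helps when you hand the investigation to a colleague.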

There might come a time when the quality of your questions dips because you are getting tired or running out of places to look. Take a break or ask a colleague. Or better still, take a break and ask a colleague. When you explain what you have found to a co-worker, the likely location often presents itself, or a good line of enquiry becomes clear.

In modern systems, finding the source of a bug takes work. Every line of code you release is an additional line of code you must exclude when investigating the next bug.

In our multi-service systems, setting up a working end-to-end version to investigate can be challenging, as the environment that is easiest for software engineers to use is their local one (typically a resource-constrained laptop). Giving the engineer additional resources in the cloud can help, as the system under investigation can then be based on actual transactions and not rely on mocking. If you are relying on mocking, you aren't testing the version of the system with the error; you are testing the mock, which isn't what the user who reported the error is running.

As part of your bug investigation, you take a closer look at the payload returned to the user interface code by the remote service. It has header information with the number of contained transactions. The value is 14, which is good as the service returns the correct number of values. However, when you take a closer look at the payload contents, you notice that the syntax is corrupted, so when the user interface iterates the list of transactions, only five are correctly processed, not the full 14.
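To make this concrete, here is a small sketch of that kind of mismatch. The payload format is an assumption invented for the example (a `count=` header line followed by one JSON record per line), as are the function names; the article's real payload format is not specified. The consumer stops at the first malformed record, so a header that promises 14 transactions yields only five.

```python
import json
from typing import List, Tuple

def parse_payload(payload: str) -> Tuple[int, List[dict]]:
    """Return (declared_count, records parsed before the first syntax error)."""
    header, *lines = payload.splitlines()
    declared = int(header.split("=", 1)[1])
    records = []
    for line in lines:
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            break  # mimic a consumer that stops at the first malformed record
    return declared, records

def build_payload(total: int, corrupt_at: int = -1) -> str:
    """Build a hypothetical payload; corrupt_at is the 0-based index of a broken record."""
    lines = [f"count={total}"]
    for i in range(total):
        record = json.dumps({"id": i, "amount": 10 * i})
        if i == corrupt_at:
            record = record[:-1]  # drop the closing brace to corrupt the syntax
        lines.append(record)
    return "\n".join(lines)

declared, records = parse_payload(build_payload(14, corrupt_at=5))
print(declared, len(records))  # 14 5
```

Comparing the declared count with the number of records actually parsed is the check that exposes the problem: the header alone looks healthy.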

You have successfully found the source of the bug.

This investigation may have taken a number of days: setting up an environment in which the investigation can take place, then narrowing down the location of the bug through trial and error and discussion with your colleagues. Setting up the environment may take longer than you hoped, as you might be reliant on others to give you permission to access systems.

Showing that the fix works - adding a test

You now know where the error happens within your system.

There are two things to do:

  1. Add a test to the system to expose the bug
  2. Fix the issue

Point 1 is essential. This kind of bug is hard to find. In a codebase with multiple developers, it is crucial to your business for the system to be self-monitoring, as this increases the quality of your code. If you fix the bug but don't add a test, should the error recur, no one will notice until it negatively impacts your customer. In that case, they raise a new bug report, and you have to expend the effort to investigate again, which disrupts your more strategic work. The knock-on effect of not putting in a test is significant.

Adding a test prevents all of this from happening. If the error is reintroduced (e.g., by a faulty third-party library generating the service payload), your test will fail, alerting you to the issue immediately at build time, before your code is released into production and used by your customer.
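A regression test for this kind of bug might look like the sketch below. It uses Python's standard `unittest` module. The `generate_payload` function and the payload shape (a `count` field plus a list of JSON-encoded records) are hypothetical stand-ins for the real service code and its third-party library; the point is only that the test fails if any declared transaction cannot be parsed.

```python
import json
import unittest

def generate_payload(transactions):
    """Hypothetical stand-in for the service code that builds the payload."""
    return {
        "count": len(transactions),
        "records": [json.dumps(t) for t in transactions],
    }

class PayloadRegressionTest(unittest.TestCase):
    def test_every_declared_transaction_parses(self):
        transactions = [{"id": i, "amount": 10 * i} for i in range(14)]
        payload = generate_payload(transactions)

        # json.loads raises on corrupted syntax, which fails the test.
        parsed = [json.loads(record) for record in payload["records"]]

        self.assertEqual(payload["count"], 14)
        self.assertEqual(len(parsed), payload["count"])
        self.assertEqual(parsed, transactions)
```

Saved in a test module, this can be run as part of the build with `python -m unittest`, so a reintroduced corruption is caught before release.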

You fix the issue by updating the third-party library, and your test shows that the problem has been resolved. You release a new version to production, and all fourteen transactions are displayed for the user, who is now happy again.

Conclusion

Debugging modern, multi-service systems is hard. Such investigations typically require dedicating time to hunting for the issue in an extensive codebase that is likely to exhibit some subtle behavioural characteristics. If you do not add a test once you have found the error, you will have to live with the risk of the same bug impacting your customer again in the future.
