
Debugging

Huw Evans

Head of Retail Engineering at Fruugo.com

Published: 14 March 2024


This is typically how I go about debugging a piece of code:

What is wrong?
Reproducing the error
Finding the source of the bug
Showing that the fix works - adding a test

What is Wrong?

In modern systems, working out what is wrong might take some work. Any single end-to-end computation might involve multiple services and calls to several storage systems. An error anywhere along this path may be the source of your bug.

A clear bug report can help enormously, as someone will likely state that System A is not working correctly, narrowing your search. However, it's still possible that System B is causing the error. For example, a bug in System B might cause erroneous data to be written to a data store used by System A. This is one reason why debugging multi-service environments is a challenge: the bug might not lie in the code on System A's path; it might lie in incorrect data that System A reads while executing that path.
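The System A and System B scenario can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a description of any real system: the store, the function names and the pence/pounds mix-up are all hypothetical. It shows System A's code being correct while its result is wrong, and a check on the data, rather than on System A's code, pointing at the cause.

```python
from typing import Dict, List

# Hypothetical shared data store: System B writes to it, System A reads from it.
Store = Dict[str, float]


def system_b_write(store: Store, order_id: str, amount_pence: int) -> None:
    """System B (deliberately buggy): stores pounds as a float, not integer pence."""
    store[order_id] = amount_pence / 100


def system_a_total_pence(store: Store) -> float:
    """System A: correct for the data it expects (integer pence per order)."""
    return sum(store.values())


def invalid_records(store: Store) -> List[str]:
    """Return the ids of records that break System A's assumption about the data."""
    return [order_id for order_id, amount in store.items() if not isinstance(amount, int)]


store: Store = {}
system_b_write(store, "order-1", 1250)
system_b_write(store, "order-2", 399)

print(system_a_total_pence(store))  # wrong total, although System A's code is correct
print(invalid_records(store))       # the data, not System A's path, is at fault
```

Stepping through system_a_total_pence would not reveal anything wrong; inspecting what is in the store does.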

You need to be crystal clear on what should be happening and what is happening. What should be happening is the desired system behaviour. What is currently happening is the bug.

Reproducing the error

It is crucial for you to reproduce the error, because:

You need to know that the version of the system you have access to has the reported issue
You will be experimenting with this system to find the source of the bug

If you cannot reproduce the error, the version of the system the issue is being reported in may not be as up-to-date as the version you are using. In this case, an answer may be for the user to upgrade their system. If they are reluctant to upgrade, you may need to experiment with their version of the system.
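The decision described above can be written down as a small helper. This is only a sketch: it assumes versions are dotted numbers such as 2.10.0, and the function names and the final fallback message are my own placeholders, not part of the article.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '2.10.0' into (2, 10, 0) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in text.strip().split("."))


def next_step(reported: str, local: str, reproduced_locally: bool) -> str:
    """Suggest what to try next, following the reasoning in the paragraph above."""
    if reproduced_locally:
        return "investigate locally"
    if parse_version(reported) < parse_version(local):
        # The user is on an older version than the one you tested.
        return "ask the user to upgrade, or experiment with version " + reported
    # Placeholder for cases the article does not cover.
    return "check environment and data differences"


print(next_step("2.3.1", "2.4.0", reproduced_locally=False))
```

Comparing tuples of integers avoids the classic mistake of comparing version strings, where "2.10.0" sorts before "2.9.5".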

Finding the source of the bug

You now understand what is wrong and the version of the system that exhibits the error.

As you understand the difference between what should be and what is happening, you know the behavioural gap that describes your bug. To address that bug, you need to find the location of the incorrect behaviour, so you now need to search for that location.

And as we all know, searching for something takes time.

This is because this search is typically a process of informed trial and error.

Trying things and seeing what happens is required because, in systems of any size or age, the person who wrote the code is often not the person investigating the bug. Even if they are the code's author, they will likely have forgotten how the code works.

In modern, multi-service systems, there isn't a single person who wrote the code, but a whole team of people, probably over the years, some of whom no longer work at your organisation. This contributes to the challenge at hand.

Some errors are more obvious, e.g., your user interface says Usernme instead of Username. To find this error, you can easily search your codebase for Usernme, safe in the knowledge that anything you find can be easily corrected. This is at the trivial end of the search spectrum, but not all bugs are created equal.
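In practice a tool such as grep or your editor's search does this job. Purely as an illustration, the same search can be sketched in Python; the function name find_text and the default list of file suffixes are assumptions made for this example.

```python
from pathlib import Path
from typing import Iterator, Tuple


def find_text(
    root: Path, needle: str, suffixes: Tuple[str, ...] = (".py", ".html", ".js")
) -> Iterator[Tuple[Path, int, str]]:
    """Yield (file, line number, line) for every line under root containing needle."""
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        try:
            lines = path.read_text(encoding="utf-8").splitlines()
        except (UnicodeDecodeError, OSError):
            continue  # skip files that cannot be read as text
        for number, line in enumerate(lines, start=1):
            if needle in line:
                yield path, number, line.strip()
```

Calling list(find_text(Path("src"), "Usernme")) would return each offending file and line, assuming the code lives under a directory called src.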

Others are more difficult to locate, such as a system displaying only five transactions when you know that 14 should be shown. The cause of this error might be anywhere in your system, the result of executing any end-to-end computation path.

In this case, one place to start would be the user interface code. Is all the transaction data it receives successfully displayed? If so, you might look at the code that returns the list of transactions to confirm that this component of your end-to-end path is working correctly. If it is, you move to the code that generates the list of transactions to check that its logic is correct, along with the code that writes the list to wherever it is returned from. If these are correct, you then move to the code describing each transaction's contents to confirm that 14 should be generated under these circumstances. Maybe 14 is wrong. Is five correct in this case? If 14 is correct, there is still a bug somewhere: five transactions are being displayed, but the correct number is 14. Something else must account for the error. You then look at other areas, such as how data is exchanged between the user interface code and the service returning the list. This interaction may be terminating too early, which would explain why you see fewer results than expected.

As you can see, finding a bug means asking a series of questions, each one a hypothesis about where the fault lies, and then answering it. If the answer confirms the hypothesis, you have found the bug. If it rules the hypothesis out, you move on to the next question.
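One way to make this question-by-question search concrete is to measure the same quantity at each stage of the end-to-end path and look for the first stage where it diverges from what you expect. The sketch below is hypothetical: the stage names are invented, and the counts are hard-coded stand-ins for whatever probes (logs, queries, debugger output) you would use in a real system.

```python
from typing import Callable, List, Optional, Tuple

# A stage is a name plus a probe that reports how many transactions it sees.
Stage = Tuple[str, Callable[[], int]]


def first_divergent_stage(stages: List[Stage], expected: int) -> Optional[str]:
    """Return the name of the first stage whose count differs from expected."""
    for name, probe in stages:
        if probe() != expected:
            return name
    return None  # every stage agrees with the expected count


# Invented counts for illustration only.
stages: List[Stage] = [
    ("generated by service", lambda: 14),
    ("written to response", lambda: 14),
    ("parsed by user interface", lambda: 5),
    ("displayed", lambda: 5),
]

print(first_divergent_stage(stages, expected=14))
```

Each probe answers one question. The stage returned is where to look next, not necessarily the line of code at fault.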

There might come a time when the quality of your questions dips because you are getting tired or have run out of places to look. Take a break or ask a colleague. Or better still, take a break and ask a colleague. Often, while you are explaining what you have found to a co-worker, the likely location presents itself, or a good line of enquiry becomes clear.

In modern systems, finding the source of a bug takes work. Every line of code you release is an additional line of code you must exclude when investigating the next bug.

In our multi-service systems, setting up a working end-to-end version to investigate can be challenging, because the environment that is easiest for a software engineer to use is their local one (typically a resource-constrained laptop). Giving the engineer additional resources in the cloud lets the investigation run against actual transactions rather than rely on mocking. If you are relying on mocking, you aren't testing the version of the system with the error; you are testing the mock, which isn't what the user who reported the error is running.
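The risk with mocking can be shown with a small, hypothetical example. The service interface (fetch_body), the newline-separated JSON format and the RealService class are all assumptions made for this sketch; RealService is only a stand-in for a deployed service that emits one corrupted record.

```python
import json
from unittest.mock import Mock


def count_displayed(service) -> int:
    """Hypothetical UI logic: counts records until the first unparseable one."""
    shown = 0
    for line in service.fetch_body().splitlines():
        try:
            json.loads(line)
        except json.JSONDecodeError:
            break
        shown += 1
    return shown


class RealService:
    """Stand-in for the deployed service, which emits one corrupted record."""

    def fetch_body(self) -> str:
        lines = [json.dumps({"id": i}) for i in range(1, 15)]
        lines[5] = lines[5][:-1]  # drop the closing brace of the sixth record
        return "\n".join(lines)


# The mock returns a tidy, well-formed body, so the bug never shows up.
mock_service = Mock()
mock_service.fetch_body.return_value = "\n".join(
    json.dumps({"id": i}) for i in range(1, 15)
)

print(count_displayed(mock_service))   # the mock looks healthy
print(count_displayed(RealService()))  # the stand-in for the real service does not
```

The mock encodes what you believe the service returns, which is exactly the belief the bug contradicts.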

As part of your bug investigation, you take a closer look at the payload returned to the user interface code by the remote service. It has header information with the number of contained transactions. The value is 14, which is good: the service reports the correct number of values. However, when you take a closer look at the payload contents, you notice that the syntax is corrupted, so when the user interface iterates over the list of transactions, only five are correctly processed, not the full 14.
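A comparison between the header count and the number of records that actually parse is enough to expose this kind of fault. The following sketch assumes a made-up payload format (a count field plus a body of newline-separated JSON records) and a parser that stops quietly at the first bad record; the article does not say what the real format or parser look like.

```python
import json
from typing import Dict, List


def build_payload(n: int, corrupt_at: int = -1) -> Dict[str, object]:
    """Build a hypothetical payload of n records, optionally corrupting one of them."""
    lines = []
    for i in range(n):
        line = json.dumps({"id": i + 1, "amount": 100 * (i + 1)})
        if i == corrupt_at:
            line = line[:-1]  # drop the closing brace to corrupt the syntax
        lines.append(line)
    return {"count": n, "body": "\n".join(lines)}


def lenient_parse(body: str) -> List[dict]:
    """Mimics a UI loop that silently stops at the first record it cannot parse."""
    records = []
    for line in body.splitlines():
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            break
    return records


payload = build_payload(14, corrupt_at=5)  # the sixth record is malformed
records = lenient_parse(payload["body"])
print(payload["count"], len(records))  # prints "14 5"
```

The header is right and the display is wrong, so the fault sits between the two: in the payload contents and in how they are parsed.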

You have successfully found the source of the bug.

This investigation may have taken a number of days: setting up any environment to enable the investigation to take place and narrowing down the location of the bug through trial and error and discussion with your colleagues. Setting up the environment may take longer than you hoped, as you might be reliant on others to give you permission to access systems.

Showing that the fix works - adding a test

You now know where the error happens within your system.

There are two things to do:

1. Add a test to the system to expose the bug
2. Fix the issue

Point 1 is essential. This kind of bug is hard to find. In a codebase with multiple developers, it is crucial to your business for the system to be self-monitoring, because that increases the quality of your code. If you fix the bug but don't add a test, should the error reappear, no one will notice until it negatively impacts your customer. In that case, they raise a new bug report, and you have to expend the effort to investigate again, distracting you from your more strategic work. The knock-on effect of not putting in a test is significant.

Adding a test prevents all of this from happening because if the error is reintroduced (e.g., by a faulty third-party library generating the service payload), your test will fail, alerting you to the issue immediately, at build time, before your code is released into production and used by your customer.
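Such a regression test might look like the sketch below, written with Python's unittest module. It rests on the same assumptions as before: the payload format (a count plus newline-separated JSON records) and the names parse_transactions and make_payload are hypothetical, and in a real codebase the test would exercise the actual code that produces and consumes the payload rather than a helper defined alongside it.

```python
import json
import unittest
from typing import List


def parse_transactions(payload: dict) -> List[dict]:
    """Strict parse: raise on a malformed record or a count mismatch instead of truncating."""
    records = [json.loads(line) for line in payload["body"].splitlines()]
    if len(records) != payload["count"]:
        raise ValueError(
            f"header says {payload['count']} transactions, parsed {len(records)}"
        )
    return records


def make_payload(n: int) -> dict:
    """Build a well-formed hypothetical payload with n transactions."""
    body = "\n".join(json.dumps({"id": i}) for i in range(1, n + 1))
    return {"count": n, "body": body}


class TransactionPayloadTest(unittest.TestCase):
    def test_all_fourteen_transactions_are_parsed(self):
        self.assertEqual(len(parse_transactions(make_payload(14))), 14)

    def test_corrupted_record_fails_loudly(self):
        payload = make_payload(14)
        lines = payload["body"].splitlines()
        lines[5] = lines[5][:-1]  # remove the closing brace of the sixth record
        payload["body"] = "\n".join(lines)
        with self.assertRaises(json.JSONDecodeError):
            parse_transactions(payload)

    def test_count_mismatch_fails_loudly(self):
        payload = make_payload(5)
        payload["count"] = 14
        with self.assertRaises(ValueError):
            parse_transactions(payload)
```

Saved in a test module, this can be run with python -m unittest as part of the build, so a reintroduced fault stops the release rather than reaching the customer.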

You fix the issue by updating the third-party library, and your test shows that the problem has been resolved. You release a new version to production, and all 14 transactions are displayed for the user, who is now happy again.

Conclusion

Debugging modern, multi-service systems is hard. Such investigations typically require dedicated time to hunt for the issue in an extensive codebase that is likely to exhibit subtle behaviour. If you do not add a test once you have found the error, you will have to live with the risk of the bug recurring and impacting your customer again.
