Proper Error Handling
No matter what programming language you use, engineers need to make dozens to hundreds of small decisions every day. Such decisions can sometimes save us and, other times, create many problems. Some of these decisions can be called assumptions. Depending on the context, they could be the side effect of a lack of proper discovery, a feature factory rushing to deliver, or a lack of care. In reality, error handling is one of the most challenging things in computer science, alongside naming, cache invalidation, and off-by-one errors. We usually get error handling wrong: when we are supposed to throw an error/exception and crash the application, we instead return some bad, sneaky default that will produce a bug down the road; when we should ignore missing information, we end up crashing the app. Code reviews rarely, if ever, cover error handling. It’s common to see services in production that don’t even have proper exception tracking due to improper error logging. Error handling is tricky. We can’t have a simple formula we apply to every scenario; we need to think, judge, and review our decisions. Ideally, that review happens in an explicit form via code review or a team design session; otherwise, we will review it when incidents happen in production.
Modern software is often complex. Such complexity usually manifests in various forms: many dependencies, internal shared libraries, monoliths, and distributed monoliths. However, complexity is also the result of bad technical decisions driven by complex business rules and a lack of comprehensive integration tests.
Fail Fast vs Fail Safe
There are basically two ways we can handle software errors. The first option is called Fail Fast: we throw an exception or error to break the application by design when something is missing, such as a parameter, value, or state. The Erlang/Scala community is famous for a philosophy often called “Let it crash”, where the assumption is that if you let the application crash and restart with a clean slate, there is a good chance the error will go away.
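To make the fail-fast idea concrete, here is a minimal sketch. The function name `create_user` and its fields are hypothetical, not from the original post; the point is rejecting bad input at the boundary instead of letting a `None` or empty value travel deeper into the system:

```python
def create_user(name, email):
    """Fail fast: validate at the boundary and crash (raise) immediately
    on bad input, rather than returning a sneaky default."""
    if not name:
        raise ValueError("name is required")
    if not email or "@" not in email:
        raise ValueError("email is invalid")
    return {"name": name, "email": email}
```

The alternative, silently defaulting `name` to `""` and carrying on, is exactly the kind of decision that produces a bug far away from its cause.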
Fail Safe, conversely, tries to “recover” from the error. NetflixOSS was famous for applying this philosophy with a framework called Hystrix, where code was wrapped in commands, and such commands always had fallback code. Amazon, by contrast, is famous for preferring to double down on the main path rather than focusing on fallbacks.
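The command-with-fallback idea can be sketched in a few lines. This is a toy illustration of the pattern, not the actual Hystrix API, and `fetch_recommendations` / `default_recommendations` are hypothetical names:

```python
def with_fallback(primary, fallback):
    """Fail safe: run the primary path; if it throws, run the fallback."""
    def run(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            return fallback(*args, **kwargs)
    return run

def fetch_recommendations(user_id):
    # Simulated failure of the main path (e.g., a slow downstream service).
    raise TimeoutError("downstream service timed out")

def default_recommendations(user_id):
    # A static, safe default: degraded but still useful to the user.
    return ["top-seller-1", "top-seller-2"]

get_recs = with_fallback(fetch_recommendations, default_recommendations)
```

Note the trade-off the text describes: the fallback path is code too, and if it is never exercised by tests, it can be just as broken as the main path.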
Now, no matter whether we lean more toward fail-fast or fail-safe, we need proper integration tests to ensure we trigger the non-happy paths of the code. Otherwise, we don’t know if we coded it right, and we will eventually discover that in production, in the most expensive form for us and the final user. So, what should you do? Here is my guidance.
Fail Safe vs. Fail Fast Recommendations:
Exception vs Errors
Exceptions are usually used “internally” and errors “externally.” Consider a typical service: inside the service (depending on the language, of course), you will have exceptions, while the contract to the outside world exposes errors (considering HTTP/REST interfaces, for instance).
Some languages have both, and different frameworks may handle things differently. For instance, when you use a centralized log solution, you can log both exceptions and errors. IMHO, you should leverage exceptions as much as possible because they carry more context thanks to the stack trace. One common mistake is to log exceptions incorrectly, so that only the message reaches the centralized logging solution; you need to make sure you are sending the stack trace as well.
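The “message only” logging mistake is easy to demonstrate with the standard `logging` module (the logger name `payments` and the `charge` function are made up for the example). `logger.error(str(e))` ships only the message; `logger.exception(...)` attaches the full traceback to the record:

```python
import io
import logging

# Capture log output in memory so we can inspect what actually got logged.
log_buffer = io.StringIO()
logger = logging.getLogger("payments")
logger.addHandler(logging.StreamHandler(log_buffer))
logger.setLevel(logging.ERROR)

def charge():
    raise RuntimeError("card declined")

try:
    charge()
except RuntimeError:
    # BAD:  logger.error(str(e)) -- only "card declined" reaches the log.
    # GOOD: logger.exception() logs the message AND the full stack trace.
    logger.exception("charge failed")

# The traceback (file, line, call chain) made it into the log output.
assert "Traceback" in log_buffer.getvalue()
```

Whatever library or language you use, the check is the same: open your centralized log solution and confirm the stack trace is actually there, not just the message.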
Exceptions should never be used for flow control (that is an anti-pattern). We need to be careful when “translating” internal exceptions to external errors, because that’s when bad things can happen. Be cautious about catching just one type of exception; ideally, you should also catch at the most high-level exception possible to avoid swallowing errors.
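One way to sketch that translation boundary (the exception type, status codes, and `handle_request` helper are illustrative assumptions, not a prescribed design): known internal exceptions map to specific external errors, and a top-level catch ensures nothing unexpected is silently swallowed:

```python
class PaymentError(Exception):
    """Internal, domain-level exception used inside the service."""

def handle_request(work):
    """Boundary layer: translate internal exceptions into external errors.
    The broad `except Exception` at the end catches at the highest level,
    so unexpected failures surface as a 500 instead of vanishing."""
    try:
        return {"status": 200, "body": work()}
    except PaymentError as e:
        return {"status": 402, "body": str(e)}      # known domain failure
    except Exception:
        return {"status": 500, "body": "internal error"}  # everything else
```

If the boundary only caught `PaymentError`, a stray `KeyError` would escape the translation layer entirely, which is exactly the kind of surprise this pattern is meant to prevent.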
Avoid swallowing exceptions unless they are not actually errors. IMHO, when a user types the name of a person to perform a search, let’s say “Matug534rht78934ht7980123”, and this person does not exist, that is not an error; the user simply searched for something that does not exist. IMHO, it is OK to return an error code, usually a 404. However, I recommend not logging such exceptions to a centralized log solution, because there is no action for you to take; it would only create noise.
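That distinction can be sketched like this (the in-memory `db` dict and `search_person` function are hypothetical stand-ins for a real datastore and handler):

```python
def search_person(db, name):
    """An empty search result is expected behavior, not a failure:
    map it to a 404-style response and skip centralized error logging."""
    person = db.get(name)
    if person is None:
        # No exception raised, nothing sent to the centralized log: no noise.
        return {"status": 404, "body": "not found"}
    return {"status": 200, "body": person}

people = {"Alice": {"name": "Alice"}}
```

The key design choice: the caller still gets a clear error code, but the operations team gets zero log entries for something they can take no action on.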
False Positives and False Negatives
When doing error handling, we can have four scenarios.
A true negative is when there is no error/exception, and we treat it as such. A true positive is when there is an error/exception, and we detect it. A false positive is when it looks like an error/exception but is not. A false negative is when it looks like there is no error, but actually there is one.
Stack traces are good for troubleshooting and investigation, but you do not want to be investigating all the time; you should only need to investigate when you don’t know what’s happening, and most of the time, you should know exactly what’s going on. What does this mean? It means that, as much as you know your application/service when it works, you should also know it when it does not. Some people call this the failure mode: you must know how your application can fail and precisely when each failure is happening (which is best handled with testing).
Signal vs Noise
Signals bring clarity and meaning, while noise is just obscurity and mystery. When observing your services in production, you want to know immediately what’s going on; you want to maximize understanding as fast as possible. If your service throws hundreds to thousands of exceptions daily, it will be hard to make sense of them. That’s why you want to monitor exceptions very closely and improve error handling and observability daily.
As I said before, it’s great to have stack traces in a centralized log, but the more you need to use them, the less signal you have. You should have a proper exception metric that clearly signals what is going on.
Nature of Computation
Services cannot be that different at the end of the day. There are just a few patterns of computation and things that can be happening; here are some examples:
RPC Call: The most common kind of service performs RPC calls to other services. Here, the most common error-handling concern is timeouts on the calls you depend on.
Async / FF: Software that is asynchronous or fire-and-forget requires internal monitoring, because the consumer or caller is not waiting for a direct answer. Again, look for timeouts, but here we can also count successes and errors. When was the last time it ran? With success? With errors?
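A minimal sketch of that internal monitoring (the `JobMonitor` class is an illustrative assumption; in production these counters would feed a metrics system rather than live in memory):

```python
import time

class JobMonitor:
    """Track what a fire-and-forget job cannot tell its caller:
    how many runs succeeded or failed, and when each last happened."""

    def __init__(self):
        self.success_count = 0
        self.error_count = 0
        self.last_success_at = None
        self.last_error_at = None

    def run(self, job):
        try:
            result = job()
            self.success_count += 1
            self.last_success_at = time.time()
            return result
        except Exception:
            # Record and absorb: no caller is waiting for this exception,
            # so the metric (not a thrown error) is how anyone notices.
            self.error_count += 1
            self.last_error_at = time.time()
            return None
```

With counters and last-run timestamps in place, “when did this job last succeed?” becomes a dashboard query instead of a production investigation.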
Event-Driven / Webhooks: When things are event-driven, you might not know when they will run, and if they fail, you might not have a direct link to a user (unlike an RPC call from the browser). So, I would give the same advice as for Async/FF workloads: look for timeouts, count successes and errors, and track when the workload last ran with success and with errors.
Batch / Queues: People often use the word batch without meaning batch; batch means we process things in groups, i.e., 100 records, 1k records, 10k records, etc. Very often, people have a batch size of 1 and call it batch :-) Besides that, it is always a good idea to measure arrival and departure rates for queues.
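Arrival-vs-departure measurement can be sketched with two counters around a queue (the `QueueStats` class is an illustrative assumption; real systems would sample these counters over time windows to get rates):

```python
import collections

class QueueStats:
    """Count arrivals and departures: if departures persistently lag
    arrivals, the backlog is growing and the queue is in trouble."""

    def __init__(self):
        self.arrived = 0
        self.departed = 0
        self._queue = collections.deque()

    def enqueue(self, item):
        self.arrived += 1
        self._queue.append(item)

    def dequeue(self):
        item = self._queue.popleft()
        self.departed += 1
        return item

    def backlog(self):
        # A steadily growing backlog is the signal to alert on.
        return self.arrived - self.departed
```

Comparing the two rates also exposes the “batch of 1” case: if every departure matches one arrival in lockstep, the system is processing items one by one, whatever it is called.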
Improving Error Handling
Error handling can be improved; here are some practices you can adopt to get better at it:
Research shows that most catastrophic failures can be avoided with simple testing.
Cheers,
Diego Pacheco
Originally published at https://diego-pacheco.blogspot.com on September 16, 2024.