Good Exception Handling - Chapter 1
Prasad Edlabadkar
Sr. VP & Head of Engineering @ RAKBANK | Open Banking, AI, Digital Transformation
I have been designing and implementing software systems for many years now, and one thing I have found consistently missing in the designs is enough thought on "how would operations teams work with this?". Typically, design and development of an application lasts a few months, but the operations / production support team has to maintain it for many years, sometimes decades. Yet teams invest very little time in designing and implementing constructs that help operations teams maintain and troubleshoot the application effectively. This results in prolonged incident resolution times, wasted effort identifying the root cause, issues detected only after they have caused damage, and so on.
In this series of articles, I try to articulate some of the key aspects you can consider while designing an application to improve the efficiency of the operations team, reduce the number of support calls, and become more proactive in identifying problems before they occur.
As a side effect, these practices will enable development teams to better debug applications during test phases.
Why Bother About Good Exception Handling?
No software system is perfect. Every software system will throw errors at some point. The important question is, how do you handle these errors? What does it mean to "handle" an exception? What exactly do you do when you "handle" one? These are some of the key questions to answer when dealing with exceptions.
The answer varies depending on who you ask. A developer would say "We catch and log the exception" or "We convert the exception into a meaningful error message for the consumer". Yet it is usually hard to tell what mechanisms are available for the support team to work with exceptions. In most cases, a monitoring agent picks up the logged exception and sends it to a central logging system. This central system can then raise an alert for the support team. Is that enough? What does the production support team do with this alert? How do they know what action to take? Unfortunately, there is no easy answer. Over time, most teams learn to make sense of an exception alert. But this experience is not persistent. Organizations lose it as people move between teams.
To avoid these challenges, development teams must invest a good amount of time in exception handling.
If done correctly, this can reduce the number of incidents reported. At the same time, it reduces the bandwidth the support team needs.
In this article, we will explore the importance and benefits of meaningful error messages. We will discuss more aspects of exception handling in subsequent articles.
Note: The examples below may seem oriented towards Java as the programming language. However, the same strategy can be applied to any programming language or software platform.
Return Meaningful Error Messages
How many times do you see applications that return an error saying "Unable to process request. Please try again later" or "Unknown error" or "Server returned an error response"? If these messages are displayed on a user interface or returned as part of an API response, the consumer will never be able to understand what went wrong. Even worse, they will not understand what they can do to recover from the error. So they either retry many times without changing anything or report it via support channels. Retrying without changing anything results in more failures and frustration, while reporting to a support channel increases the load on the support team. Instead, what if the error message described what went wrong and how the consumer can recover from it? It would make life much easier for both the consumer and the support team. To understand this better, let's look at some scenarios of good and bad error messages.
Scenario 1: Underlying backend service is down for maintenance.
This error message doesn't tell the consumer anything about the error or how they can recover from it. This type of error message guarantees increased frustration, more support team calls, or both.
With a good error message, the consumer gets actionable insight about the downtime. This prevents them from retrying requests that would only fail and cause frustration.
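The contrast can be sketched in Java as follows. This is a minimal illustration, not a prescribed implementation: `MaintenanceWindowException`, `MaintenanceMessages`, the "payments" service name, and the message wording are all hypothetical, and the sketch assumes the backend can report when its maintenance window is expected to end.

```java
import java.time.Instant;

// Hypothetical exception raised when a backend is in planned maintenance.
final class MaintenanceWindowException extends RuntimeException {
    private final String serviceName;
    private final Instant expectedEnd;

    MaintenanceWindowException(String serviceName, Instant expectedEnd) {
        super(serviceName + " is under maintenance until " + expectedEnd);
        this.serviceName = serviceName;
        this.expectedEnd = expectedEnd;
    }

    String serviceName() { return serviceName; }
    Instant expectedEnd() { return expectedEnd; }
}

final class MaintenanceMessages {
    // Bad: says nothing about the cause or when a retry could succeed.
    static String vague() {
        return "Unable to process request. Please try again later.";
    }

    // Better: names the cause and tells the consumer when to retry.
    static String actionable(MaintenanceWindowException e) {
        return "The " + e.serviceName() + " service is down for scheduled maintenance. "
             + "Please retry after " + e.expectedEnd() + " (UTC).";
    }

    public static void main(String[] args) {
        MaintenanceWindowException e = new MaintenanceWindowException(
                "payments", Instant.parse("2024-01-01T02:00:00Z"));
        System.out.println(vague());
        System.out.println(actionable(e));
    }
}
```

Carrying the expected end time inside the exception keeps the decision about wording in one place, so the same exception can feed both the consumer-facing message and the support alert.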
Scenario 2: Validation Failures
This error message doesn't say which data is invalid or what the consumer needs to do to recover. They will end up retrying the request by trial and error, which adds load without any business value and eventually leads to a call to the support team.
With a good error message, the consumer knows exactly what is missing and what they need to do. Again, the theme here is actionable insight.
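As a rough Java sketch of this idea, the validator below reports every invalid field together with the rule it violated, instead of a single generic "validation failed" message. The `TransferValidator` class, its field names, and the specific rules (10-digit account number, 3-letter currency code) are assumptions made up for illustration.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

// Hypothetical validator for an illustrative funds-transfer request.
final class TransferValidator {
    // Returns one actionable message per invalid field; an empty list means valid.
    static List<String> validate(String accountNumber, String currency, BigDecimal amount) {
        List<String> errors = new ArrayList<>();
        if (accountNumber == null || !accountNumber.matches("\\d{10}")) {
            errors.add("accountNumber must be exactly 10 digits.");
        }
        if (currency == null || !currency.matches("[A-Z]{3}")) {
            errors.add("currency must be a 3-letter uppercase code, for example AED.");
        }
        if (amount == null || amount.signum() <= 0) {
            errors.add("amount must be greater than 0.");
        }
        return errors;
    }

    // Combines the individual failures into a single consumer-facing message.
    static String toMessage(List<String> errors) {
        return errors.isEmpty()
                ? "OK"
                : "Request rejected. Fix the following and resubmit: " + String.join(" ", errors);
    }

    public static void main(String[] args) {
        List<String> errors = validate("12345", "aed", new BigDecimal("-5"));
        System.out.println(toMessage(errors));
    }
}
```

Returning all failures at once, rather than stopping at the first one, lets the consumer correct the request in a single attempt.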
Scenario 3: Unknown Errors
Not every exception can be anticipated. At some point, your code will throw an unknown exception that is not handled. In such cases, it is even more important to frame your error message correctly, so that consumers do not get frustrated. Typically, these are "catch-all" errors where you return a standard hardcoded error message. But in this case the consumer doesn't know what to do. Your error message only tells them to contact support. So you can expect a high call volume if the issue is systemic.
Instead, you can change the error message to:
This is still a standard message. But it reduces consumer support calls, since they know someone is already working on it. It is, however, important to have a good alerting mechanism for the support team. They must be notified about the problem immediately and start working on it as soon as possible. It is also good to have a system availability page that shows the operational status of the various systems.
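A catch-all handler along these lines might look like the Java sketch below. The `CatchAllHandler` class and the message wording are hypothetical. Including a reference ID in the message is an addition of this sketch, not something the article prescribes, and the sketch assumes that SEVERE-level log entries trigger an alert in your monitoring setup.

```java
import java.util.UUID;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical catch-all handler for exceptions no other handler recognised.
final class CatchAllHandler {
    private static final Logger LOG = Logger.getLogger(CatchAllHandler.class.getName());

    // Logs the full exception for the support team and returns a consumer-safe message.
    static String handle(Throwable unexpected) {
        // The reference ID lets support match a consumer report to the log entry.
        String referenceId = UUID.randomUUID().toString();
        // Assumption: SEVERE entries are picked up by monitoring and raise an alert.
        LOG.log(Level.SEVERE, "Unhandled exception, reference " + referenceId, unexpected);
        return "Something went wrong on our side. Our support team has been notified "
             + "and is working on it. Reference: " + referenceId;
    }

    public static void main(String[] args) {
        try {
            throw new IllegalStateException("simulated failure");
        } catch (RuntimeException e) {
            System.out.println(handle(e));
        }
    }
}
```

The internal exception text stays in the log and never reaches the consumer, while the promise that support "has been notified" only holds if the alerting behind the log entry actually works.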
Scenario 4: Context In Business Error Messages
Your code will perform a number of business validations before processing a request and send any validation failures back to the consumer for correction. But many times these messages are "static", business-defined texts that lack context. The consumer needs some level of system understanding and experience to fully understand and correct the errors. This results in several retries or calls to support channels.
Below is an example of one such message
This error message does highlight the problem. But it doesn't tell the consumer exactly who is not a valid user or what they need to do to become one. Without that information, consumers have to call support channels, which increases support call volume. Instead, you can tailor the same error message as shown below to prevent this situation. The error message assumes that the approver is not a valid user while the requester is.
This message now pinpoints the exact issue and is actionable for the consumer. You can go a step further and provide a link to a request form to simplify things even more. This avoids support calls and enables consumers to act in a self-serve manner.
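The difference between a static and a context-aware business message can be sketched in Java as follows. `ApprovalValidator`, the user IDs, the message wording, and the `https://example.com/access-request` link are all placeholders invented for this illustration.

```java
import java.util.Set;

// Hypothetical business validation for a request that needs an approver.
final class ApprovalValidator {
    // Placeholder link to an access request form; not a real endpoint.
    static final String ACCESS_FORM_URL = "https://example.com/access-request";

    // Static message: highlights the problem but not who failed or what to do.
    static String staticMessage() {
        return "Requester or approver is not a valid user.";
    }

    // Context-aware check: names the party that failed and the next step.
    static String validate(Set<String> validUsers, String requester, String approver) {
        if (!validUsers.contains(requester)) {
            return contextual("Requester", requester);
        }
        if (!validUsers.contains(approver)) {
            return contextual("Approver", approver);
        }
        return "OK";
    }

    private static String contextual(String role, String userId) {
        return role + " '" + userId + "' is not a registered user of this system. "
             + "Ask them to request access at " + ACCESS_FORM_URL
             + " and resubmit the request once access is granted.";
    }

    public static void main(String[] args) {
        // The requester is valid, the approver is not.
        System.out.println(staticMessage());
        System.out.println(validate(Set.of("alice"), "alice", "bob"));
    }
}
```

The validation logic is the same in both cases. The only change is that the message is built from the values that failed the check instead of being a fixed string.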
There may be many such scenarios in practice that you can easily handle by thinking about error messages as a mechanism to reduce support call volume. They should also provide actionable insight to the customer about the failure.
Up Next
In the next chapter, we will discuss the design of an exception handling framework that simplifies developers' lives and allows managed evolution of exception handling over time.