Good Exception Handling - Chapter 1

I have been designing and implementing software systems for many years, and one thing I have found consistently missing across designs is enough thought about "how would operations teams work with this?". Typically, design and development of an application lasts a few months. But the operations / production support team has to maintain it for many years, sometimes decades. Yet teams invest very little time in designing and implementing constructs that help operations teams maintain and troubleshoot the application effectively. This results in prolonged incident resolution times, wasted effort identifying root causes, issues detected only after they have caused damage, and so on.

In this series of articles, I try to articulate some key aspects you can consider while designing an application to improve the efficiency of the operations team, reduce the number of support calls, and become more proactive in identifying problems before they occur.

As a side effect, these practices will enable development teams to better debug applications during test phases.

Why Bother About Good Exception Handling?

No software system is perfect. Every software system will throw errors at some point. The important question is: how do you handle these errors? What does it mean to "handle" an exception? What exactly do you do when you "handle" one? These are some of the key questions to answer when dealing with exceptions.

The answer varies depending on who you ask. A developer would say "We catch and log the exception" or "We convert the exception into a meaningful error message for the consumer". Yet it is usually difficult to tell what mechanisms are available for the support team to work with exceptions. In most cases, a monitoring agent picks up the logged exception and sends it to a central logging system. This central system can then raise an alert for the support team. Is that enough? What does the production support team do with this alert? How do they know what action to take? Unfortunately, there is no easy answer. Over time, most teams learn to make sense of an exception alert. But this experience does not persist. Organizations lose it as people move between teams.

To avoid these challenges, development teams must invest a good amount of time in exception handling to:

  • Recover from an exception by making the end user aware of the problem and of what they can do to proceed without contacting the support team. This helps reduce call volumes to the support team.
  • Notify the production support team about the exception and the actions they need to take to recover from it.

If done correctly, this can reduce the number of incidents reported. At the same time, it reduces the bandwidth required from the support team.

In this article, we will explore the importance and benefits of meaningful error messages. We will discuss more aspects of exception handling in subsequent articles.

Note: The examples below may seem oriented towards Java as the programming language. However, the same strategy can be applied to any programming language or software platform.

Return Meaningful Error Messages

How many times do you see applications that return an error saying "Unable to process request. Please try again later" or "Unknown error" or "Server returned an error response"? If these messages are displayed on a user interface or returned as part of an API response, the consumer will never be able to understand what went wrong. Even worse, they will not understand what they can do to recover from the error. So they either retry many times without changing anything or report it via support channels. Retrying without changing anything results in more failures and frustration, while reporting to a support channel increases the load on the support team. Instead, what if the error message described what went wrong and how the consumer can recover from it? It would make life much easier for both the consumer and the support team. To understand this better, let's look at some scenarios of good and bad error messages.

Scenario 1: Underlying backend service is down for maintenance.

[Image: example of a bad error message]

This error message doesn't tell the consumer anything about the error or how they can recover from it. This type of error message guarantees increased frustration, more support team calls, or both.

[Image: example of a good error message]

With a good error message, the consumer gets actionable insight about the downtime. This prevents them from retrying, which would only result in failure and frustration.
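As a rough illustration of Scenario 1, the Java sketch below builds a message that names the affected service, says when it is expected back, and tells the consumer what to do. The class name, method name, service name and message wording are all assumptions made for this example; they are not taken from the original screenshots.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper for illustration only.
final class MaintenanceErrorMessage {

    // Formats the end of the maintenance window in UTC, to the minute.
    private static final DateTimeFormatter UTC_MINUTES =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm 'UTC'", Locale.ROOT)
                    .withZone(ZoneOffset.UTC);

    private MaintenanceErrorMessage() {}

    // Says what is down, until when, and what the consumer should do next.
    static String forService(String serviceName, Instant availableAgainAt) {
        return serviceName + " is down for scheduled maintenance until "
                + UTC_MINUTES.format(availableAgainAt)
                + ". Your request was not processed. Please try again after that time.";
    }

    public static void main(String[] args) {
        // "Payments service" and the timestamp are made-up sample values.
        System.out.println(forService("Payments service", Instant.parse("2024-01-15T06:00:00Z")));
    }
}
```

The point of the sketch is that the end time of the maintenance window is part of the message, so the consumer knows that retrying before then is pointless.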

Scenario 2: Validation Failures

[Image: example of a bad validation error message]

This error message doesn't provide information about which data is invalid or what the consumer needs to do to recover. As a result, they retry the request by trial and error. This increases load without adding any business value and eventually leads to a call to the support team.

[Image: example of a good validation error message]

With a good error message, the consumer knows exactly what is missing and what they need to do. Again, the theme here is actionable insight.
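One possible way to produce validation messages like this in Java is sketched below. It reports every failing field in one response and pairs each problem with a concrete fix. The validator class, the `email` and `amount` fields, and the message wording are hypothetical and chosen only for illustration.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

// Hypothetical validator for illustration only.
final class PaymentRequestValidator {

    private PaymentRequestValidator() {}

    // Returns one message per invalid field; an empty list means the input is valid.
    static List<String> validate(String email, String amount) {
        List<String> errors = new ArrayList<>();

        if (email == null || email.trim().isEmpty()) {
            errors.add("email: no value was provided. Enter an address such as name@example.com.");
        } else if (!email.contains("@")) {
            errors.add("email: '" + email + "' is missing an '@'. Enter an address such as name@example.com.");
        }

        if (amount == null || amount.trim().isEmpty()) {
            errors.add("amount: no value was provided. Enter a positive number such as 25.00.");
        } else {
            try {
                if (new BigDecimal(amount.trim()).signum() <= 0) {
                    errors.add("amount: '" + amount + "' is not greater than zero. Enter a positive number such as 25.00.");
                }
            } catch (NumberFormatException notANumber) {
                errors.add("amount: '" + amount + "' is not a number. Enter a positive number such as 25.00.");
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        // Prints one line per problem so the consumer can fix everything in a single retry.
        validate("", "abc").forEach(System.out::println);
    }
}
```

Collecting all failures before returning, rather than stopping at the first one, spares the consumer the trial-and-error loop described above.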

Scenario 3: Unknown Errors

Not all exceptions can be anticipated. At some point, your code will throw an unknown exception that is not handled. In such cases, it is even more important to frame your error message correctly, so that consumers do not get frustrated. Typically, these are "catch-all" errors where you return a standard hardcoded error message. But in this case the consumer doesn't know what to do. Your error message only tells them to contact support. So you can expect a high call volume if the issue is systemic.

[Image: example of a bad unknown error message]

Instead, you can change the error message to:

[Image: example of an improved unknown error message]

This is still a standard message. But it reduces consumer support calls, since consumers know that someone is already working on the problem. Still, it is important to have a good alerting mechanism for the support team. They must be notified about the problem immediately and start working on it as soon as possible. It is also good to have a system availability page that shows the operational status of various systems.
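A minimal Java sketch of such a catch-all is shown below. It assumes a hypothetical alerting hook (`supportNotifier`) that forwards the failure to the support team, and it adds a reference ID to the message. The reference ID is an addition made for this example rather than something the article prescribes; it simply gives support a way to match a consumer report to an alert.

```java
import java.util.UUID;
import java.util.concurrent.Callable;
import java.util.function.BiConsumer;

// Hypothetical catch-all wrapper for illustration only.
final class CatchAllHandler {

    // Assumed alerting hook: receives the reference ID and the unexpected exception.
    private final BiConsumer<String, Exception> supportNotifier;

    CatchAllHandler(BiConsumer<String, Exception> supportNotifier) {
        this.supportNotifier = supportNotifier;
    }

    // Runs the operation; on any unexpected exception, alerts support first
    // and then returns a standard message that tells the consumer so.
    String handle(Callable<String> operation) {
        try {
            return operation.call();
        } catch (Exception unexpected) {
            String referenceId = UUID.randomUUID().toString();
            supportNotifier.accept(referenceId, unexpected);
            return "Something went wrong on our side. Our support team has been notified"
                    + " and is working on it, so you do not need to report it."
                    + " Reference: " + referenceId;
        }
    }

    public static void main(String[] args) {
        // Here the "alert" is just a line on stderr; a real system would page or raise a ticket.
        CatchAllHandler handler = new CatchAllHandler(
                (id, e) -> System.err.println("ALERT " + id + ": " + e));
        System.out.println(handler.handle(() -> {
            throw new IllegalStateException("simulated failure");
        }));
    }
}
```

The message is only honest if the notification really happens, which is why the sketch calls the notifier before it builds the response.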

Scenario 4: Context In Business Error Messages

Your code will perform a number of business validations before processing a request. It will send any validation failures back to the consumer for correction. But many times these messages are "static", business-defined texts that lack contextual information. Fully understanding and correcting the errors then requires some level of system understanding and experience. This results in several retries or calls to support channels.

Below is an example of one such message:

[Image: example of a bad business error message]

This error message does highlight the problem. But it doesn't tell the consumer exactly who is not a valid user or what that person needs to do to become one. In the absence of such information, consumers will have to call support channels, which increases support call volume. Instead, you can tailor the same error message as shown below to prevent this situation. The error message assumes that the approver is not a valid user while the requester is.

[Image: example of a good business error message]

This message now pinpoints the exact issue and is actionable for the consumer. You can go a step further and provide a link to a request form to simplify things even more. This avoids support calls and enables consumers to act in a self-serve manner.
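The idea can be sketched in Java as follows. Instead of a static "not a valid user" text, the message is built from the request itself, so it names the role and the user that failed the check and states the next step. The validator, the set of registered users, and the wording are assumptions made for this illustration.

```java
import java.util.Collections;
import java.util.Optional;
import java.util.Set;

// Hypothetical business validation for illustration only.
final class ApprovalRequestValidator {

    private ApprovalRequestValidator() {}

    // Returns an error message naming the party that failed, or empty if both are valid.
    static Optional<String> validate(String requester, String approver, Set<String> registeredUsers) {
        if (!registeredUsers.contains(requester)) {
            return Optional.of(notRegistered("Requester", requester));
        }
        if (!registeredUsers.contains(approver)) {
            return Optional.of(notRegistered("Approver", approver));
        }
        return Optional.empty();
    }

    // Adds the context a static message lacks: which role, which user, and what to do.
    private static String notRegistered(String role, String userId) {
        return role + " '" + userId + "' is not a registered user of this system."
                + " Ask them to submit an access request, then resubmit this request.";
    }

    public static void main(String[] args) {
        // "alice" and "bob" are made-up sample users; only "alice" is registered.
        Set<String> users = Collections.singleton("alice");
        validate("alice", "bob", users).ifPresent(System.out::println);
    }
}
```

If an access request form exists, its link could be appended in `notRegistered`, in line with the self-serve suggestion above.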

There may be many such scenarios in practice that you can easily handle by thinking of error messages as a mechanism to reduce support call volume. Error messages should also give the consumer actionable insight into the failure.

Up Next

In the next chapter, we will discuss the design of an exception handling framework that simplifies developers' lives and allows managed evolution of exception handling over time.

Prasad Edlabadkar

Sr. VP & Head of Engineering @ RAKBANK | Open Banking, AI, Digital Transformation
