#2 Should you let users report issues?
Guillaume Desrat
Polymorphic Sr. Software Engineer (IC/M): Tech Lead, Team Lead, Engineering Manager | Coach, mentor, writer, and hobbyist cooker
This is the third issue of Arranged ROMs (Rational Opinions, Mostly). After writing about the difficulty of finding documentation last month, I cover below the question of whether you should allow users of your services to report issues. I explain why this is a real question, which tools and people are involved in managing (and hopefully resolving!) issues, and which mitigations prevent storms of incoming user-reported (and even system-reported) incidents. I end with an idea combining a great user experience with well-documented issues.
You have services running, and users. They interact directly with your services through user interfaces, APIs, or they consume outputs (reports, emails, videos, ...). Should you let them report issues? This seems like a trivial question. Of course you want them to report when something goes wrong. Now, imagine you have hundreds of services, and thousands of customers. Something goes wrong and impacts your largest segment of users. Do you still want each user to report separately on the error they face? Are you willing to scale your support workforce linearly with the count of your users?
This article is not about mitigating the likelihood of incidents occurring; it is about how to support users while focusing on the minimum number of well-documented, ready-to-investigate issues. I see two distinct, complementary paths to achieve this: communicating proactively with users, and tooling that controls how issues get reported.
Context
Why is support needed?
In an ideal world, services don't break in the first place. No user mistakes a feature for a bug. From the requirements to the implementation, everybody is on the same page: Product, Tech, and UX teams all make the product perfect. No user asks any question, nor reports any issue. No need for support!
Wake up, Neo...
Since we live in the real world, services crash, and become unavailable to users. Dependency services crash, and become unavailable to the services relying on them, which then operate at best in degraded mode. To make things worse, you can have variations of the above by replacing "users" with "some users, with no idea what is common to all of them", "users with a given role", "users from a specific geographical region", "users who connect at a given time of the day" (and its harder variant: of their day). And so on.
Depending on what your services are, issues can prevent users from working, playing, buying, communicating, ... (non-exhaustive list). For you, this translates into reputation losses, and/or revenue losses. If we consider life-critical services, this can lead to loss of life. Even without considering the worst impact, services require support. Remember that users reporting issues uncaught by alarms are working for free to improve your service — they deserve some help.
About incident tracking systems
Email-based support quickly shows limitations, even if emails are sent and replied to from a shared mailbox. One is that unencrypted email doesn't play nice with confidential information. It's difficult to attach metadata, like a severity level, or files (system logs, screenshots, ...). Delay in delivery sucks, too. And delivery of an email is best-effort, to quote my former colleague Romain. Whatever the size of your team, organization, or company, you need a centralized repository of incidents. It can be the same one you manage work items with (Kanban board, Scrum sprint, waterfall backlog, Shape Up batch — whatever methodology you use).
Incident management systems basically allow you to create, update, and archive incidents. Updates allow all parties to document incidents (as seen previously), by writing a description of the issue, posting comments to report on the progress made, and attaching optional files (screenshots, system logs, ...). Incident management systems usually feature a varying set of additional metadata, identifying the service impacted, the severity of the issue (usually a combination of qualitative and quantitative assessments: what's the impact for each user? how many users are impacted?), the team or individual in charge of handling it, and the current incident status. An audit trail reveals the timeline of the incident, what was investigated, and which teams participated in its resolution; such a trace is useful to write the post-mortem of a large-scale incident.
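To make these fields concrete, here is a minimal sketch of what such an incident record could look like; the field names, types, and statuses below are illustrative assumptions, not the schema of any particular incident management product.

```typescript
// Illustrative incident record; every field name and enum value here is an
// assumption for the sake of the example, not a real product's schema.
type IncidentStatus = "open" | "investigating" | "mitigated" | "resolved" | "archived";

interface Incident {
  id: string;                       // unique identifier assigned by the system
  title: string;
  description: string;              // initial documentation of the issue
  service: string;                  // impacted service
  severity: 1 | 2 | 3 | 4 | 5;      // combines per-user impact and number of users impacted
  assignee?: string;                // team or individual currently in charge
  status: IncidentStatus;
  attachments: string[];            // keys or URLs of screenshots, system logs, ...
  comments: { author: string; postedAt: Date; body: string }[];
  createdAt: Date;
  updatedAt: Date;                  // with the comments, this forms the audit trail
}
```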
These systems often expose an API for services and alarms to programmatically file incidents. They provide notification endpoints to drive actions in other services. One common use case is to page the on-call engineer of the assigned team. This notification feature also means software engineers can build bots to report on incidents to chat rooms, or update the incident work log when the alarms turn off.
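As a sketch of that notification feature, the snippet below assumes the incident management system can be configured to POST an event to a webhook whenever an incident is created or updated; the payload shape, the endpoint path, and the chat webhook URL are all assumptions for illustration.

```typescript
import express from "express";

// Minimal webhook receiver: the incident system is assumed to POST an event here
// on every incident creation or update (the payload shape below is an assumption).
const app = express();
app.use(express.json());

const CHAT_WEBHOOK_URL = process.env.CHAT_WEBHOOK_URL!; // e.g. a chat room's incoming webhook

app.post("/incident-events", async (req, res) => {
  const { incidentId, title, severity, status } = req.body;

  // Relay a short summary to the team chat room.
  await fetch(CHAT_WEBHOOK_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text: `[SEV${severity}] ${incidentId} is now ${status}: ${title}` }),
  });

  // A fuller bot could also page the on-call engineer for high severities,
  // or close the chat thread when the alarms turn off.
  res.sendStatus(204);
});

app.listen(3000);
```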
Support isn't free
The other users of incident management systems are the ones who work to understand, then resolve issues. Software engineers can take turns handling this duty. This usually lasts one week, then the engineer returns to their regular activities. It's considered good practice that engineers share retrospectives of their support week with their team, to identify trends and actions to take to improve the operational excellence of the services. You can also operate with a dedicated team of support engineers, who triage the issues, and resolve most of them relying on their experience and the agreed procedures. If something is off track compared to their run book, they pass it to the software engineers. In the former case, the impact on development bandwidth is predictable. In the latter case, it is not — although it's expected to be smaller thanks to the support engineers. In both cases, support isn't free.
As much as your engineers are willing to support, they need to spend time investigating issues, fixing the code or configuration, deploying changes, and monitoring the resolution of incidents. All of this while communicating regularly with the requester through the incident management system. The fewer duplicate incidents are created, the less time engineers waste triaging, documenting, and closing them. The better the quality of the incident's initial documentation, the faster they can investigate.
Mitigation: communication
How do you mitigate the likelihood of a storm of incoming tickets, when a large number of users are impacted, and each of them is prone to file a separate support request?
Human communication
The cheapest approach is proactive communication, though users may not take notice before experiencing the issue themselves, and may still report it. Broad communication over email or chat may go unseen; it nonetheless slightly increases the chance that users read it, or are told about it by impacted colleagues. Such communication shall provide details about the issue, the expected resolution time, and where to find live details (in the incident management system, or on an external status page).
Information banners, those user interface components displayed in the applications, can warn users of known issues. They usually contain a text, and a link as a reference to the incident. Since the banner text is sourced from a database, there's a low-hanging fruit in creating and updating them automatically based on the existence of incidents — provided there's an API to do so.
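Here is a minimal sketch of that low-hanging fruit, assuming both an incident API that can list open incidents and a banner API in the application; the two endpoints, their query parameters, and the payloads are hypothetical.

```typescript
// Sketch: keep information banners in sync with open incidents.
// Both endpoints below (incident search and banner upsert) are assumptions;
// substitute your incident management system's and application's real APIs.
const INCIDENT_API = "https://incidents.example.com/api/incidents?status=open&severity=1,2";
const BANNER_API = "https://app.example.com/api/banners";

async function syncBanners(): Promise<void> {
  const incidents: { id: string; title: string; url: string }[] =
    await (await fetch(INCIDENT_API)).json();

  for (const incident of incidents) {
    // Upsert one banner per open, high-severity incident; the banner links back
    // to the incident so users can follow the live status.
    await fetch(BANNER_API, {
      method: "PUT",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        key: `incident-${incident.id}`, // stable key so a later run updates rather than duplicates
        text: `Known issue: ${incident.title}`,
        link: incident.url,
      }),
    });
  }
}

// Run periodically, e.g. from a scheduler, so banners appear and disappear with incidents.
syncBanners().catch(console.error);
```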
API communication
If your service exposes an API, its response can carry valuable information for the caller. You want to be diligent and provide enough details about the issue, without revealing undocumented features, or the implementation. Be semantic: use HTTP status codes to indicate the category of error, and provide additional HTTP response headers to let the caller know what to do, and when to re-attempt. Here are a few examples.
If the query parameters submitted as part of the API call are incorrect (erroneous syntax, format, value), the service shall return an HTTP 400 Bad Request response. Since no official HTTP header is intended to contain the error message, you can use a proprietary one (X-Request-Error, or simply Request-Error), and document it.
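A minimal sketch of that first case, using Express; the route, the query parameter, and the Request-Error header name are illustrative choices, not a standard.

```typescript
import express from "express";

const app = express();

// Sketch of a 400 Bad Request carrying a proprietary error header.
// The route, parameter, and header name are illustrative.
app.get("/reports", (req, res) => {
  const from = req.query.from as string | undefined;

  if (!from || Number.isNaN(Date.parse(from))) {
    res
      .status(400)
      .set("Request-Error", "query parameter 'from' must be an ISO 8601 date")
      .json({ error: "invalid 'from' parameter" });
    return;
  }

  res.json({ from, items: [] });
});

app.listen(3000);
```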
If the service is struggling to process requests, and invites the caller to lower its call rate, it shall return an HTTP 429 Too Many Requests response. Sending back the maximum request rate in a response header helps, at least for the caller to understand why they're throttled. If the service is totally unavailable, an HTTP 503 Service Unavailable response is suitable. As the documentation suggests, use the Retry-After HTTP header in the response for both cases. Otherwise it's like calling a government office, being told all lines are busy or that you're calling outside business hours... and the recorded message abruptly ends without telling you when you can call back. Frustrating.
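The sketch below shows both cases in one Express handler; the rate limit, its naive in-memory counter, and the maintenance flag are assumptions, and a production service would use a proper rate limiter instead.

```typescript
import express from "express";

const app = express();

const MAX_REQUESTS_PER_MINUTE = 100;   // illustrative advertised limit
let requestsThisMinute = 0;            // naive in-memory counter, for the sketch only
setInterval(() => { requestsThisMinute = 0; }, 60_000);

const MAINTENANCE_MODE = process.env.MAINTENANCE_MODE === "true";

app.get("/orders", (req, res) => {
  if (MAINTENANCE_MODE) {
    // Whole service unavailable: say so, and say when to come back.
    res.status(503).set("Retry-After", "300").json({ error: "service unavailable" });
    return;
  }

  requestsThisMinute += 1;
  if (requestsThisMinute > MAX_REQUESTS_PER_MINUTE) {
    // Throttled: tell the caller the limit and when to retry.
    // RateLimit-Limit follows an IETF draft convention; a proprietary header works too.
    res
      .status(429)
      .set("Retry-After", "60")
      .set("RateLimit-Limit", String(MAX_REQUESTS_PER_MINUTE))
      .json({ error: "too many requests" });
    return;
  }

  res.json({ orders: [] });
});

app.listen(3000);
```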
Mitigation: tools
How do you prevent duplicate incidents, and increase the quality of the initial documentation, so issues can be investigated faster?
Incident creation API: A word of caution
As mentioned previously, services can programmatically rely on an API to create incidents. This is a neat feature as it allows engineers to implement alarms from within services. Where external alarms can report incidents from the outside, with little to no details, services can report incidents from the inside, and provide implementation-level details: execution stack traces, keys of related database objects — all that is necessary for support to quickly investigate and diagnose the incident. However, exercise this power with caution! You don't want to report on every single back-end action failing: a dependency component or service going down may lead all corresponding service executions to fail. And since the idea is to reduce the number of incidents flowing to your support squad, this could head you the wrong way. APIs of incident management systems usually accept a unique identifier forged by the requester, to prevent the creation of duplicates. In conjunction with filing an incident, services shall move the failed execution to a queue, for a separate alarm to create a higher-severity incident past a defined threshold.
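Here is a minimal sketch of a service reporting its own failure that way; the incident API endpoint, its payload, and the assumption that it deduplicates on an Idempotency-Key header are hypothetical, but the caller-forged key and the queue-plus-threshold pattern are the point.

```typescript
import { createHash } from "node:crypto";

// Hypothetical incident creation endpoint; replace with your incident system's API.
const INCIDENT_API = "https://incidents.example.com/api/incidents";

async function reportFailure(service: string, operation: string, error: Error): Promise<void> {
  // Forge a stable key from whatever identifies "the same" incident, so that
  // repeated failures of the same kind collapse into a single ticket.
  const dedupKey = createHash("sha256")
    .update(`${service}:${operation}:${error.name}`)
    .digest("hex");

  await fetch(INCIDENT_API, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Idempotency-Key": dedupKey, // assumes the incident API deduplicates on this header
    },
    body: JSON.stringify({
      title: `${service}: ${operation} failing`,
      description: error.stack, // implementation-level details for the investigator
      severity: 4,
    }),
  });

  // In parallel, the failed execution would be pushed to a queue; a separate alarm
  // watching the queue depth files a higher-severity incident past a defined threshold.
}
```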
A dedicated, intermediate service to file incidents
Since your services can interface with incident management systems through APIs, you can spawn a small application dedicated to capturing user support requests. Through a series of questions, and drop-down list boxes, it guides the user to properly (or at least more reliably) fill in all necessary details before shuttling them to the actual incident management system. The creation workflow can help better identify the user-facing service impacted (from the pasted URL for example, or by showing the user screenshots to select from), establish the time at which the issue happened, and check for similar reports already created by their team or organization, before eventually filing the incident on behalf of the user. Such a system lowers the chance of having duplicates by formatting the incident title and description. It can also capture additional information, like the alarm status at the time of reporting the incident. One could even connect to the impacted service, fetch technical details in the database, or parse the system logs for what relates to the user within the time range of the incident.
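A minimal sketch of such an intake service follows; the incident API, its search parameters, and the enrichment step are assumptions, and a real version would plug in your own log and alarm sources.

```typescript
import express from "express";

// Hypothetical incident management API; the search parameters and payload are assumptions.
const INCIDENT_API = "https://incidents.example.com/api/incidents";

const app = express();
app.use(express.json());

app.post("/support-requests", async (req, res) => {
  const { userId, service, pageUrl, happenedAt, description } = req.body;

  // 1. Look for an existing open incident on the same service, so the user
  //    can follow it instead of creating a duplicate.
  const existing = await (await fetch(`${INCIDENT_API}?status=open&service=${service}`)).json();
  if (existing.length > 0) {
    res.json({ duplicateOf: existing[0].id });
    return;
  }

  // 2. Enrich the report with technical context the user cannot provide
  //    (placeholder here; a real version might pull system logs for the time range).
  const logsExcerpt = `system logs for ${service} around ${happenedAt} (not fetched in this sketch)`;

  // 3. File the incident on behalf of the user, with a normalized title and description.
  const created = await (await fetch(INCIDENT_API, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      title: `[user report] ${service}: issue on ${pageUrl}`,
      description: `Reported by ${userId} at ${happenedAt}.\n\n${description}\n\n${logsExcerpt}`,
      severity: 4,
    }),
  })).json();

  res.status(201).json({ incidentId: created.id });
});

app.listen(3000);
```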
With fewer incidents, and higher-quality documentation, the support workforce doesn't have to scale linearly with the number of users.
Going further
Since the least controllable vector, both in terms of quantity and quality, is the user, controlling how they file incidents is key. The dedicated service to report issues mentioned above is great because it operates separately from other services. This means that even if yours is unreachable, or totally broken, users can still report it. However, if it's unreachable, or totally broken, chances are you're already aware of it. So this is not where a dedicated tool brings value. Its value comes from better qualifying and describing the issues impacting the users. Users asynchronously reporting precisely what doesn't work for them frees the support workforce from engaging in a chat session to translate the perceived issue into an understandable case with technical details they can investigate. As an example, if users are in a different time zone, it is challenging to schedule a screen share session, and that delays the time to resolve their issue.
Piggy-backing on the idea of the workflow to report an issue, what about a visual clipper to take a screenshot, or a short video, of the page? The user would point at what is wrong or not working in their opinion, and provide a text description of what they experience. The tool would automatically capture the URL, extract relevant state (for JavaScript applications), and send all the details for a back-end service to register the support request. This back-end service, as written previously in "A dedicated, intermediate service to file incidents", could attach system logs to provide all necessary material to investigate the issue.
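Browser-side, such a clipper could boil down to something like the sketch below; the state accessor and the /support-requests endpoint (the intake service from the previous section) are assumptions, and the actual screenshot or video capture would require a dedicated library or browser API.

```typescript
// Browser-side sketch of the reporting widget: it captures the current URL, a
// user-provided description, and a snapshot of the application state, then posts
// everything to the intake service. The state provider and endpoint are assumptions;
// screenshot or video capture is left out and would need a dedicated capture library.
interface AppStateProvider {
  snapshot(): unknown; // e.g. a serializable slice of the front-end store
}

async function reportIssue(description: string, state: AppStateProvider): Promise<void> {
  const payload = {
    pageUrl: window.location.href,
    happenedAt: new Date().toISOString(),
    userAgent: navigator.userAgent,
    description,
    appState: state.snapshot(),
  };

  await fetch("/support-requests", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  });
}
```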
Unfortunately, the development effort for such a visual incident reporting tool could outweigh the benefits of time saved for a small team. This would rather be a project to develop with the intent of shipping the product to others, or at a large company where the gain correlates to its scale. Or something to start small, and grow gradually, as needed.
Thanks for reading this far
These are my thoughts on letting users report issues, and how to let them do so in a useful way, both for them and for the support team. I would love to know if you use other approaches to let users communicate to you their questions, worries, and problems regarding services you own — please comment below.
Stay tuned for another article on Arranged ROMs next month. Subscribe to the newsletter to be notified when future articles are published.