Troubleshooting

With the massive growth of cloud computing over the last decade (and the design conventions that have arisen around it), it’s easy to forget that there are many “legacy” systems that were either moved into the cloud largely unaltered or remain in the datacenter. These systems reflect the opportunities, visions, and restrictions that their designers faced at the time, and the visibility into them varies wildly.

Even modern systems have plenty of holes. Consider a company that has built a minimum viable product that generates revenue and perhaps ridden a wave of high growth. Now that growth has tapered off, the focus is on keeping the current user base happy. What hasn’t received nearly as much attention is observability and automation.

This article focuses on you, the lucky point person for troubleshooting one of these systems. It explains how to effectively approach the problem, foster a healthy culture around what can easily become a blame fest, and become a respected ‘go-to’ resource for future troubleshooting efforts.

Your system has a problem.

Ideally, the system itself will be the first to alert you to a problem. A red sad face pops up in the middle of your single dashboard, you click on it to see the definition of the problem and some helpful context, and you swiftly drill down to the cause, enlisting SMEs as you go.

Many systems work like this instead: the red sad face belongs to your customer, you get a Slack from CX saying “Customer X says the website is down. The last time anything went wrong you worked on it, so here you go.” The SMEs don’t work here anymore. Their replacements are enthusiastic and competent but they’ve inherited far too much to know every module inside out. You’ll need them, but they’ll need wrangling.

Your manager’s manager creates a Zoom war room and opens with “Customer X says the website is down. Therefore every part of our high availability, distributed system is down. I’m messaging the CEO to return from their vacation on Skellig Michael immediately. This is an all-hands situation. How soon can you fix this?”

Your company never quite made the transition to that single dashboard, so you’re looking at multiple Zabbix portals, a Grafana page that summarizes some of them, a bespoke status page that doesn’t cover your newer systems, your email history, and some terminals spewing out trace-level logging.

You really have two problems: one is in the system, and the other is managing the people who are watching you very closely and, by proxy, the people who are watching them.

External Communication Part 1

You may have an Incident Manager who can handle communication, or that person may be you. Either way, even before you scope out the problem, let your points of contact know that you’re on the case and that you’ll update them once you can broadly categorize the problem. That sets an expectation and gives you room to do your job.

As you come to understand the problem, keep your contacts in the loop and move to update them on a set schedule rather than as discoveries unfold.

Keep a timeline

You have to start by assuming the problem is real and big, which means treating it as a company-wide incident with a Root Cause Analysis to follow. Any company worth its salt is going to want to see how to improve the response for next time, and for that they’ll need to know what happened when, and whether the response could have come sooner. One easy way of tracking time and action is simply a chat group. Drop in a word or two as you do things (“restarting server X”). Keep CX out of there unless they can be trusted; theories can become facts in the wrong hands.
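If you’d rather not rely on scrolling back through chat history when you write the Root Cause Analysis, a few lines of tooling can timestamp entries for you. Here’s a minimal sketch, assuming a hypothetical chat webhook URL and a local log file; both values are illustrative, not part of any particular setup:

```python
#!/usr/bin/env python3
"""Append a timestamped entry to the incident timeline and mirror it to chat."""
import json
import sys
from datetime import datetime, timezone
from urllib.request import Request, urlopen

# Hypothetical values; substitute your own webhook URL and log path.
WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"
TIMELINE_FILE = "incident-timeline.log"


def log_action(note: str) -> None:
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    line = f"{stamp}  {note}"

    # The local file is the raw material for the eventual Root Cause Analysis.
    with open(TIMELINE_FILE, "a", encoding="utf-8") as fh:
        fh.write(line + "\n")

    # Mirror the entry to the incident chat so everyone sees the same timeline.
    body = json.dumps({"text": line}).encode("utf-8")
    req = Request(WEBHOOK_URL, data=body,
                  headers={"Content-Type": "application/json"})
    urlopen(req, timeout=5)


if __name__ == "__main__":
    log_action(" ".join(sys.argv[1:]) or "checkpoint")
```

Save it as, say, timeline.py and run it with a short note (“python timeline.py restarting server X”); the local file gives the Root Cause Analysis an exact record even if the chat gets noisy.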

Scope It

Now you need to gather data to scope the problem. Few terms in tech are more abused than “down”, “crashed”, and “hung”, so start with the easy questions.

External Scoping

Is this a singular report?

  • Are there any other users affected?
  • If your system is structured that way, are only users for a particular company affected?

Did this ever work?

  • Is this a new user?
  • Is the user exercising a new feature?

What was the user doing, and when?

  • A false assumption here can derail the investigation, so try to get first-hand evidence of the problem rather than someone’s interpretation of it. That means screenshots, a video call, and client-side logs if you can.

Internal Scoping

What is your internal monitoring (and your cloud provider, if applicable) telling you? If it’s telling you nothing, then this is a good opportunity to kick off some parallel investigation from the client side. That’s going to depend a lot on how cooperative the user (or the company’s IT team) is feeling, and you may find yourself troubleshooting their systems for them. It can be taxing, but it’s a way to control some of the variables.
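When you do end up driving the investigation from the client side, even a tiny scripted probe beats asking the user to describe what their browser is doing. The sketch below is one way to capture DNS, TCP, and HTTP timings from the affected network; the hostname is a placeholder and the checks are deliberately crude:

```python
#!/usr/bin/env python3
"""Quick client-side probe: DNS resolution, TCP connect, and HTTP round trip."""
import socket
import time
from urllib.request import urlopen

HOST = "app.example.com"          # placeholder for the affected endpoint
URL = f"https://{HOST}/"          # any lightweight page will do


def timed(label, fn):
    start = time.monotonic()
    result = fn()
    print(f"{label}: {time.monotonic() - start:.3f}s  {result}")


def tcp_check():
    # Can the client open a TCP connection to port 443 at all?
    with socket.create_connection((HOST, 443), timeout=5) as sock:
        return sock.getpeername()


# 1. Does the client resolve the address you expect?
timed("dns ", lambda: socket.gethostbyname(HOST))
# 2. Is the port reachable from their network?
timed("tcp ", tcp_check)
# 3. What status code and latency does a full request see?
timed("http", lambda: urlopen(URL, timeout=10).status)
```

Compare the numbers from the user’s machine against the same probe run inside your own network; the difference usually tells you whether you’re chasing an application problem or a network one.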

External Communication Part 2

Don’t muse on what the cause could be with your CX contacts. They’re keen to communicate good news to the clients, and your blue-sky thoughts can be taken by the client as facts. Don’t reveal the internal names of your software processes; those can become the scapegoats for every near-future communication from the client.

Bottom Up, then Top Down

If you can make a solid correlation between what your internal monitoring is telling you and what you’re hearing from the client side, then your direction is likely set – off you go.

Otherwise it’s time for some more holistic questions. The big one is “What changed recently?” That can mean a new software deployment, an infrastructure change, a configuration change, a database schema change, node restarts – you get the idea. Hit up that change tracking system.
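If most of your change history lives in git, even a crude time-windowed query is a decent first pass. A minimal sketch, assuming the repository is checked out locally and that merges roughly correspond to deployments (neither is guaranteed in practice):

```python
#!/usr/bin/env python3
"""First pass at 'what changed?': list merges landed in a window around the incident."""
import subprocess

# Hypothetical window; widen it if nothing obvious turns up.
SINCE = "2024-05-01 00:00"
UNTIL = "2024-05-02 12:00"

result = subprocess.run(
    ["git", "log", "--merges",
     f"--since={SINCE}", f"--until={UNTIL}",
     "--date=iso", "--pretty=format:%h %ad %an  %s"],
    capture_output=True, text=True, check=True,
)
print(result.stdout or "No merges in that window; widen it or check other repos.")
```

Run the same query against your configuration and infrastructure repositories too; the change that matters is often not in the application code.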

What you do after that depends partly on the maturity of your CI/CD system (if you even have one) and whether you can roll back a feature deployment. Changes that are more infrastructure-oriented (k8s changes, network hardware changes) involve cross-team discussion.
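Where a full rollback is slow or politically fraught, an application-level kill switch can buy you time. The sketch below uses made-up flag and function names; it’s the bare minimum that lets you steer traffic away from a suspect code path without a rebuild, not a substitute for a proper feature-flag service:

```python
import os


def checkout_flow(cart):
    """Route traffic away from a suspect code path without a rebuild."""
    # Hypothetical flag; in practice this would come from your config or
    # feature-flag service rather than a bare environment variable.
    if os.environ.get("ENABLE_NEW_CHECKOUT", "false").lower() == "true":
        return new_checkout(cart)    # the feature under suspicion
    return legacy_checkout(cart)     # known-good fallback


def new_checkout(cart):
    ...                              # placeholder for the new implementation


def legacy_checkout(cart):
    ...                              # placeholder for the old implementation
```

If new features routinely ship behind a flag like this, “roll it back” becomes a configuration change rather than a cross-team negotiation.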

Independent of the system, though, is the feeling you have about what you’re doing. Once you’ve gathered the initial data, the troubleshooting effort should lead you in a certain direction, and further data gathering should reinforce that direction. If it doesn’t, then either you don’t have all the information you need or you’re headed in the wrong direction. Always be ready to challenge the assumptions you’ve made and reassess. Listen to that voice in your head (or on the Zoom call) that says something doesn’t fit.

Culture

Sometimes the outcome of a troubleshooting investigation is human error. In reality, that’s a failure of the process: there should be a procedure to follow for every change, production systems should never accept changes without validation, every change should be exercised in a beta environment, and every change should be thoroughly reviewed.

That level of rigor is the domain of the ideal system described above, and humans are fallible. Once you’ve nailed down the problem, part of the Root Cause Analysis should be deciding what can practically be done to avoid a repeat. Whatever you come up with, take care to stay dispassionate during the process. Next time the human error could be yours.
