The Bulletproof Maintenance Window, Part 5: Validation and Discipline

This article is Part 5 in the 6-part series "The Bulletproof Maintenance Window". For the rest of the story, see the links at the bottom of the page.

You've completed the configuration portion of the maintenance window and solved the unexpected issues that came up along the way. Congratulations - you have completed between 35% and 50% of the work.

Validation is the mechanism that gives you and the business confidence that you accomplished what you set out to do. Sometimes, that means just not making things worse than before you started the change. Sometimes, it means that the system is in a better place. In the ideal case, it means that life will be better for the consumers of the applications as well as the engineers supporting the infrastructure. As network and infrastructure engineers, the price that we pay for good validation is personal discipline.

"Standards of Evidence"

How do you know that everything is healthy after you've finished your work? One approach that has served me well is what you could call the "standards of evidence" approach - an idea I've borrowed from the practice of law. The idea is that evidence must rise to certain standards in order to meet the legal burden of proof. Put another way - if you were called on in a court of law to prove that you left the network in a working state, what evidence could you offer to support your testimony? In network operations, the network is usually guilty until proven innocent. Your task when you approach validation is to obtain evidence that proves innocence in the face of some later problem.

Obtain Evidence

How do we obtain this evidence? If we want our maintenance work to be "bulletproof", we have to move beyond "nobody is screaming, so we must be done here". Here are some questions you might ask yourself when planning the validation portion of your work.

What kinds of state in a device could signal that things are healthy? Some examples are:

  • The number of IGP neighbors is an expected value.
  • The NAT table is being dynamically populated as expected.
  • The number of link-layer discovery neighbors and the names of those neighbors are expected values.
  • The members of affected Link Aggregation Groups are in an expected state.
  • The number of BGP peers and numbers of prefixes advertised to and received from those peers are expected values.

What kinds of state in a device could signal that things are unhealthy? Some examples are:

  • Any of the above are unstable over a period of time.
  • A device is stuck in a boot loop after a code upgrade.
  • Performance monitoring indicates an unexpected degradation of service as compared to the device's historical performance.
  • Performance monitoring indicates an unexpected degradation of service as compared to the application's historical performance.

These are just some examples meant to illustrate the principle - you can probably come up with at least 20 or 30 other points of validation just by thinking about problems you have experienced in your own career.
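To make the principle concrete, here is a minimal sketch of what codifying a few of these checks might look like in Python. The key names, expected values, and observed values are hypothetical placeholders - in practice the expectations would come from your pre-maintenance baseline and the observations from parsed device output.

    # A hypothetical "standards of evidence" checklist for a single device.
    # The expected values would come from your pre-maintenance baseline.
    expected = {
        "ospf_neighbors": 4,
        "bgp_peers_established": 12,
        "lldp_neighbors": {"Ethernet1": "core-sw-01", "Ethernet2": "core-sw-02"},
    }

    def validate(observed, expected):
        """Return a list of human-readable failures; an empty list means the evidence holds up."""
        failures = []
        for key, want in expected.items():
            got = observed.get(key)
            if got != want:
                failures.append(f"{key}: expected {want!r}, observed {got!r}")
        return failures

    # 'observed' would be built from device state collected after the change.
    observed = {
        "ospf_neighbors": 4,
        "bgp_peers_established": 11,
        "lldp_neighbors": {"Ethernet1": "core-sw-01", "Ethernet2": "core-sw-02"},
    }
    print(validate(observed, expected))
    # -> ['bgp_peers_established: expected 12, observed 11']

The point is not the specific keys - it's that the expectations live in a file you wrote before the window, so the comparison afterward is mechanical rather than a judgment call made at 2 a.m.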

Once you have thought through what to check, you have to do two critical things:

  • Document the state of the affected devices before the work begins - preferably in a plaintext format, so it's easy to diff between the pre-maintenance and post-maintenance snapshots
  • Document the state of the affected devices when you feel that you have completed all of the configuration work

Remember, we are collecting evidence here - hard facts that cannot be disputed by an expert witness. Once you have collected this evidence from all affected devices, you can diff between the pre- and post-maintenance snapshots. This diffing exercise is quite effective for detecting small issues that lead to later outages. For example, imagine that you have several hundred BGP sessions on a route reflector. That's going to produce a lot of state data. Imagine what the output from show ip bgp summary would be in that case - it would be very easy to miss a single session going down as a result of your maintenance and not coming back up, if all you are doing is eyeballing the output. But if you diff between pre- and post-maintenance snapshots it'll stick out like a sore thumb.
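If you want the machine to do the eyeballing for you, even Python's built-in difflib will do the job. This is a minimal sketch; the filenames are hypothetical placeholders for snapshots you saved as plaintext before and after the window.

    import difflib

    # Hypothetical plaintext snapshots of "show ip bgp summary", captured
    # before and after the maintenance window.
    with open("rr1-pre-show-ip-bgp-summary.txt") as f:
        pre = f.readlines()
    with open("rr1-post-show-ip-bgp-summary.txt") as f:
        post = f.readlines()

    # unified_diff emits only the lines that changed, so a single session that
    # dropped and never re-established stands out immediately.
    for line in difflib.unified_diff(pre, post, fromfile="pre-maintenance", tofile="post-maintenance"):
        print(line, end="")

Be aware that noisy fields like uptime counters will show up in the diff too; filtering those out, or choosing commands whose output is stable, is part of the planning work described above.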

Discipline

Doing all of this planning for validation can be tedious. Once you have thought through all of the questions about what and how to check to complete your validation, you have to write an entire section of your runbook to capture this data, not once, but twice. It's low-grade clerical work at that point. It's not enjoyable, and doing it requires personal discipline. I believe there is a ledger somewhere in the universe where effort spent in a maintenance window is logged. There is a requirement for a certain amount of work to complete any given maintenance effort, and you get to pick how much of that effort is expended before the window. If you choose not to expend it before the window, you'll have to account for that effort somewhere. Usually, it ends up getting pushed into the middle of the window, forcing you to do complex reasoning and problem solving in the middle of the night when you should be sleeping. Sometimes, the effort waits until the next day when a major outage is in progress. Personal discipline gives you the opportunity to choose to do most of the thinking during the day, and before a critical outage, when you are far more likely to do better work.

Tools

Doing all of this validation requires resources of time and attention that are probably in short supply by this point in the maintenance window. How can you get it all done? The answer is, you can't, because your human brain isn't built for this work. You need machines to do it. We need to automate the gathering and comparing of all of this operational data. Software developers have had this testing discipline for a long time - they know all about code review, unit tests, and CI/CD. Admittedly, the tooling to allow us to do this has historically been lacking for networkers in particular. But times are changing. Tools are available right now to let us delegate some of these validation and testing tasks to the machines:

  • You can use a simple Ansible playbook with the template module and a basic Jinja template to consistently record the state of the network before the change, and then run the same playbook afterward and diff the output. It's not very intelligent, but it'll help you spot the missing BGP session mentioned above.
  • A lot can be accomplished with Python via the nornir framework, paired with TextFSM (see the sketch after this list). Nornir abstracts the 'commodity' functions of connecting to a group of devices, parallelism, and so on, so that you can focus on controlling and retrieving data from those devices.
  • pyATS and Genie have recently started to gain traction in the community as a framework for running test cases against network infrastructure. This particular project is getting very close to being an easy way to do unit tests for the network.
  • Closed, vertically-integrated network equipment vendors are no longer the only game in town. Network disaggregation is making a dent in the universe as we speak. It is possible, today, to buy a datacenter switch from one vendor and software to run on that switch from another vendor. When that software is Linux, all of the former excuses about tooling and management interfaces melt away. It's still a bridge too far for many environments to go down this road, but it will soon be mainstream, in my opinion.
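As an illustration of the nornir item above, here is a minimal sketch that collects the same command from every device in an inventory and writes one snapshot file per host. It assumes nornir 3.x with the nornir-netmiko plugin installed and a config.yaml describing your inventory; the command and filenames are placeholders. Run it once before the window and once after, then diff the files as described earlier.

    from nornir import InitNornir
    from nornir_netmiko.tasks import netmiko_send_command

    # Assumes a config.yaml pointing at an inventory of the affected devices.
    nr = InitNornir(config_file="config.yaml")

    # Collect the same command from every host, in parallel.
    results = nr.run(task=netmiko_send_command, command_string="show ip bgp summary")

    # Write one plaintext snapshot per device, ready for diffing later.
    # Change "pre" to "post" when you re-run this after the configuration work.
    for host, multi_result in results.items():
        with open(f"{host}-pre-show-ip-bgp-summary.txt", "w") as f:
            f.write(multi_result[0].result)

Passing use_textfsm=True to the task is one way to bring TextFSM into the picture and get structured data back instead of raw screen output, though in that case you would serialize the result with json rather than writing it straight to a text file.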


Next, in Part 6, we'll explore opportunities for technical leadership which are hidden inside the work of planned maintenance.


This article is Part 5 in the 6-part series "The Bulletproof Maintenance Window".
