Troubleshooting War Stories: Part 2 - Reviving the Corporate Core Switch
Mohamed Bashir
NFV/SDN Engineer at Almadar Aljadid | 2xCCNPs(SEC-ENT) | CCNA | 2xVCPs(DCV-DW) | NSE 1-4 | MCSA | JNCIA-Junos | CNSS | AZ-900 | HCIA R&S | IC3
Introduction
This is the second article in our "Troubleshooting War Stories" series. This one is shorter and dives into an incident with a corporate core switch.
I recently received a call from a network administrator at a major government corporation. He told me their core switch had gone down. When I asked if they had a backup switch, he said it had been faulty for a while. Now the active core switch had suddenly failed, crippling the entire organization.
The solution to reviving this core switch was unbelievable. Let's take a closer look at the infrastructure and the troubleshooting case.
How It's All Connected
This government organization's network uses a 2-tier design at its corporate headquarters. There are 6 floors, and each floor has an access-layer Cisco switch.
Each of those access switches has two fiber uplinks - one going to CORE-SW-1, and the other to CORE-SW-2 in the data center. This redundant core design is meant to prevent a single point of failure.
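The article doesn't include the actual configuration, but a minimal sketch of what each access switch's dual uplinks might look like on Cisco IOS is below. The interface names, spanning-tree mode, and trunk setup are my assumptions for illustration, not the organization's real config:

```
! Hypothetical access-switch uplink config (Cisco IOS).
! Interface names and STP mode are illustrative assumptions.
spanning-tree mode rapid-pvst
!
interface GigabitEthernet1/0/49
 description Fiber uplink to CORE-SW-1
 switchport mode trunk
!
interface GigabitEthernet1/0/50
 description Fiber uplink to CORE-SW-2
 switchport mode trunk
```

With both uplinks trunking, spanning tree blocks one path in normal operation and unblocks it if the active core fails, which is what makes this design redundant on paper.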
However, as the network admin mentioned, the backup core switch (switch 2) had been faulty for a while. So, when the active core switch (switch 1) suddenly failed, it crippled connectivity across the entire 6-floor headquarters.
The Troubleshooting Process
The entire 6-floor headquarters was offline. All the employees were cut off from critical apps, email, and phones; everything was down.
I first logged into the failed primary core switch, CORE-SW-1. It was in recovery mode, and all the access-switch uplinks were down and in an error-disabled (err-disabled) state.
Upon further investigation, I found that the ports had been err-disabled due to persistent link flapping on those access switch uplinks. The ports were rapidly transitioning between up and down, so the core switch had automatically placed them in the err-disabled state to stop the unstable links from forwarding traffic.
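For reference, on Cisco IOS the err-disabled state and its cause can usually be confirmed with commands like the sketch below; exact keywords and output format vary by platform and software version:

```
! Lists affected ports with the err-disable reason (here: link-flap)
show interfaces status err-disabled
! Shows which error causes trigger err-disable on this switch
show errdisable detect
! Shows whether automatic recovery is enabled per cause
show errdisable recovery
! Pulls the up/down flap history from the log
show logging | include LINK|ERR_DISABLE
```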
I checked the physical connectivity and made various Layer 2 configuration changes to try to resolve the issue, but nothing worked. As a last resort, I even rebooted the core switch, but that didn't bring it back online either. At that point, the network admin and I agreed that we needed to blow the accumulated dust out of the switch with an air blower.
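The usual ways to recover an err-disabled port on Cisco IOS are a manual shut/no shut or enabling automatic recovery for the link-flap cause, as sketched below (the interface name is hypothetical). In this case, since the flapping turned out to be a hardware problem, no configuration change could keep the ports up:

```
configure terminal
 ! Manually clear one err-disabled port (hypothetical interface)
 interface GigabitEthernet1/0/1
  shutdown
  no shutdown
 exit
 ! Or let the switch retry err-disabled ports automatically
 errdisable recovery cause link-flap
 errdisable recovery interval 300
end
```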
We first cleaned the interfaces, fans, and power supplies and powered the switch on, but it still wouldn't come back online. So we removed the cover and thoroughly cleaned the switch board from the inside. After that, we powered it on again, and it finally came back online.
Conclusion
This incident shows the importance of regularly cleaning network equipment, even if it looks clean on the outside. Despite cleaning the external parts, the core switch still wouldn't work until we thoroughly cleaned the internal components.
Cleaning the SFP slots to prevent dust buildup between the SFP modules and the switch board is also crucial. Regular internal maintenance is essential to prevent dust-related issues and hardware failures that can bring down the whole network.
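On optics that support digital monitoring (DOM), degraded receive light levels caused by dust in SFP cages or dirty fiber ends can often be spotted before links start flapping. A periodic check like the one below is a reasonable habit, though command availability depends on the platform and the installed optics:

```
! Reports per-port optical TX/RX power, temperature, and alarm
! thresholds for DOM-capable SFPs; abnormally low RX power can
! indicate dirty connectors or dust in the SFP cage.
show interfaces transceiver detail
```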