Troubleshooting network problems using a checklist of 22 things
- Assumptions! What is really wrong? Is the network being blamed for something else? Fully describe and detail the issue. The mere act of writing it down often clarifies matters.
- Kick the tyres and do a visual inspection. With smartphones readily available, take pictures. I once went to a factory where there was a problem. Upon inspection, the network equipment was covered in pigeon poo! The chassis had rusted and the PCBs were being affected by the stuff. No wonder there was a problem. Another example involved radio links, where it is difficult to troubleshoot alignment errors remotely. (I can recall when a heavy storm blew some radio links out of alignment. Until we climbed onto the roof we never realised how strong the wind really was that day!)
- Cabling. Is the cable actually plugged in? Is it plugged into the correct location? Wear and tear on cabling cannot be discounted either. At a minimum, invest in a decent cable tester. Check for power cables that run parallel to network cables. Check for dust on fibre optic connectors.
- Check the auto-negotiation settings. Many problems are the result of speed and duplex settings being misconfigured on the switch or the host. Tip: Auto is best! Surprisingly, this is the biggest current problem in networks. If you have a decent network management tool you can detect these mismatches, and even discovery protocols like LLDP and CDP provide visibility.
- Are packets being dropped? The next biggest problem is often the misinterpretation of bandwidth usage. As an example, a 2 Mb/s WAN link cannot sustain a load of more than 2 Mb/s. Simple? Often a techie will look at an hourly usage graph and say that because the graph shows a peak of no more than 1 Mb/s there is no problem. WRONG. Data is bursty and the graph only shows averages, which means that a load exceeding the available bandwidth for even a few seconds will drop packets and impact applications. Dropping packets that exceed the permitted rate is called policing. This is mitigated by shaping, which queues the excess traffic instead of dropping it. It is crucial to understand that shaping must be set to a rate below the policing rate, or else there is no benefit as packets will still be dropped.
- Check the network drivers. Most of the network drivers that ship with operating systems are not optimal! Visit the NIC (Network Interface Card) manufacturer's website and update to the latest driver.
- Walk through the configuration. Are the IP addresses correct? Are the subnets correct? Is the right VLAN being used? Is the gateway correct?
- Changes. Compare and determine the differences. Firewall rule changes are prime candidates for review. (And don't discount desktop firewalls!) When reviewing changes, consider: a) What: conditions, activity, equipment; b) When: schedule, occurrence, status; c) Where: local, environment; d) How: practice, actions, procedures; and e) Who: personnel, supervision. Review the network documentation. Is what is written there reflected in reality?
- Power! Often network equipment does not start up correctly after a power outage or is adversely affected by brownouts. It might be prudent to restart the equipment to ensure a clean start-up.
- Refer to those Release Notes. Somewhere in the world someone has had the same problem as you. Download and read the latest release notes for your network equipment.
- Black holes. It is amazing how common black holes really are in networks, and it is usually down to incorrect MTU settings. I can recall a mad day of scrambling around attempting to troubleshoot network connectivity issues before finally narrowing it down to a WAN compression device that was messing with the MTU. Be sure to check the MTU on all the appliances and network devices along the communications path. As more tunnelled networks are deployed this issue will occur more often. (I was recently phoned by a pal who had the issue on one of his customer's networks between some old 3Com kit and a Cisco WAN. Everyone had gone down the wrong path in trying to troubleshoot the issue before I suggested he check the MTU. Voila! Problem solved.)
- Sniff. Wireshark's powerful features make it the tool of choice for network troubleshooting. Load the software and capture a copy of the packets involved in the problem. This forms the basis of any extended analysis.
- Are the routing tables correct? Check with "show ip route".
- Is the bandwidth being saturated? FTP and email are bandwidth killers and the usual suspects.
- Spanning tree. Spanning tree must be set up in a deterministic fashion and not left at its defaults. And hubs in a switched network are disasters waiting to happen. Also make sure a techie hasn't left a SPAN port enabled and then reallocated it later.
- QoS settings. Have the correct bandwidth allocations been made and are they consistent end to end?
- Hacking and pseudo hacking. You have hackers and then those that pretend to be hackers. Those vulnerability scans often cause more trouble than they are trying to prevent. Death by firing squad at dawn is the only punishment for Infosec folk doing vulnerability scans across a WAN link, especially during the day.
- Service provider finger pointing. Never trust a carrier or service provider when their lips are moving. I should know...
- Name resolution. Is name resolution working correctly?
- Complexity. Often network engineers try to show the worth of their big pay packages by designing complex networks. The true worth of a good design lies in it being normalized and reduced to its simplest form. A simple network is less likely to go tits up.
- Broadcasts. In many cases too many nodes are installed into a single VLAN or broadcast domain. Has the LAN been correctly structured and designed?
- Pre-empt the issue. Fundamentally, this requires a good network configuration management tool and continuous reviews. Is this being done proactively?
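To see why the usage graph in the packet-drop item can mislead, here is a minimal Python sketch. The traffic pattern is invented for illustration (a 0.5 Mb/s baseline with a 5-second burst of 3 Mb/s every minute), and the model assumes no buffering at all, so anything above the 2 Mb/s link rate in a given second is simply dropped. Real links have queues, so treat the numbers as an illustration and not a measurement. The function names are hypothetical.

```python
LINK_RATE_MBPS = 2.0  # capacity of the WAN link from the checklist example


def summarize(per_second_mbps, link_rate_mbps=LINK_RATE_MBPS):
    """Return (average load in Mb/s, seconds over capacity, megabits dropped).

    Assumes no buffering: load above the link rate in any one-second
    sample is counted as dropped.
    """
    average = sum(per_second_mbps) / len(per_second_mbps)
    over = [load for load in per_second_mbps if load > link_rate_mbps]
    dropped = sum(load - link_rate_mbps for load in over)
    return average, len(over), dropped


def hypothetical_hour():
    """Made-up traffic: 0.5 Mb/s baseline, 3 Mb/s for the first 5 s of each minute."""
    return [3.0 if second % 60 < 5 else 0.5 for second in range(3600)]


if __name__ == "__main__":
    avg, seconds_over, dropped = summarize(hypothetical_hour())
    print(f"average load over the hour: {avg:.2f} Mb/s")
    print(f"seconds over link capacity: {seconds_over}")
    print(f"megabits dropped (no buffering): {dropped:.0f}")
```

With this made-up pattern the hour averages roughly 0.71 Mb/s, comfortably under the link rate, yet 300 of the 3600 seconds exceed it. That is exactly the situation an averaged graph hides.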
This is a shorter version of the more extensive Network Troubleshooting Checklist.
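As a companion to the black holes item, the arithmetic behind tunnel-related MTU trouble can be sketched in a few lines of Python. This is a rough sketch under stated assumptions: IPv4 without options, a basic GRE header with no optional fields, and a standard 1500-byte Ethernet MTU. The function and dictionary names are hypothetical, and other encapsulations (IPsec in particular) have overheads that vary with configuration, so they are left out.

```python
IPV4_HEADER = 20  # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo request header

# Per-packet overhead in bytes for two common encapsulations.
ENCAP_OVERHEAD = {
    "gre_over_ipv4": 24,  # 20-byte outer IPv4 header + 4-byte basic GRE header
    "pppoe": 8,           # 6-byte PPPoE header + 2-byte PPP protocol field
}


def effective_mtu(link_mtu, encapsulations):
    """MTU left for the inner packet after subtracting each encapsulation's overhead."""
    return link_mtu - sum(ENCAP_OVERHEAD[name] for name in encapsulations)


def max_ping_payload(path_mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return path_mtu - IPV4_HEADER - ICMP_HEADER


if __name__ == "__main__":
    tunnel_mtu = effective_mtu(1500, ["gre_over_ipv4"])
    print(f"MTU inside a GRE tunnel: {tunnel_mtu}")
    print(f"largest ping payload, plain Ethernet: {max_ping_payload(1500)}")
    print(f"largest ping payload, through the tunnel: {max_ping_payload(tunnel_mtu)}")
```

On Linux, something like `ping -M do -s 1472 <host>` sends an echo of that payload size with fragmentation prohibited. If 1472 fails on a path where a smaller size such as 1448 works, something along the way is eating into the MTU.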