We've had a number of discussions at Veeya lately about raising folks up in technology troubleshooting, which we've defined as the ultimate mastery of a skill. I just experienced this situation (might make a great video...)...and I'm asking you right from the start: how would you use this to TRAIN someone? Or is the only method to create a physical lab environment where you put something like this in there (ugh...painful to create)?
So, I welcome you to the inner monolog of my brain in troubleshooting this issue.
Situation: Partial Network Outage - multiple devices monitored at a site in Paessler GmbH's PRTG go down: three WAPs, an access card reader, a switch, and A/V equipment.
Initial "gut level" Reaction: switch failure or network loop.
Troubleshooting Process (not saying this is right, this is just what I did):
- Physically go to the switch. It's running.
- Unplug one of the "down" WAPs on the switch and plug it back in. It comes up for a moment in PRTG and goes back down.
- Check the bandwidth utilization on the switch uplink port (it's a Ubiquiti Inc. switch, so I just used the handy little display) - it's a steady 35-45 Mbps. Because the site I was at had few users at the time (pre-sunrise, early morning), there should be next to no traffic at all... perhaps IP cameras? Not enough to generate that.
- I suspect a loop of some sort; connect a laptop to the switch (Ethernet cable) and open Wireshark...tell me happy sniffer, what's going on here? Immediately the display floods with data. Within 10 seconds, I'm at 486-some thousand packets.
- A weird thought shoots through my mind. A Jedi knight waving his hand and saying, "These are not the network loops you were looking for." What is this? mDNS flood? But it's all from one device: 192.168.1.163? Who is that?
- Then I realize, I don't care who that is. They appear to be causing the issue and must die.
- I ping the IP address from my Wireshark laptop, determine the MAC address. View the MAC address tables on the switches to track it down to a port.
- Because I'm onsite, I literally unplug the device from the port (pull the patch cable out of the switch).
- BOOM. My phone starts buzzing. PRTG reports two of the WAPs, A/V gear, door reader...have all come back online. One WAP didn't. I reboot it (another PoE cable pull/plug-in). The final WAP returns online.
- I think to myself (not sure how self-thoughts work in inner monolog...so...), "Wait - there could be other things that were taken offline (or partially degraded) by the mDNS flood..." So, I reboot the affected switch entirely.
- Okay...network running again. Good. Now I need to trace this back to the root problem: What on earth was connected to the network cable I unplugged? I trace it down...WHAT?!? A Windows PC??? Huh? ChatGPT...help me here.
- I paste a screencap of the Wireshark capture above and ask ChatGPT to tell me what could be running on the Windows PC that could cause this. It replies...
- My inner monolog thinks to itself (is that possible?): "I'm a network guy...even IF I were to get on this PC and uninstall whatever Apple HomeKit doohickey is doing this, what's to say someone else doesn't bring in another device that does the same thing? The network needs to protect itself from this."
- I ask ChatGPT how many mDNS queries would be expected from a PC in a midsize network environment. It answers, "no more than 50 Packets per second."
- Okay. ChatGPT, what would that translate to in kbps? It mumbles some nonsense about average mDNS packet size, number of services, bla bla bla... and finally states, "80kbps is your huckleberry." (okay, I added the huckleberry part...Reminds me to watch a little Tombstone this weekend...Upon writing this, I immediately modify my ChatGPT profile to answer all questions in the tone of Val Kilmer from Tombstone...that will probably last the rest of the day before I stop giggling and get annoyed).
- I go into the UniFi controller and create a Traffic Shaping rule called "mDNS Rate Limit" that throttles multicast traffic to 224.0.0.251 down to 80 kbps. Apply it to ports connected to end devices. Not sure if this will cause issues in the future, but it'll do for now.
- Lastly, for good measure, I add a registry key to the source Windows 11 PC running Apple HomeKit "whatever" (still no idea of the root application) to disable Multicast DNS that I found here: mDNS - The informal informer | f20. Works like a champ - no more mDNS from that PC.
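For anyone who wants to sanity-check the numbers in the steps above, here is a minimal sketch of the arithmetic. The one assumption is the average mDNS packet size: roughly 200 bytes is simply the figure implied by going from 50 packets per second to 80 kbps, not something I measured. The function and constant names are made up for illustration.

```python
def pps_to_kbps(packets_per_second: float, avg_packet_bytes: float) -> float:
    """Convert a packet rate to kilobits per second (1 kbps = 1000 bits/s)."""
    return packets_per_second * avg_packet_bytes * 8 / 1000


# Figures taken from the write-up.
EXPECTED_PPS = 50          # ChatGPT's ceiling for mDNS from one PC
FLOOD_PACKETS = 486_000    # packets Wireshark showed...
FLOOD_SECONDS = 10         # ...within this many seconds

# Assumption: average mDNS packet size implied by 50 pps -> 80 kbps.
ASSUMED_PACKET_BYTES = 200

limit_kbps = pps_to_kbps(EXPECTED_PPS, ASSUMED_PACKET_BYTES)
flood_pps = FLOOD_PACKETS / FLOOD_SECONDS
flood_factor = flood_pps / EXPECTED_PPS

print(f"rate limit: {limit_kbps:.0f} kbps")
print(f"flood rate: {flood_pps:.0f} packets/s")
print(f"flood vs expected: {flood_factor:.0f}x")
```

In other words, the capture was running at nearly a thousand times the packet rate ChatGPT called normal for a single PC, which is why a limit in the neighborhood of 80 kbps leaves ordinary mDNS alone while choking off a flood like this one.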
Total troubleshooting time: 1.5 hours (or so...I grabbed breakfast after the network came back up - strange how appetite is directly related to network outage status).
...and now I ask you again: how would you use this to TRAIN someone? Or is the only method to create a physical lab environment where you put something like this in there (ugh...painful to create)?
Penny for your thoughts! ...or as Doc Holliday would say, "Evidently Mr. Ringo's an educated man. Now I really hate him."
System Administrator
(2 months ago) Well, I don’t think you need to train someone how to think, but instead help them understand how systems work so they can go and gather clues for themselves. For example, if a person doesn’t know Wireshark exists, they may never find the solution. So it’s just allowing people to know the technology and what is available to them.
Firstly, thank you for the writeup. These kinds of situations are gold for any Network Engineer (though we may hate them at the time!) and we all end up learning invaluable lessons from them each time. Your post allows us to learn from your experience too. As for your question on teaching this, I'd say it's easier than you think. What you did here was start at the lower layers of the OSI model. You determined that the switch was up, you determined that it was at least locally functional, then you went to capture packets. That's where you saw the flood of traffic, which set off alarm bells and prompted deeper (AI powered!) investigation. I'd say you pretty much followed the OSI model to a tee while troubleshooting, and that saved you from wasting time elsewhere. This means that to teach this, you would just teach people to troubleshoot in exactly the same manner: physical, layer 2, then packet capturing. Most of the time you will corner the issue there. This simple template will solve many issues or at least provide a solid foundation for more troubleshooting and investigation. But I have no doubt you already know this! Solid idea for showing ChatGPT a screenshot of your Wireshark capture though, I'm stealing that one.
Network Engineer at Living Faith Church Worldwide
(6 months ago) This is such an interesting read. I have learnt a thing or two. Kudos.
Information Technology Specialist @ Accurate Controls, Inc. | User Support, Network and Systems Administration
(6 months ago) For a novice, I think the first steps are the most important to really digest. What do you think could be the problem, and how can you check it? After a while, and with more understanding of each system, your initial gut-check of what the problem might be gets sharper and sharper. Until it is completely wrong and you are on one of those multi-day rabbit holes.
Cybersecurity // Customer Care
(6 months ago) Insightful as always, Mr. JC. I love to see you embracing AI. I respect your experience and teaching skills. This is right and you did. As you said, it isn’t right, it’s what you did. But it is right, for you and for others who agree with you and, most importantly, can develop troubleshooting skills. In my experience, people in general don’t even have the guts to start this journey, the marvelous path of troubleshooting. It doesn’t need to be perfect and get the situation fixed in seconds. But it should ultimately help with understanding: What is the issue? What is it that we want to achieve? What tools can we use to achieve our goal? We want to make sure others understand our thought process and can benefit from the lessons learned within our troubleshooting steps. Thanks for sharing this and asking the question. I would love to see all this as a video nugget. I've lost count of the hours I've enjoyed watching you. You are hilarious and wise. I miss the stories about the guys in blue shirts and your house with Cisco tech all over, kitchen included. Keep it up, JC. Love you!